Assignment 3

The NYC Taxi Commission has published trip data for several years, found on delenn in /bigdata/data/taxi and /ssd/data/taxi (equivalent copies of the data; the former on RAID spinning disks, the latter on an SSD). The data is stored in CSV files named yellow_tripdata_2*.csv plus some compressed versions (gzip and lz4). The fifth column of the CSV files contains the distance traveled (in miles). In total, there are 1.4 billion trips (rows) in the data.

Find the fastest way to compute the total (sum) traveled distance across all years. The answer should be 7140380300 or equivalently 7.14038e+09. If you create data files as part of your procedure, the time required to do so must be included in your reported benchmark times. You can create your own intermediate files in /bigdata/data/users/[your username] or /ssd/users/[your username].

Submit code and benchmark results. Report on at least three significantly different techniques (some slow, some fast). Be sure to run a proper benchmark for each: repeat the benchmark three times and report the mean time, ensure the system is not otherwise under load, and report wall time (not system/cpu time).

For full credit, find at least one technique that computes the answer in less than 20 minutes. (FYI, the best time I managed was about 11 minutes.)

CINF 401 material by Joshua Eckroth is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Source code for this website available at GitHub.