# Project 1

Using Backblaze’s hard drive data, on delenn at /bigdata/data/backblaze, write a report with the following analyses:

• Produce a graph of the total drive capacity (in PB) of non-failing drives per day, for every day from 2013 through 2016 (note that the 2013 data starts in April). You’ll see some anomalies (very low or very high values); don’t worry about those (but see the notes below about very large drives).
• What is the annual (not daily) failure rate of drives for each of the years 2013-2016? Compute the annual failure rate according to the description (PDF) provided by Backblaze. Show the results in a plot of the annual failure rate for each year.
• What is the annual failure rate, on average, for each drive manufacturer, across all years? Find the manufacturer by extracting the first characters of the model name, up to but not including the first space or digit. For example, ST4000DM000 gives ST, and Hitachi HDS722020ALA330 gives Hitachi. Show the results in a bar chart, with bar heights increasing left to right.
• Bin the SMART 187 raw values into the same bins as shown in Backblaze’s report (first plot). Use R’s cut command, as described in the R cookbook. Then plot the annual failure rate per bin, with bin values increasing left to right. You may omit records from drives that report NA for the SMART metric.
• Argue, using an appropriate statistical test, that the SMART 187 raw value (read errors) is higher on average for failing drives than for non-failing drives; i.e., that this SMART metric is a good indicator of imminent failure.
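
To make the mechanics of the analyses above concrete, here is a minimal R sketch on a tiny made-up data frame. It is not a solution, and several things in it are assumptions: the column names model, failure, and smart_187_raw are taken to match the Backblaze CSV headers; the afr helper assumes the annual failure rate is failures per drive-year expressed as a percentage (check this against the Backblaze description); the bin edges are placeholders, not the bins from Backblaze's plot; and the Welch t-test is only one candidate test, which you would still need to justify.

```r
# Toy stand-in for the Backblaze daily records: one row per drive per day.
# Column names are assumed to match the CSV headers; every value is made up.
drives <- data.frame(
  model = c("ST4000DM000", "Hitachi HDS722020ALA330", "WDC WD30EFRX",
            "ST3000DM001", "HGST HMS5C4040ALE640", "TOSHIBA DT01ACA300"),
  failure = c(0, 0, 0, 1, 0, 1),
  smart_187_raw = c(0, 0, 3, 40, NA, 120),
  stringsAsFactors = FALSE
)

# Manufacturer: drop everything from the first space or digit onward.
drives$manufacturer <- sub("[ 0-9].*$", "", drives$model)

# Annual failure rate in percent, ASSUMING failures per drive-year;
# verify against the Backblaze description before relying on it.
afr <- function(failures, drive_days) 100 * failures / (drive_days / 365)

# Each row is one drive-day, so a group's row count is its drive-days.
by_mfr <- aggregate(failure ~ manufacturer, data = drives,
                    FUN = function(f) afr(sum(f), length(f)))
names(by_mfr)[2] <- "afr"
by_mfr <- by_mfr[order(by_mfr$afr), ]  # ascending, for left-to-right bars

# Placeholder bin edges (hypothetical); replace with the report's bins.
edges <- c(-Inf, 0, 1, 10, 100, Inf)
drives$bin <- cut(drives$smart_187_raw, breaks = edges)

# One-sided Welch t-test: is the mean higher on failing drive-days?
# t.test drops NA values from each sample before computing.
failing <- drives$smart_187_raw[drives$failure == 1]
healthy <- drives$smart_187_raw[drives$failure == 0]
result <- t.test(failing, healthy, alternative = "greater")
```

The rates this produces on six toy rows are meaningless; the point is only the shape of each step. Note that cut uses right-closed intervals by default, so a value equal to a bin edge falls into the lower bin unless you pass right = FALSE.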

• Create a single repository called cinf401-project-1, and list the members of the group in a README.