Using Backblaze’s hard drive data, on delenn at
/bigdata/data/backblaze, write a report with the following analyses:
- Produce a graph of total drive capacity (in PB) of non-failing drives per day over all days 2013-2016 (the 2013 data starts in April, btw). You’ll see some anomalies (very low/high values), don’t worry about that (but see the notes below about very large drives).
- What is the annual (not daily) failure rate of drives for years 2013-2016? Compute annual failure rate according to the description (PDF) provided by Backblaze. Show with a plot with the annual failure rate for each year.
- What is the annual failure rate, on average, for each drive manufacturer, across all years? Find the manufacturer by extracting the first characters of the model name, up to but not including the first space or digit. Show with a bar chart, with the bar heights increasing left-to-right.
- Bin the SMART 187 raw values into the same bins as shown in Backblaze’s report (first plot). Use the
cutR command as described in the R cookbook. Then plot the annual failure rate per bin, with increasing bin values left-to-right. You may omit records from drives that report NA for the SMART metric.
- Argue with an appropriate statistical test that the SMART 187 raw value (read errors) is higher on average for failing drives than non-failing drives; i.e., that this SMART metric is a good indicator of imminent failure.
- Do not add the data files to git. Automatic failing grade if you do so. The internet weeps.
- Since this project allows groups of two, you may want to divide the tasks listed above and work on them in parallel.
- Create a single repository called
cinf401-project-1, and indicate in a README the members of the group.
- Eliminate drives with capacity > 1e14 since these are bogus values.
- Write notes about how you processed the data files, if you did so outside of R. Show the commands used.
- Write a report in RMarkdown, hiding all unnecessary R code. You absolutely must avoid dumping large tables of numbers. Show only small tables when necessary, otherwise show plots or write English.
- Include the RMarkdown source and rendered form in your repository. This way I don’t have to reprocess everyone’s data files to generate the report.