Project 5

Define your own data analysis problem. See below for requirements. Submit your final code and report via git by 5pm, May 6. During our “final exam time” (May 6, 5-7pm), you and your partner (if you have one) are required to present your findings. Your grade will depend on the quality of your report and presentation and on the appropriateness and insight of your analysis. As a final requirement, you must complete the “decision matrix” (below), which shows how you decided to use certain tools for your analysis. You will explain your decision matrix during your presentation.

This project is 25% of your grade.


Decision matrix

Download or recreate this table and mark each cell in which you used a specific tool for a specific task during your analysis. Be prepared to explain during your presentation why you chose that tool for that task rather than one of the other tools.

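| Task                      | Unix tools | Excel | R | MySQL | BigQuery | MapReduce | Spark | Spark MLlib | Weka | OpenCV |
|---------------------------|------------|-------|---|-------|----------|-----------|-------|-------------|------|--------|
| Data acquisition          |            |       |   |       |          |           |       |             |      |        |
| Exploratory analysis      |            |       |   |       |          |           |       |             |      |        |
| SQL-like queries          |            |       |   |       |          |           |       |             |      |        |
| Distributed workers       |            |       |   |       |          |           |       |             |      |        |
| Numeric/string processing |            |       |   |       |          |           |       |             |      |        |
| Machine learning          |            |       |   |       |          |           |       |             |      |        |
| Image processing          |            |       |   |       |          |           |       |             |      |        |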
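If you would rather recreate the table than download it, one option is to generate it as a CSV file that opens in Excel or R. The following is only a minimal sketch, not a required format: the function names `build_matrix` and `to_csv` and the example choices are hypothetical, and only the task and tool names come from the decision matrix above.

```python
import csv
import io

# Task and tool names copied from the decision matrix above.
TASKS = [
    "Data acquisition",
    "Exploratory analysis",
    "SQL-like queries",
    "Distributed workers",
    "Numeric/string processing",
    "Machine learning",
    "Image processing",
]
TOOLS = [
    "Unix tools", "Excel", "R", "MySQL", "BigQuery",
    "MapReduce", "Spark", "Spark MLlib", "Weka", "OpenCV",
]


def build_matrix(choices):
    """Return table rows (header first) with an X for each (task, tool) pair used."""
    rows = [["Task"] + TOOLS]
    for task in TASKS:
        marks = ["X" if (task, tool) in choices else "" for tool in TOOLS]
        rows.append([task] + marks)
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Hypothetical choices, for illustration only; replace with your own.
example_choices = {
    ("Data acquisition", "Unix tools"),
    ("Machine learning", "Weka"),
}
matrix_csv = to_csv(build_matrix(example_choices))
print(matrix_csv)
```

Keeping the choices as a set of (task, tool) pairs makes it easy to update the matrix as your analysis changes; write `matrix_csv` to a file if you want to open it in a spreadsheet.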

For example, here are my choices for the Stars assignment:

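| Task                      | Unix tools | Excel | R | MySQL | BigQuery | MapReduce | Spark | Spark MLlib | Weka | OpenCV |
|---------------------------|------------|-------|---|-------|----------|-----------|-------|-------------|------|--------|
| Data acquisition          | X          |       |   |       |          |           |       |             |      |        |
| Exploratory analysis      | X          |       |   |       |          |           |       |             |      | X      |
| Plotting                  |            |       | X |       |          |           |       |             |      |        |
| SQL-like queries          |            |       |   |       |          |           |       |             |      |        |
| Distributed workers       |            |       |   |       |          |           | X     |             |      |        |
| Numeric/string processing |            |       |   |       |          |           | X     |             |      |        |
| Machine learning          |            |       |   |       |          |           |       |             |      |        |
| Image processing          |            |       |   |       |          |           |       |             |      | X      |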

My reasoning follows:

Grading rubric

This project is worth 25% of your overall grade. Your grade for this project is broken down as described below. Note that if I have reason to believe you contributed less than your partner, your grade will be lower.

| Component       | Portion of grade | Criteria |
|-----------------|------------------|----------|
| Report          | 40%              | Well-written English, appropriateness of analysis, evidence for your conclusions, inclusion of clear and simple plots and/or tables, all code hidden |
| Presentation    | 40%              | Easy to understand, evidence for your conclusions, clear visuals, ability to answer questions |
| Decision matrix | 20%              | Completeness and ability to answer questions |
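To see how the three components combine, here is a small arithmetic sketch. It assumes each component is scored on a 0-100 scale and that the components are combined as a simple weighted average; neither assumption is stated in the rubric, and the scores shown are made up. The function name `project_score` is hypothetical.

```python
# Weights copied from the rubric table above.
WEIGHTS = {"Report": 0.40, "Presentation": 0.40, "Decision matrix": 0.20}


def project_score(component_scores):
    """Weighted average of component scores (each assumed to be on a 0-100 scale)."""
    if set(component_scores) != set(WEIGHTS):
        raise ValueError("expected exactly one score per rubric component")
    return sum(WEIGHTS[name] * score for name, score in component_scores.items())


# Hypothetical scores, for illustration only.
scores = {"Report": 90, "Presentation": 80, "Decision matrix": 100}
total = project_score(scores)

# The project is 25% of the overall grade, so this is its contribution
# to the course total under the same linear assumption.
course_points = 0.25 * total

print(round(total, 1), round(course_points, 1))
```

Under these assumptions, the report and presentation together account for 80% of the project score, so a weak decision matrix costs at most 20 of the 100 project points.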

Notice that you are not graded on your code. You must submit it, but its quality does not affect your grade.

CINF 401 material by Joshua Eckroth is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Source code for this website available at GitHub.