# Project 2

StackExchange is a multi-site platform for asking and answering questions. Users earn reputation when their posts (questions and answers) are upvoted (details here).

## Setup

Create a single repository called cinf401-project-2 and list the members of your group in a README. Submit your Java source code, your R code, and your R Markdown. Do not include your data files.

All data (users, posts, etc.) were acquired from this torrent and uploaded to HDFS under /data/stackexchange (web view). Have a look at /bigdata/data/stackexchange/readme.txt (on delenn, not HDFS) to understand the fields in the various XML files.
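As a starting point for reading the XML files, here is a minimal sketch of pulling attributes out of one record. It assumes each record sits on its own line as a `<row ... />` element with double-quoted, XML-escaped attribute values; check readme.txt and a sample of the files before relying on that. The class name `RowParser` is hypothetical, and the code is plain Java so it can be tried without a cluster; in a real job this logic would be called from your mapper.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts attribute name/value pairs from one row line. */
final class RowParser {
    // Matches name="value" pairs. Assumes values never contain a raw double quote
    // (they would be written as &quot; in well-formed XML).
    private static final Pattern ATTR = Pattern.compile("([A-Za-z]+)=\"([^\"]*)\"");

    private RowParser() {}

    /** Returns the attributes of a row line, or an empty map for any other line. */
    static Map<String, String> parse(String line) {
        Map<String, String> attrs = new LinkedHashMap<>();
        String trimmed = line.trim();
        if (!trimmed.startsWith("<row ")) {
            return attrs; // XML declaration, wrapper element, or blank line
        }
        Matcher m = ATTR.matcher(trimmed);
        while (m.find()) {
            attrs.put(m.group(1), m.group(2));
        }
        return attrs;
    }
}
```

Values come back still XML-escaped (for example `&lt;` rather than `<`), so unescape any field whose content you need to inspect.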

Using Hadoop and MapReduce, plus minimal R processing, complete the tasks below.

What is the age distribution of users across all sites as a whole? Be sure each unique user (identified by the AccountId field) is counted only once, even if they appear on more than one site. Show your findings with an appropriate plot. Ensure your plot has easy-to-understand axis labels and that all labels are readable.
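One way to think about the de-duplication is as two steps: first collapse everything to one age per AccountId, then count accounts per age. The sketch below shows that logic in plain Java with in-memory collections rather than Hadoop types, so it is only an illustration of the idea, not a finished job. The names `AgeDistribution` and `UserAge` are hypothetical, and keeping the largest age when one account reports different ages on different sites is an arbitrary assumption; pick and justify your own rule.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of "one age per account, then count per age". */
final class AgeDistribution {
    /** One (AccountId, Age) pair, as a mapper might emit for a user row. */
    static final class UserAge {
        final String accountId;
        final int age;

        UserAge(String accountId, int age) {
            this.accountId = accountId;
            this.age = age;
        }
    }

    private AgeDistribution() {}

    /** Step 1: keep a single age per AccountId (assumption: the largest one seen). */
    static Map<String, Integer> onePerAccount(List<UserAge> emitted) {
        Map<String, Integer> ageByAccount = new HashMap<>();
        for (UserAge u : emitted) {
            ageByAccount.merge(u.accountId, u.age, Integer::max);
        }
        return ageByAccount;
    }

    /** Step 2: count how many distinct accounts have each age, sorted by age. */
    static Map<Integer, Integer> histogram(Map<String, Integer> ageByAccount) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int age : ageByAccount.values()) {
            counts.merge(age, 1, Integer::sum);
        }
        return counts;
    }
}
```

In MapReduce terms, step 1 corresponds to keying on AccountId so that all records for one account reach the same reducer; step 2 is an ordinary count keyed on age, whose small output is what you would hand to R for plotting.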

What are the top 10 tags (by number of posts) for each StackExchange subsite? Show the results as a list or table for each subsite, ordered alphabetically by subsite name.
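The tag task comes down to splitting each post's tag field into individual tags, counting them per subsite, and keeping the 10 most frequent. Below is a minimal sketch of the splitting and ranking steps in plain Java. It assumes the tag field looks like `&lt;java&gt;&lt;hadoop&gt;` (tags wrapped in escaped angle brackets); verify this against readme.txt and the actual files. The class name `TopTags` and the alphabetical tie-break are hypothetical choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of tag splitting and top-N selection for one subsite. */
final class TopTags {
    private static final Pattern TAG = Pattern.compile("<([^<>]+)>");

    private TopTags() {}

    /** Splits an escaped tag field such as "&lt;java&gt;&lt;hadoop&gt;" into tag names. */
    static List<String> splitTags(String tagsAttribute) {
        String unescaped = tagsAttribute.replace("&lt;", "<").replace("&gt;", ">");
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(unescaped);
        while (m.find()) {
            tags.add(m.group(1));
        }
        return tags;
    }

    /** Returns up to n tags, most frequent first; ties are broken alphabetically. */
    static List<String> topN(Map<String, Integer> countsForOneSite, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(countsForOneSite.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

For example, `TopTags.topN(counts, 10)` would be called once per subsite on that subsite's tag counts. If the per-subsite results are kept in a `TreeMap` keyed by subsite name, iterating over it yields the subsites in alphabetical order.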

Store your MapReduce output in HDFS under a file path like /users/jeckroth/gp2/... (replacing jeckroth with your own username).