Cookbook: Data sources
There are quite a number of “lists of data sources.” Here are some:
- A list of data sources as a GitHub repository.
- Kaggle has several competitions with corresponding datasets.
Go to Data Tools and Apps at census.gov and look at the top: The American FactFinder. Click “Download Center” and then the button with the same text. Follow the directions. You’ll ultimately download a ZIP with some CSV files.
National Climatic Data Center
- Storm events – Use the “Select State” or search feature on the bottom of this page, then choose your dates and weather types. After these selections, you’ll be given a table. On the top-left of that table is a CSV download link.
- Historical Weather – Choose a location and date (on the next page, you can choose a range of dates). The next page shows some weather stats. Click the top tab “Custom” to give a range of dates (the range can be specified in the drop-downs above the tabs), then click the “Get History” button to update. At the bottom of this page, there is a link for the CSV file. [Note: As of Feb 3, 2015, this site is broken.]
- Pokemon stats: pokemon.csv (Contributed by Jacob Hell.)
[TODO; any suggestions?]
- nfldb, a Python library that downloads NFL data and saves to a CSV file. Found by George Robbins.
Contributed by Marisa Gomez.
Major league baseball data can be found in the pitchRx R package.
IPEDS Data Center
Visit the IPEDS Data Center. IPEDS provides data about US universities.
- Choose “Compare Institutions”
- Choose “Final release data” if asked
- Select institutions “By Groups > EZ Group”. Choose a group, e.g., US Only. Click “Search”.
- Next, select variables with the “By Variables” link above the table. Choose “Browse / Search Variables”.
- Drill down to find variables. Be sure to select each year, and select your variables for that year. Then click “Continue”.
- When ready, click “Output” at the top of the menu bar. You should see a table of CSV links.
- If asked “Include imputation variables?” at the top of the output table, check “No” (i.e., do not replace missing data with the most likely value based on the rest of the data).
- IRS Tax Stats – Choose the type of stats you’re interested in; this may require several pages of drill-down. Eventually, you should come to a page of Excel file links. You can download a few you want, or download them all as we did in the Student Loans demo.
Lending Club loans
Lending Club is a “marketplace” where regular people lend money to other regular people. Lending Club has produced a set of data for loans made and loan requests rejected.
Later, we’ll (possibly) use this data in Hadoop. It is accessible on HDFS under:
If you want to download all the data, use these commands on delenn:
wget --no-check-certificate https://resources.lendingclub.com/LoanStats3a.csv.zip
wget --no-check-certificate https://resources.lendingclub.com/LoanStats3b.csv.zip
wget --no-check-certificate https://resources.lendingclub.com/LoanStats3c.csv.zip
wget --no-check-certificate https://resources.lendingclub.com/RejectStatsA.csv.zip
wget --no-check-certificate https://resources.lendingclub.com/RejectStatsB.csv.zip
wget --no-check-certificate https://resources.lendingclub.com/LCDataDictionary.xlsx
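Several of the downloads on this page arrive as ZIP archives containing CSV files. The following is a minimal sketch of reading such a file without unpacking it first, using only the Python standard library. The helper name `read_csv_from_zip`, the UTF-8 encoding, and the demo column names are assumptions made for illustration; they are not taken from the Lending Club files.

```python
import csv
import io
import zipfile


def read_csv_from_zip(zip_source, member=None):
    """Return the rows of one CSV file inside a ZIP archive as lists of strings.

    zip_source may be a path or a file-like object. If member is None, the
    first entry whose name ends in ".csv" is used.
    """
    with zipfile.ZipFile(zip_source) as archive:
        if member is None:
            csv_names = [n for n in archive.namelist() if n.lower().endswith(".csv")]
            if not csv_names:
                raise ValueError("no CSV file found in archive")
            member = csv_names[0]
        with archive.open(member) as raw:
            # Assumption: the file is UTF-8 encoded; adjust if it is not
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            return list(csv.reader(text))


# Demonstration with a tiny in-memory archive; the column names are made up
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.csv", "id,amount\n1,5000\n2,12000\n")
buffer.seek(0)
rows = read_csv_from_zip(buffer)
```

With a real download, the same helper could be called with a path, e.g. `read_csv_from_zip("LoanStats3a.csv.zip")`. Real files may contain extra header or footer lines that need to be skipped.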
Google has released data about their cluster machines. They have some questions they’d like researchers to answer.
We might look at these data when we learn about Hadoop.
GitHub provides a continuous stream of events from its millions of hosted projects. Look at their overview page for details.
The Million Song dataset contains detailed information about songs. It comes in HDF5 format, which is difficult for us to process with Hadoop.
Kaggle hosted a competition using these data. You were asked to predict which songs a user would enjoy, based on their ratings of other songs.
Text data requires special processing. It’s not trivial to compare two paragraphs or books of text. You must transform the text first, typically into numeric form. One way of doing this is the “bag of words” approach, where you convert each document into a vector of word counts. Position 1 of the vector might correspond to the word “the”, and the value stored there would be the number of times “the” appeared in the document. Position 2 might correspond to “furlough”, etc. These vectors would be very high-dimensional, e.g., 10,000 dimensions to represent the various words encountered. To compare multiple documents against each other, they must all use the same vector–word correspondence. If some document does not contain the word “furlough”, it would just have a 0 in its vector in that dimension.
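The bag-of-words idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the tokenizer just lowercases and keeps runs of letters and apostrophes, the vocabulary is sorted alphabetically (so “the” is not at position 1 here), and the function names and the two sample documents are made up.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Map every distinct word across all documents to a fixed vector position."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def to_vector(document, vocabulary):
    """Return the word-count vector for one document under a shared vocabulary."""
    counts = Counter(tokenize(document))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:
            vector[vocabulary[word]] = count
    return vector


docs = [
    "The furlough ended and the staff returned.",
    "The staff met on Monday.",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(doc, vocab) for doc in docs]
```

Both vectors have the same length because they share one vocabulary. The second document has a 0 at `vocab["furlough"]`, since that word does not appear in it.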
When Enron broke up, the US government released all their emails. They are available here. There are about 500k messages from about 150 people, mostly senior management.
The UCI Machine Learning Repository has many datasets. One contains SMS spam and non-spam (“ham”) messages. SMS messages are text messages. A good use of this dataset is to train machine learning models to recognize SMS spam.
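As one possible illustration of training a spam recognizer, here is a small naive Bayes classifier over word counts. Naive Bayes is chosen here only as a simple example technique; it is not prescribed by the dataset. The training messages below are made up and are not taken from the UCI data, and the function names are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(examples):
    """Count words per label. examples is a list of (label, text) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, text in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    label_counts, word_counts, vocabulary = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total_examples)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            # Add-one (Laplace) smoothing so unseen words do not zero out the score
            numerator = word_counts[label][word] + 1
            score += math.log(numerator / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training messages, not taken from the UCI dataset
examples = [
    ("spam", "Win a free prize now"),
    ("spam", "Free entry claim your prize"),
    ("ham", "Are we meeting for lunch today"),
    ("ham", "See you at lunch"),
]
model = train(examples)
```

Calling `classify("free prize", model)` picks the label whose training messages make those words most likely. A real experiment would train on one part of the dataset and measure accuracy on a held-out part.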
Westbury Lab Usenet
Westbury Lab has released a dump of Usenet data (Usenet is like web forums; it began before there was the web). It’s about 37GB uncompressed.
We might look at it when we learn about Hadoop. It’s stored in HDFS at
Project Gutenberg ebooks
Download the book metadata in RDF format:
Then download all the books (English, TXT format only):
wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes=txt&langs=en"
That command waits 2 seconds between each request. This is recommended by Project Gutenberg. With 45k+ books, the crawl takes some time.
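If you script the download yourself instead of using wget, the same courtesy delay can be kept. The sketch below is a minimal, hypothetical helper (`fetch_politely` is not part of any library) that pauses between requests; the fetch and sleep functions are passed in so the example runs without touching the network.

```python
import time


def fetch_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between requests.

    No pause is added before the first request or after the last one.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Demonstration with stand-ins so that nothing is downloaded or delayed
pauses = []
pages = fetch_politely(
    ["http://example.invalid/a", "http://example.invalid/b", "http://example.invalid/c"],
    fetch=lambda url: "contents of " + url,
    sleep=pauses.append,
)
```

In a real script, `fetch` could be something like `lambda url: urllib.request.urlopen(url).read()`, and `sleep` would be left at its default of `time.sleep`.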
Google has also released US trademark data.