Cookbook: Data sources

Other lists

There are quite a number of “lists of data sources.” Here are some:

US Census

Go to Data Tools and Apps at census.gov and choose American FactFinder at the top of the page. Click “Download Center,” then click the button with the same label, and follow the directions. You’ll ultimately download a ZIP containing some CSV files.
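
Once you have the ZIP, you do not need to unpack it by hand. Here is a minimal Python sketch using only the standard library. The helper name `read_csvs_from_zip` is hypothetical, and the code assumes UTF-8 files with a single header row; the real Census CSVs may need extra handling (extra header rows, a different encoding).

```python
import csv
import io
import zipfile


def read_csvs_from_zip(zip_source):
    """Return {member_name: list of row dicts} for every CSV in a ZIP.

    zip_source may be a file path or a binary file-like object.
    """
    tables = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip readme or metadata files
            with archive.open(name) as raw:
                # Assumption: UTF-8 text with one header row per file.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                tables[name] = list(csv.DictReader(text))
    return tables
```

You would call it as `tables = read_csvs_from_zip("census_download.zip")`, where the file name is a placeholder for whatever the Download Center gives you.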

Weather

National Climatic Data Center

Weather Underground

Biology, wildlife

Health

[TODO; any suggestions?]

Sports

NFL

MLB

Contributed by Marisa Gomez.

Major League Baseball data is available in the pitchRx R package.

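library(pitchRx)
library(ggplot2)  # facet_grid, theme_bw, coord_equal, label_both come from ggplot2
data(pitches)     # sample pitch data bundled with pitchRx
head(pitches)
# Animate the pitches, one panel per pitcher and batter stance
animateFX(pitches, layer=list(facet_grid(pitcher_name ~ stand, labeller=label_both), theme_bw(), coord_equal()))
# Keep only called strikes and plot where they cross the plate, by batter stance
strikes <- subset(pitches, des == "Called Strike")
strikeFX(strikes, geom="tile", layer=facet_grid(.~stand))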

Education

IPEDS Data Center

Visit the IPEDS Data Center. IPEDS provides data about US universities.

Financial

IRS

Lending Club loans

Lending Club is a “marketplace” where regular people lend money to other regular people. Lending Club has published data on loans that were made and on loan requests that were rejected.

Later, we’ll (possibly) use this data in Hadoop. It is accessible on HDFS under: /datasets/lendingclub

If you want to download all the data, use these commands on delenn:

wget --no-check-certificate https://resources.lendingclub.com/LoanStats3a.csv.zip
wget --no-check-certificate https://resources.lendingclub.com/LoanStats3b.csv.zip
wget --no-check-certificate https://resources.lendingclub.com/LoanStats3c.csv.zip
wget --no-check-certificate https://resources.lendingclub.com/RejectStatsA.csv.zip
wget --no-check-certificate https://resources.lendingclub.com/RejectStatsB.csv.zip
wget --no-check-certificate https://resources.lendingclub.com/LCDataDictionary.xlsx

Internet census

In 2012, a group scanned the whole internet and produced a census. More info is available in their report, and this email message.

It can be downloaded here. It’s quite large: 568GB compressed and 9TB uncompressed. You may want to use their data browser to get an idea of what’s there.

Google Cluster

Google has released data about their cluster machines. They have some questions they’d like researchers to answer.

More information may be found here. The data is downloaded using Google’s gsutil tool, as described in this document.

We might look at these data when we learn about Hadoop.

GitHub Archive

GitHub provides a continuous stream of events from its millions of hosted projects. Look at their overview page for details.
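
To get a feel for the event stream, you can tally events by type. The sketch below assumes each archive file is gzipped text with one JSON event per line and a `type` field on every event; check the overview page for the current format. The function name `count_event_types` and the sample events are made up for illustration and stand in for a real downloaded file.

```python
import gzip
import io
import json
from collections import Counter


def count_event_types(gz_stream):
    """Count events by their "type" field in a gzipped JSON-lines stream."""
    counts = Counter()
    with gzip.open(gz_stream, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            event = json.loads(line)
            counts[event.get("type", "unknown")] += 1
    return counts


# Made-up sample standing in for one downloaded archive file.
sample_events = [
    {"type": "PushEvent", "repo": {"name": "example/a"}},
    {"type": "WatchEvent", "repo": {"name": "example/b"}},
    {"type": "PushEvent", "repo": {"name": "example/b"}},
]
payload = "\n".join(json.dumps(e) for e in sample_events).encode("utf-8")
sample_archive = io.BytesIO(gzip.compress(payload))
event_counts = count_event_types(sample_archive)
```

Reading line by line keeps memory use flat, which matters because real archive files are far larger than this sample.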

Million Songs

The Million Song dataset contains detailed information about songs. It comes in HDF5 format, which is difficult for us to process with Hadoop.

Kaggle hosted a competition using these data. You were asked to predict which songs a user would enjoy, based on their ratings of other songs.

Stack Exchange

Get a dump of the full Stack Exchange site with this torrent. You can also explore the data with their online query system.

Text

Text data requires special processing. Comparing two paragraphs or books of text is not trivial: you must first transform the text, typically into numeric form. One way of doing this is the “bag of words” approach, where you convert each document into a vector of word counts. Position 1 of the vector might correspond to the word “the”, and its value would be the number of times “the” appeared in the document. Position 2 might correspond to “furlough”, and so on. These vectors are very high-dimensional, with perhaps 10,000 dimensions, one for each distinct word encountered. To compare multiple documents against each other, they must all use the same vector–word correspondence. If a document does not contain the word “furlough”, it simply has a 0 in that dimension.
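
The bag of words idea can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function names and the two sample sentences are invented here, and the tokenizer simply lowercases the text and keeps runs of letters and apostrophes.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Map each distinct word across all documents to a fixed vector position."""
    words = sorted({word for doc in documents for word in tokenize(doc)})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(text, vocabulary):
    """Return the word-count vector for one document."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for word, count in counts.items():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] = count
    return vector


# Made-up sample documents sharing one vocabulary.
documents = [
    "The furlough ended and the office reopened.",
    "The office stayed open.",
]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]
# vectors[0] == [1, 1, 1, 1, 0, 1, 0, 2]
# vectors[1] == [0, 0, 0, 1, 1, 0, 1, 1]
```

Here the positions are assigned alphabetically (and, ended, furlough, office, open, reopened, stayed, the), which is an arbitrary choice; any ordering works as long as every document uses the same one. The second document never mentions “furlough”, so it has a 0 in that position.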

Enron emails

When Enron broke up, the US government released all their emails. They are available here. There are about 500k messages from about 150 people, mostly senior management.

SMS Spam

The UCI Machine Learning Repository has many datasets. One contains SMS spam and non-spam (“ham”) messages; SMS messages are text messages sent between phones. A good use of this dataset is to train machine learning models to recognize SMS spam.
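
Before training anything, you need the messages as (label, text) pairs. The sketch below assumes the file has one message per line, with the label (`ham` or `spam`), a tab character, and then the message text; verify this against the dataset's own documentation. The function name `parse_sms_lines` and the sample lines are invented for illustration.

```python
def parse_sms_lines(lines):
    """Split each 'label<TAB>message' line into a (label, message) pair."""
    pairs = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside a message survive.
        label, _, message = line.partition("\t")
        pairs.append((label, message))
    return pairs


# Made-up lines in the assumed format; with the real file you would
# pass an open file object instead of this list.
sample_lines = [
    "ham\tAre we still meeting at noon?\n",
    "spam\tWIN a FREE prize now! Reply YES\n",
    "ham\tRunning late, see you soon\n",
]
messages = parse_sms_lines(sample_lines)
spam_count = sum(1 for label, _ in messages if label == "spam")
```

From here, each message text can be turned into a numeric vector and fed to a classifier along with its label.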

Westbury Lab Usenet

Westbury Lab has released a dump of Usenet data (Usenet is like web forums; it began before there was the web). It’s about 37GB uncompressed.

We might look at it when we learn about Hadoop. It’s stored in HDFS at /datasets/westburylab-usenet.

Project Gutenberg ebooks

Project Gutenberg is a huge collection of scanned out-of-copyright books. You can download all of their books by following their instructions. This is the process:

Download the book metadata in RDF format:

wget http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2

Then download all the books (English, TXT format only):

wget -w 2 -m -H "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"

That command waits 2 seconds between each request. This is recommended by Project Gutenberg. With 45k+ books, the crawl takes some time.

Patents

Google has released the full set of US patent data behind its Google Patent search.

Trademarks

Google has also released US trademark data.

CINF 401 material by Joshua Eckroth is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Source code for this website available at GitHub.