Note: See the corresponding lecture notes about R. This page has cookbook recipes.
- Someone else’s Cookbook for R
Avoiding console spamming, setting console width
> getOption("max.print")
[1] 99999
> options(max.print = 100)
> getOption("max.print")
[1] 100
> getOption("width")
[1] 80
> options(width = 120)
> getOption("width")
[1] 120
Running examples from documentation
Most functions have examples at the bottom of their help pages (e.g., ?prop.test). You can run these examples with the example function, e.g., example(prop.test).
Load big datasets across lots of CSV files
Contributed by Matt Samuels.
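The contributed code is not reproduced on this page. As a minimal sketch of one common base R approach, assuming all the CSV files sit in one directory and share the same columns (the helper name read_csv_dir and the demo files are hypothetical):

```r
# Hypothetical helper: read every CSV file in `dir` and stack the rows
# into one data frame. Assumes all files have identical columns.
read_csv_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)
  parts <- lapply(files, read.csv, stringsAsFactors = FALSE)
  do.call(rbind, parts)
}

# Demonstration with two small files written to a temporary directory
demo_dir <- file.path(tempdir(), "csv_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(demo_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(demo_dir, "b.csv"), row.names = FALSE)

combined <- read_csv_dir(demo_dir)
combined
```

Reading each file into a list and binding once at the end avoids growing a data frame inside a loop, which gets slow as the number of files increases.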
Clean up ZIP codes
Chop off +4 portion:
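The original snippet is missing here. A minimal sketch, assuming the ZIP codes are stored as character strings in the usual "12345-6789" form (the sample values are made up):

```r
zips <- c("32723-1234", "32720", "90210-0001")

# Remove a trailing hyphen followed by exactly four digits (the +4 portion);
# plain five-digit codes are left unchanged
zip5 <- sub("-[0-9]{4}$", "", zips)
zip5
```

Keeping ZIP codes as strings rather than numbers also preserves leading zeros.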
Convert a date to season
my.dates <- as.Date("2011-12-01", format = "%Y-%m-%d") + 0:60
head(getSeason(my.dates), 24)
 [1] "Fall"   "Fall"   "Fall"   "Fall"   "Fall"   "Fall"   "Fall"
 [8] "Fall"   "Fall"   "Fall"   "Fall"   "Fall"   "Fall"   "Fall"
[15] "Winter" "Winter" "Winter" "Winter" "Winter" "Winter"
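The getSeason function called above is not defined on this page. A minimal sketch, assuming fixed cutoffs on the 15th of December, March, June, and September (an assumption, chosen because it matches the switch from "Fall" to "Winter" after 14 days in the output above; astronomical seasons begin a few days later):

```r
# Sketch of a season lookup with assumed cutoff dates (the 15th of
# Dec/Mar/Jun/Sep); adjust the numbers if you need different boundaries.
getSeason <- function(dates) {
  # Month and day as one integer, e.g. December 15 -> 1215, March 1 -> 301
  md <- as.integer(format(dates, "%m%d"))
  ifelse(md >= 1215 | md < 315, "Winter",
    ifelse(md < 615, "Spring",
      ifelse(md < 915, "Summer", "Fall")))
}

getSeason(as.Date(c("2011-12-14", "2011-12-15", "2012-04-01", "2012-07-04")))
```

Comparing month-day numbers sidesteps the year entirely, so the same function works for dates from any year.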
Convert a vector into “bins”
The cut function allows you to rewrite a vector of values (as you might find in a data frame column) into particular “bins”. The example below takes a vector of values between 2 and 14 and returns a vector with each value replaced with the corresponding bin. You can set the bin limits (“breaks”) to whatever you want. You can also create labels for the bins; be sure to supply exactly one label per bin interval (one fewer than the number of breaks).
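As a minimal sketch of this use of cut (the values, breaks, and labels are made up for illustration):

```r
values <- c(2, 5, 7, 10, 14)

# Four breaks define three intervals: (0,5], (5,10], (10,15].
# Intervals are closed on the right by default, so 5 falls in the first bin.
binned <- cut(values,
              breaks = c(0, 5, 10, 15),
              labels = c("low", "medium", "high"))
as.character(binned)
```

The result is a factor; values outside every interval would become NA.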
“Transpose” a data frame
Reshape InsectSprays into a form with A, B, C, … columns. Contributed by Matt Samuels. Later updated to handle several
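The contributed code is not shown here. A minimal base R sketch of the same idea, assuming every spray has the same number of observations (true for the built-in InsectSprays dataset, which has 12 counts per spray):

```r
# InsectSprays is "long": one row per observation, with columns count and spray.
# split() groups the counts by spray; as.data.frame() turns each group into a column.
wide <- as.data.frame(split(InsectSprays$count, InsectSprays$spray))
head(wide)
```

This is not necessarily the contributor's method; it only works directly when the groups are all the same size.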
Run Python code from R
Contributed by Christian Decker.
You can also call R code from Python using the Python module rpy2.
Contributed by Michael Clay.
Contributed by Tom Wright.
Graph (network) manipulation
Contributed by John Salis.
GUIs with Tcl/Tk
Contributed by George Robbins.
Run queries against Google BigQuery
Contributed by Jacob Hell.
Google BigQuery is a SQL-like query system for huge datasets, stored on Google servers. Read more about BigQuery.
Here are some resources:
Read an image into a matrix
The img matrix will have dimensions (given by dim(img)) such as 500 375 3, meaning: 500 height, 375 width, 3 color channels (RGB).
To get particular pixel color values (for RGB separately):
You can convert the matrix to a data frame if you’d like, with each row representing a different pixel (x, y, r, g, b):
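A minimal sketch of the indexing and conversion described above, using a small synthetic array as a stand-in for a loaded image (the 4 x 3 size and the names pixels, x, and y are illustrative assumptions; a real img would come from whatever image reader you use):

```r
# Stand-in for a loaded image: height 4, width 3, 3 color channels (RGB)
img <- array(runif(4 * 3 * 3), dim = c(4, 3, 3))
dim(img)

# Color values of the pixel in row y = 2, column x = 3, one channel at a time
y <- 2
x <- 3
r <- img[y, x, 1]
g <- img[y, x, 2]
b <- img[y, x, 3]

# One row per pixel. as.vector() walks down each column first,
# so y varies fastest and x changes once per column.
height <- dim(img)[1]
width <- dim(img)[2]
pixels <- data.frame(
  x = rep(seq_len(width), each = height),
  y = rep(seq_len(height), times = width),
  r = as.vector(img[, , 1]),
  g = as.vector(img[, , 2]),
  b = as.vector(img[, , 3])
)
head(pixels)
```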
Crop or resize an image
Assuming you’ve loaded an image into a matrix, as above, you can crop very easily:
Note, in computer graphics, the y-coordinate increases as you go down the screen. So topY < bottomY in numeric value.
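A minimal sketch of cropping by array indexing, assuming img is a height x width x channels array as described earlier (the synthetic array and the names leftX and rightX are illustrative assumptions; topY and bottomY follow the text):

```r
# Stand-in for a loaded image: height 6, width 5, 3 color channels
img <- array(seq_len(6 * 5 * 3), dim = c(6, 5, 3))

# Crop boundaries; rows are y (top to bottom), columns are x (left to right)
topY <- 2
bottomY <- 4
leftX <- 1
rightX <- 3

# Keep the chosen rows and columns, and all color channels
cropped <- img[topY:bottomY, leftX:rightX, ]
dim(cropped)
```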
To resize, use the function below. Note, it needs the EBImage library which must be installed like so:
This function takes an argument giving the maximum width or height (if the image is taller than wide, it will resize so the height is the specified rsize; and vice versa):
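The resize function itself is not reproduced on this page. As a sketch of just the size calculation it describes, without calling EBImage (the helper name target_dims is hypothetical; rsize follows the text):

```r
# Hypothetical helper: scale height and width so that the longer side
# equals rsize while the aspect ratio is preserved.
target_dims <- function(height, width, rsize) {
  scale <- rsize / max(height, width)
  c(height = round(height * scale), width = round(width * scale))
}

target_dims(500, 375, 100)  # taller than wide: height becomes rsize
target_dims(375, 500, 100)  # wider than tall: width becomes rsize
```

The resulting dimensions would then be handed to whatever resizing routine you use.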
dplyr SQL-style queries
Contributed by Chris Finkle.
Yet another package developed by Hadley Wickham, dplyr is a data manipulation tool designed specifically to work with data frames and even SQL databases (hence the d in the name).
It has six main functions, which work much like the base R functions we’re used to but have better names and more intuitive inputs:
- Filter (pulls out rows that meet criteria)
- Select (pulls out specified columns; think SQL)
- Arrange (puts things in a specified order)
- Mutate (creates new values based on old ones)
- Summarise (like Aggregate, but cooler)
- Group_by (used with Summarise)
The other big selling point is the pipe-style input using %>%, which increases R code’s readability and flexibility.
Let’s see some examples! First, how about Filter? Let’s look at the reshape library’s tips dataset. Suppose we want to look at only the rows in which there were smokers in the party.
suppressMessages(library(reshape))
suppressMessages(library(dplyr))
head(tips)
filter(tips, smoker=="Yes") %>% head()
What about just on the weekdays?
filter(tips, smoker=="Yes", day %in% c("Thu", "Fri")) %>% head()
You’ll notice that in the previous example I used ‘head’ with the %>% pipe. This can be used to chain together an arbitrary number of commands, even ones that aren’t in dplyr. Let’s demonstrate Select using it. Suppose we only care about total_bill, tip, party size, and time of the meal, and then only when the bill was more than $20 or the tip was more than $3 (for some reason).
tips %>% select(total_bill, tip, size, time) %>% filter(total_bill > 20 | tip > 3) %>% head()
Now what if we want to arrange them in descending order of tip amount? Clearly such rearrangements have created problems in the past, since we needed a whole blog post to do them on assignment three. Not so here! Arrange is very straightforward:
tips %>% select(total_bill, tip) %>% arrange(desc(tip)) %>% head()
(Okay, so if you want to do absolutely anything with preservation of rownames, no dice. But rownames suck anyway.)
Now let’s try mutating. Remember, Mutate produces new values based on the ones you already have. An obvious choice on this dataset is finding what percentage the party tipped.
#this just prints the new result
tips %>% select(total_bill, tip) %>% mutate(percentage = tip/total_bill * 100) %>% head()
#this stores it
pct_tips <- tips %>% mutate(percentage = tip/total_bill*100)
head(pct_tips)
Finally, let’s look at Summarise and Group_by, which can do a lot of the heavy lifting we’ve relied on melt, dcast, and aggregate for. Specifically, let’s try doing a couple of the practice problems on the R page on the website. (The ones with aggregate as the recommended method)
#Dataframe 1
#(older dplyr versions used summarise_each(funs(mean), total_bill, tip), which is now deprecated)
tips %>% group_by(sex, day) %>% summarise(across(c(total_bill, tip), mean)) %>% head()
#Dataframe 2
tips %>% group_by(sex, day) %>% summarise(num_tips = n()) %>% arrange(day) %>% head()
#or (simpler but omits column name)
tips %>% group_by(sex, day) %>% tally() %>% arrange(day) %>% head()
#You can even do some mild reshaping by using group_by without summarise: behold data frame 4
tips %>% group_by(day) %>% select(smoker) %>% table() %>% head()
There are many more things you can do with dplyr, including window functions (which return a vector of values, e.g. lag or lead functions, cumulative aggregates), random samples, and connecting to proper SQL databases (select and filter commands can be converted directly into SQL by piping them into %>% explain(). It’s really neat!). For more information and useful links, visit this tutorial that I pretty much ripped off wholesale for this presentation.
Contributed by Matt Klumb.
library("RMySQL")
#Connects to the database; this example uses a local server.
mydb = dbConnect(MySQL(), user='root', password='password', dbname='R', host='127.0.0.1', port=3305)
#Executes a query to pull information from the database.
q1 <- dbSendQuery(mydb, "select * from student")
data <- fetch(q1)
data
##   id student_firstname student_lastname student_degreeID
## 1  1              Bill            Lopez                2
## 2  2             Frank          Johnson                2
## 3  3             Roger           Dogger                1
## 4  4              Mike          Trueman                1
q2 <- dbSendQuery(mydb, "select student_lastname from student")
data2 <- fetch(q2)
data2
##   student_lastname
## 1            Lopez
## 2          Johnson
## 3           Dogger
## 4          Trueman
#Executes a left join to pull data from two tables.
q3 <- dbSendQuery(mydb, "select student_firstname, degree_name FROM student LEFT JOIN degree ON student.student_degreeID = degree.degree_Id;")
data3 <- fetch(q3)
data3
##   student_firstname degree_name
## 1             Roger         CIS
## 2              Mike         CIS
## 3              Bill        HIST
## 4             Frank        HIST
#Sends information into the database using insert.
q4 <- dbSendQuery(mydb, "INSERT INTO student VALUES (NULL,'Mike','Trueman',1);")
#data still holds the rows fetched earlier; re-run the select to pick up later changes.
data
##   id student_firstname student_lastname student_degreeID
## 1  1              Bill            Lopez                2
## 2  2             Frank          Johnson                2
## 3  3             Roger           Dogger                1
## 4  4              Mike          Trueman                1
dbDisconnect(mydb)
## [1] TRUE
Contributed by Isaac Sarmiento.
A 440 Hz tone followed by a 220 Hz tone (these functions come from the tuneR package):
library(tuneR)
Wav <- bind(sine(440), sine(220))
show(Wav)
plot(Wav)
plot(extractWave(Wav, from = 1, to = 500))
waspec <- periodogram(Wav, normalize = TRUE, width = 64)
The colors represent the most important acoustic peaks for a given time frame, with red representing the highest energies, then in decreasing order of importance, orange, yellow, green, cyan, blue, and magenta, with gray areas having even less energy and white areas below a threshold decibel level.
Now for MP3
mp <- readMP3("Tribe.mp3")
mp
summary(mp)
plot(mp)
mpmono <- mono(mp, "right")
dmpmono <- downsample(mpmono, 20000)
summary(dmpmono)
wmp <- periodogram(dmpmono, normalize = TRUE, width = 64)
image(wmp, ylim = c(0, 2000))
Twitter with R
Contributed by Ou Zheng.
#key words
library(twitteR)
options(httr_oauth_cache = TRUE)
api_key <- "..."
api_secret <- "..."
access_token <- "..."
access_token_secret <- "..."
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
searchTwitter("iphone")

#user
library(twitteR)
Tweets <- userTimeline("realDonaldTrump", n=100)
head(Tweets)

#for follower
library(twitteR)
StetsonU <- getUser('StetsonU')
follow.su <- StetsonU$getFollowers(n=500)
df.su <- do.call('rbind', lapply(follow.su, as.data.frame))
#statusesCount
#followersCount
#friendsCount
#created
df.sub <- subset(df.su, friendsCount<2300 & followersCount<3000 & statusesCount<10000)
#df.sub <- df.su
df.sub$time <- as.Date(df.sub$created)
df.sub$ntime <- as.numeric(df.sub$time)
library(ggplot2)
p <- ggplot(df.sub, aes(x=time))
p + geom_histogram(fill='red', colour='black', binwidth=30)
p <- ggplot(df.sub, aes(x=friendsCount))
p + geom_histogram(fill='red', colour='black', binwidth=30)
p <- ggplot(data=df.sub, aes(x=friendsCount, y=followersCount))
p + geom_point(aes(size=statusesCount, colour=ntime), alpha=0.8)

#source
library(dplyr)
library(purrr)
library(twitteR)
api_key <- "..."
api_secret <- "..."
access_token <- "..."
access_token_secret <- "..."
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)
# We can request only 3200 tweets at a time; it will return fewer
# depending on the API
trump_tweets <- userTimeline("realDonaldTrump", n = 100)
trump_tweets_df <- tbl_df(map_df(trump_tweets, as.data.frame))
library(tidyr)
#We clean this data a bit, extracting the source application. (We’re looking
#only at the iPhone and Android tweets- a much smaller number are from the web
#client or iPad).
tweets <- trump_tweets_df %>%
  select(id, statusSource, text, created) %>%
  extract(statusSource, "source", "Twitter for (.*?)<") %>%
  filter(source %in% c("iPhone", "Android"))
library(lubridate)
library(scales)
library(ggplot2)
#Overall, this includes 628 tweets from iPhone, and 762 tweets from Android.
#One consideration is what time of day the tweets occur, which we’d expect to
#be a “signature” of their user. Here we can
#certainly spot a difference:
tweets %>%
  count(source, hour = hour(with_tz(created, "EST"))) %>%
  mutate(percent = n / sum(n)) %>%
  ggplot(aes(hour, percent, color = source)) +
  geom_line() +
  scale_y_continuous(labels = percent_format()) +
  labs(x = "Hour of day (EST)", y = "% of tweets", color = "")
Basic String Operations
Contributed by Chris Finkle.
In R, the default handling of strings is not terribly easy or intuitive. The stringr package makes things better: it provides the basic string functions and also makes using regex much easier.
str_c replicates R’s paste functionality but with some added options.
library(stringr)
str_c("Letter: ", letters)
str_c("Letter", letters, sep = ": ")
str_c(letters, " is for", "...")
str_c(letters[-26], " comes before ", letters[-1])
str_c(letters, collapse = "")
str_c(letters, collapse = ", ")
str_sub is like substr except it understands negative indices (you may be familiar with these from Python).
ex <- "This is an example string"
str_sub(ex, 5, 10)
str_sub(ex, -1)
str_sub(ex, -10)
str_sub(ex, end = -10)
str_trim removes whitespace from either end; str_pad adds it (cf. the infamous Leftpad).
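No example accompanies these two functions, so here is a minimal sketch (the input strings are made up; it assumes the stringr package is installed):

```r
library(stringr)

# str_trim strips whitespace from both ends by default, or from one side only
trimmed <- str_trim("  hello  ")
left_only <- str_trim("  hello  ", side = "left")

# str_pad pads on the left by default; `pad` sets the fill character
padded <- str_pad("7", width = 3, pad = "0")
right_padded <- str_pad("ab", width = 5, side = "right", pad = ".")

c(trimmed, left_only, padded, right_padded)
```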
str_detect returns a logical vector based on detection (or not) of a specified pattern.
ex <- c("This", "is", "an", "example", "vector")
str_detect(ex, "is")
str_detect(ex, "^[aeiou]")
str_locate works much like regexpr, returning a numeric matrix with the indices at which patterns occur.
ex <- c("This", "is", "an", "example", "vector", "ooooh")
str_locate(ex, "is")
#str_locate_all works like gregexpr and returns a list of matrices, one for each string searched
str_locate_all(ex, "[aeiou]+")
str_extract actually takes out the matching pattern.
str_match does so using capture groups.
str_replace replaces the matching text.
ex <- c("Star Wars", "Battlestar Galactica", "The secret of Eckroth's server names is they're Babylon 5 characters", "Star Trek")
str_extract(ex, "[Ss]tar ([:alpha:]+)")
str_match(ex, "[Ss]tar ([:alpha:]+)")
#[,1] is the whole pattern, [,2] is the first capture group
str_replace(ex, "[Ss]tar ([:alpha:]+)", "Star Wars Holiday Special")