On the R command line, you can use ? syntax to open documentation about functions.
?topic — open a manual page about a topic or function
??topic — search manual pages for mentions of the topic; use ?specific_topic after you find the right one
You can ask about operators also: ?`+`
Many of the features we will use in R are actually defined in external libraries. For example, string manipulation is provided by the stringr library.
Here are the necessary commands:
install.packages("stringr") — install a library (or “package”) if you have not already
library(stringr) — load a library so you can use its functions; notice stringr is not in quotes, unlike install.packages
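Sometimes you’ll see require(stringr) or similar instead of library(stringr). Both functions load a library, but library is the better choice: it stops with an error if the package is not installed, whereas require only returns FALSE with a warning, so the failure surfaces later when you call a function that does not exist. See this blog post for an explanation.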
Variable assignment should be done like this: x <- 5.
Or, you can do this: 5 -> z.
The data structures and data types in R are the most important features to understand.
R has these main types (classes actually):
numeric — a.k.a. floats, a.k.a. decimal values; integer exists, too, if you write 5L instead of 5, but it’s very uncommon (because we don’t do for loops in R, by the way)
character — a.k.a. strings
Date or POSIXct
logical — a.k.a. booleans (TRUE or T, and FALSE or F)
A note regarding the T and F shorthands for the TRUE and FALSE logical values: R for Everyone has an interesting statement about this:
R provides T and F as shortcuts for TRUE and FALSE, respectively, but it is best practice not to use them, as they are simply variables storing the values TRUE and FALSE and can be overwritten (!!), which can cause a great deal of frustration… (!! added by me)
You can query the type of a variable/value:
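A minimal sketch using class(); the variable x is only an illustration:

```r
x <- 5
class(x)           # "numeric"
class(5L)          # "integer"
class("hello")     # "character"
class(TRUE)        # "logical"
class(Sys.Date())  # "Date"
class(Sys.time())  # "POSIXct" "POSIXt"
```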
Vectors contain values of the same type, much like arrays in other languages. Here are some vectors:
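For instance (the names and values below are made up for illustration):

```r
nums  <- c(1.5, 2, 3.25)                  # numeric vector
words <- c("apple", "banana", "cherry")   # character vector
flags <- c(TRUE, FALSE, TRUE)             # logical vector

# Mixing types forces everything to a common type (here, character)
mixed <- c(1, "a", TRUE)
mixed         # "1" "a" "TRUE"
class(mixed)  # "character"
```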
R is a “vector language.” Most functions can be applied to an entire vector at once (and the function operates on each element). This is how we avoid explicit loops.
R has a list type, as well, which can hold values of different types. Most functions expect vectors, so we don’t often, if ever, use lists outright.
Here are some vector operations.
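A short sketch of element-wise operations on an illustrative vector v:

```r
v <- c(1, 2, 3, 4)

v * 2                  # 2 4 6 8
v + c(10, 20, 30, 40)  # 11 22 33 44
sqrt(v)                # square root of each element
v > 2                  # FALSE FALSE TRUE TRUE
v[v > 2]               # 3 4  (keep elements where the test is TRUE)
sum(v)                 # 10
```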
What is c()?
From ?c inside R:
This is a generic function which combines its arguments.
The default method combines its arguments to form a vector. All arguments are coerced to a common type which is the type of the returned value, and all attributes except names are removed.
So it’s a simple way to make a vector.
More about vectors
Almost everything in R is a vector. Even a simple number is a vector. See how it’s printed?
That [1] is giving index values for the vector. This vector has one thing in it. You can also index into it:
And you can ask for its length:
Even the output of length is a vector!
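A small sketch of the points above, using an illustrative variable x:

```r
x <- 5
x                     # prints: [1] 5
x[1]                  # 5  (indexing starts at 1)
length(x)             # 1
length(length(x))     # 1  (the result of length is itself a one-element vector)
is.vector(length(x))  # TRUE
```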
Anyway, those index numbers in the left column are more intuitive when you have many elements. We can produce many elements easily with the : operator:
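For example (the name nums is illustrative; how many values appear per printed row depends on your console width):

```r
nums <- 1:30
nums           # each printed row starts with the index of its first element
length(nums)   # 30
nums[25]       # 25
class(nums)    # "integer"
```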
The point about “everything is a vector” is that if you ask the type of a vector, it will give the type of its values.
A vector can contain NA values, which are like “null” but not the same.
Use the is.na function to ask if each value in a vector is NA or not. anyNA gives you one answer about the whole vector. na.omit returns the vector without NAs. It also includes metadata about which indices were omitted.
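A minimal sketch of these functions on a made-up vector v:

```r
v <- c(1, NA, 3, NA, 5)

is.na(v)   # FALSE TRUE FALSE TRUE FALSE
anyNA(v)   # TRUE

cleaned <- na.omit(v)
as.vector(cleaned)          # 1 3 5
attr(cleaned, "na.action")  # records that positions 2 and 4 were dropped

mean(v)                # NA, because missing values propagate
mean(v, na.rm = TRUE)  # 3
```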
NA is the appropriate value (or non-value) for missing data.
NULL cannot be put in a vector. It’s like void, except that a variable can equal NULL. Sometimes, you can give a function NULL for one of its arguments, and a function can return NULL. That’s all it’s good for.
Factors are vectors with the distinct values stored as metadata called “levels.” They are like an array of “enum” types in Java. The vector itself actually only contains integers, indicating which of the unique levels that value represents.
You can convert a vector into a factor like so:
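A sketch with an illustrative numeric vector (the original example is not shown here, so the values are assumptions):

```r
sizes <- c(10, 20, 10, 30, 20)
f <- factor(sizes)

f           # prints the values, followed by "Levels: 10 20 30"
levels(f)   # "10" "20" "30"  (levels are stored as character)
nlevels(f)  # 3
```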
Here is a factor of character types:
The factor prints as if it were the original vector, but as.integer shows the internal numeric IDs, and levels shows the levels that correspond to those IDs:
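For example, with a made-up factor of colors (levels are sorted alphabetically by default):

```r
colors <- factor(c("red", "blue", "red", "green"))

levels(colors)      # "blue" "green" "red"
as.integer(colors)  # 3 1 3 2

# Looking up each ID in the levels recovers the original strings
levels(colors)[as.integer(colors)]  # "red" "blue" "red" "green"
```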
Factors are most useful to us when we are creating plots with ggplot.
A data frame is like a spreadsheet. Rows and columns can be named. Typically, just columns are named. Internally, each column is a vector (of row values), and of course they are all the same length (the number of rows). Since each column is a different vector, each column can hold a different type of data (but only one type of data per column).
Let’s build a data frame:
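A sketch with three unnamed, made-up columns; the values are assumptions, chosen only so that R has to invent the column names:

```r
# Passing unnamed vectors makes R derive column names from the expressions
df <- data.frame(c(1, 2, 3), c("a", "b", "c"), c(TRUE, FALSE, TRUE))

df         # printed with auto-generated column names and row numbers 1..3
names(df)  # the auto-generated names
```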
Notice it gave us silly names for the columns. Let’s name them ourselves:
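One way to do that, again with a made-up data frame (the names id, label, and flag are hypothetical):

```r
df <- data.frame(c(1, 2, 3), c("a", "b", "c"), c(TRUE, FALSE, TRUE))
names(df) <- c("id", "label", "flag")

# Equivalent: name the columns when building the data frame
df2 <- data.frame(id = c(1, 2, 3),
                  label = c("a", "b", "c"),
                  flag = c(TRUE, FALSE, TRUE))

names(df)   # "id" "label" "flag"
```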
Also notice the output shows row numbers, which can be used as indices.
You can ask about the number of rows and columns using nrow and ncol. Note that length of this data frame would give 3 because it has three columns (a data frame is a list of column vectors). The dim function gives the rows and columns. The names function gives the column names (as a vector of character).
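A quick sketch on an illustrative data frame with four rows and three columns:

```r
df <- data.frame(id = 1:4,
                 label = c("a", "b", "c", "d"),
                 flag = c(TRUE, FALSE, TRUE, TRUE))

nrow(df)    # 4
ncol(df)    # 3
length(df)  # 3  (number of columns, not rows)
dim(df)     # 4 3
names(df)   # "id" "label" "flag"
```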
You can attach more columns with cbind:
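For example, adding a hypothetical score column to a small made-up data frame:

```r
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# The new vector must have one value per existing row
df <- cbind(df, score = c(90, 85, 77))
names(df)   # "id" "label" "score"
```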
The rbind function can add rows:
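A sketch with made-up data; the new row is given as a one-row data frame with matching column names:

```r
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

df <- rbind(df, data.frame(id = 4L, label = "d"))
nrow(df)   # 4
```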
You can cbind or rbind two data frames as well:
Subsetting and filtering
Use dframe[row,] to access a particular row in dframe (a data frame)
Use dframe[,col] or dframe[[col]] or dframe$col to access a particular column
Use dframe[row,col] to access a particular cell; note, a 1-element vector will result
Use dframe[row,col,drop=FALSE] to get a particular cell as a data frame
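The access forms listed above, sketched on a small made-up data frame:

```r
df <- data.frame(id = 1:3,
                 label = c("a", "b", "c"),
                 score = c(90, 85, 77))

df[2, ]                       # second row, still a data frame
df[, "score"]                 # score column as a vector
df[["score"]]                 # same
df$score                      # same
df[2, "score"]                # 85, a one-element vector
df[2, "score", drop = FALSE]  # the same cell, kept as a 1x1 data frame
```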
You can also subset a data frame by complex boolean expressions:
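A minimal sketch on made-up data; the logical expression goes in the row position, and subset() is an equivalent shorthand:

```r
df <- data.frame(id = 1:3,
                 label = c("a", "b", "c"),
                 score = c(90, 85, 77))

high   <- df[df$score > 80 & df$label != "a", ]
high_2 <- subset(df, score > 80 & label != "a")

high   # only the row with label "b"
```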
Reshaping data frames (melt and dcast)
The reshape2 package provides some powerful functions for dramatic transforms of data frames. These transformations come in two forms (which are inverses of each other): melting and casting.
Both melting and casting assume that your data frames consist only of “identifier” and “measured” variables or columns:
Identifier (id) variables are those that identify cases that have been measured. For example, id variables may be the person’s first name and last name plus date of birth.
Measured variables are those that are measured per case. A person’s height or weight or GPA would be measured variables since they do not identify the case (the person) but are measures of that person.
You can also think about id variables as those you might put on the x-axis, and measured variables as those you might put on the y-axis.
Often we have data frames that look like the built-in USArrests data frame. We use the head function here to look at the first few rows.
The first thing we’re going to do (which has nothing to do with melting/casting) is put the row names (the states) into their own column, and then remove row names.
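One way to do this, working on a copy named arrests (the copy's name is an assumption; the column name State matches the text):

```r
arrests <- USArrests
head(arrests)   # states appear as row names

arrests$State <- rownames(arrests)  # copy the row names into a column
rownames(arrests) <- NULL           # reset row names to 1, 2, 3, ...
head(arrests)
```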
Next, we’ll “melt” the data frame. The id column is “State”, the measured columns are “Murder”, “Assault”, “UrbanPop”, and “Rape”.
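A sketch of the melt step; the names arrests and arrests_m are illustrative. Naming only the id column lets melt treat every other column as measured:

```r
library(reshape2)

arrests <- USArrests
arrests$State <- rownames(arrests)
rownames(arrests) <- NULL

arrests_m <- melt(arrests, id.vars = "State")
head(arrests_m)
names(arrests_m)  # "State" "variable" "value"
nrow(arrests_m)   # 200  (50 states x 4 measured columns)
```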
Notice how each of the measure variables is on a row of its own, and we have new columns “variable” and “value”.
This format is easier to use with ggplot, which we’ll see later.
After “melting”, we can “cast” the melted data frame to all kinds of different forms. The dcast function (d for data frame) works as follows:
Formulas are written like this (a few variations listed):
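The variations in the original are not shown here, so the following is only a sketch of a few formula shapes, applied to a melted copy of USArrests (the names arrests_m, wide, wide_2, and means are assumptions):

```r
library(reshape2)

arrests <- USArrests
arrests$State <- rownames(arrests)
rownames(arrests) <- NULL
arrests_m <- melt(arrests, id.vars = "State")

# id ~ measure: one row per State, one column per measured variable
wide <- dcast(arrests_m, State ~ variable)

# "..." stands for every variable not otherwise mentioned (here, just State)
wide_2 <- dcast(arrests_m, ... ~ variable)

# "." means no column variable: one row per measure, aggregated with mean
means <- dcast(arrests_m, variable ~ ., fun.aggregate = mean)
```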
The parts before the ~ become id columns in the resulting data frame, and the parts after the ~ become the measure columns. A + combines two or more variables: A + B produces a column for A and a column for B, with rows listed like a “truth table”, so that for each value of A the rows go through all values of B.
The special syntax ... means “all variables not already listed” and . means “no variable”.
If a formula results in multiple values for each row (because you didn’t mention all id variables, for example), then you need to provide an “aggregating” function, e.g., mean to average the multiple values. If you do not provide such a function, length will be used, meaning it will count how many values match the formula.
Note, dcast assumes the values are found in the value column, as produced by melt. If your values live in a differently named column, name it with the value.var argument.
Here is another built-in data frame:
Let’s melt it on id variables “Chick”, “Diet”, and “Time”:
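A sketch of this melt (the name chick_m is an assumption). The only column left over as a measured variable is weight:

```r
library(reshape2)

head(ChickWeight)

chick_m <- melt(ChickWeight, id.vars = c("Chick", "Diet", "Time"))
head(chick_m)
unique(as.character(chick_m$variable))  # "weight"
```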
This casting gives the mean of “Time” vs. “variable” (which is only “weight”):
Here we have “Time” vs. “Diet”. The “Diet” unique values become columns.
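Both casts can be sketched as follows, assuming the melted data frame is called chick_m (a hypothetical name):

```r
library(reshape2)

chick_m <- melt(ChickWeight, id.vars = c("Chick", "Diet", "Time"))

# Mean weight at each Time: one row per Time, a single "weight" column
by_time <- dcast(chick_m, Time ~ variable, fun.aggregate = mean)

# Mean weight at each Time, split by Diet: one column per Diet value
by_diet <- dcast(chick_m, Time ~ Diet, fun.aggregate = mean)
names(by_diet)  # "Time" "1" "2" "3" "4"
```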
If we don’t provide mean as the aggregator, we’ll get a warning and it will default to length. This is because for each “Diet” value (1-4), there are 10-20 chicks and therefore 10-20 weight measures.
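A sketch of the default behavior, assuming a melted data frame named chick_m (hypothetical). dcast prints a notice that it is defaulting to length, and each cell becomes a count of weight measures:

```r
library(reshape2)

chick_m <- melt(ChickWeight, id.vars = c("Chick", "Diet", "Time"))

counts <- dcast(chick_m, Time ~ Diet)  # no aggregator given
counts[counts$Time == 0, ]             # number of chicks measured per diet at Time 0
```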
If you use library(plyr), you can also do subsets. Notice the subset = .(Time < 10) part.
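A sketch of a subsetted cast, assuming a melted data frame named chick_m (hypothetical); the .() quoting function comes from plyr:

```r
library(reshape2)
library(plyr)

chick_m <- melt(ChickWeight, id.vars = c("Chick", "Diet", "Time"))

# Only rows with Time < 10 take part in the cast
early <- dcast(chick_m, Time ~ Diet, fun.aggregate = mean,
               subset = .(Time < 10))
early$Time
```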
Aggregation on data frames
A different way to produce column means or sums or whatever, without using melt and dcast, is to use aggregate.
We’ll use the ChickWeight data frame again.
aggregate uses “formulas”, too, like dcast, but aggregate’s formulas are written the other way:
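A sketch of the formula direction: in aggregate, the measured column goes on the left of the ~ and the grouping (id) columns go on the right. The result names below are illustrative:

```r
# Mean weight per Diet
by_diet <- aggregate(weight ~ Diet, data = ChickWeight, FUN = mean)
by_diet

# Mean weight per Diet and Time combination
by_diet_time <- aggregate(weight ~ Diet + Time, data = ChickWeight, FUN = mean)
head(by_diet_time)
```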
Switching data frames to diamonds inside the ggplot2 library:
Let’s find the maximum carat per cut:
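A minimal sketch (the result name max_carat is an assumption):

```r
library(ggplot2)  # provides the diamonds data frame

max_carat <- aggregate(carat ~ cut, data = diamonds, FUN = max)
max_carat   # one row per cut
```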
If you want two measured columns, use cbind():
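For example, the maximum carat and the maximum price per cut (the choice of price as the second column is an assumption):

```r
library(ggplot2)  # provides the diamonds data frame

max_both <- aggregate(cbind(carat, price) ~ cut, data = diamonds, FUN = max)
max_both   # columns: cut, carat, price
```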
Next we’ll find the count of diamonds in the data frame with various clarities. We’ll use a bogus column cut just to do the aggregation, but use length to count how many cut values there are for each clarity. We could have used any column that’s not clarity to count up the same way.
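A sketch of that counting trick (the result name is illustrative). The column labeled cut in the output holds counts, not cut values:

```r
library(ggplot2)  # provides the diamonds data frame

clarity_counts <- aggregate(cut ~ clarity, data = diamonds, FUN = length)
clarity_counts

sum(clarity_counts$cut) == nrow(diamonds)  # TRUE: every diamond is counted once
```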
Take the tips data frame:
And produce these data frames with melt and dcast:
Now do the following with aggregate:
Merging data frames
We can combine or “merge” two data frames in a way similar to a relational database. You must specify a column (or columns) in both data frames that acts as the “key”.
Suppose we have these two data frames (from the example documentation ?merge):
We can create a new, merged data frame by combining the two on the “surname” column in authors and the “name” column in books:
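A sketch of the merge. The two data frames below are a simplified reconstruction of the ?merge example, so treat the exact values as illustrative; by.x and by.y name the key column in each data frame:

```r
authors <- data.frame(
  surname = c("Tukey", "Venables", "Tierney", "Ripley", "McNeil"),
  nationality = c("US", "Australia", "US", "UK", "Australia"),
  deceased = c("yes", "no", "no", "no", "no"))

books <- data.frame(
  name = c("Tukey", "Venables", "Tierney", "Ripley", "Ripley",
           "McNeil", "R Core"),
  title = c("Exploratory Data Analysis", "Modern Applied Statistics ...",
            "LISP-STAT", "Spatial Statistics", "Stochastic Simulation",
            "Interactive Data Analysis", "An Introduction to R"))

merged <- merge(authors, books, by.x = "surname", by.y = "name")
merged   # one row per matching author/book pair; the key is called "surname"
```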
Note that “R Core” is in books but not authors, so it’s left out of the merge. The all=TRUE option keeps it, filling the missing author columns with NA:
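A sketch with all = TRUE, using a trimmed-down, illustrative version of the two data frames:

```r
authors <- data.frame(
  surname = c("Tukey", "Venables", "Ripley"),
  nationality = c("US", "Australia", "UK"))

books <- data.frame(
  name = c("Tukey", "Venables", "Ripley", "Ripley", "R Core"),
  title = c("Exploratory Data Analysis", "Modern Applied Statistics ...",
            "Spatial Statistics", "Stochastic Simulation",
            "An Introduction to R"))

# all = TRUE keeps rows that have no match in the other data frame
merged_all <- merge(authors, books, by.x = "surname", by.y = "name",
                    all = TRUE)
merged_all[merged_all$surname == "R Core", ]  # nationality is NA
```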
You can merge on a key composed of 2+ columns if they’re named the same in both data frames. Here are two new data frames: