R
Note: See the corresponding cookbook about R. This page has lecture notes.
Help system
On the R command line, you can use the ? syntax to open documentation about functions.
- ?topic: open the manual page about a topic or function
- ??topic: search manual pages for mentions of the topic; use ?specific_topic after you find the right one
You can ask about operators also: ?`+`
Libraries
Many of the features we will use in R are actually defined in external libraries. For example, string manipulation is provided by the stringr library.
Here are the necessary commands:
- install.packages("stringr"): install a library (or “package”) if you have not already
- library(stringr): load a library so you can use its functions; notice stringr is not in quotes, unlike in install.packages
Sometimes you’ll see require(stringr) or similar instead of library(stringr). These two functions basically do the same thing (load a library) but the library function is a better choice. See this blog post for an explanation.
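One practical difference can be sketched as follows (this is an illustration, not a summary of the blog post): when a package is missing, library stops with an error, while require only warns and returns FALSE, so a script may keep running and fail later in a less obvious place. The package name notARealPackage123 is made up for the demonstration.

```r
# Hypothetical package name, chosen so that it is certainly not installed
pkg <- "notARealPackage123"

# require() returns FALSE (normally with a warning) instead of stopping
loaded <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))
print(loaded)

# library() raises an error; we catch it here only to keep the demo running
outcome <- tryCatch({
  library(pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "error")
print(outcome)
```

The character.only = TRUE argument is needed here only because the package name is stored in a variable.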
Variables
Variable assignment should be done like this: x <- 5.
Or, you can do this: 5 -> z.
Data structures
The data structures and data types in R are the most important features to understand.
Data types
R has these main types (classes actually):
- numeric: a.k.a. floats, a.k.a. decimal values; integer exists, too, if you write 5L instead of 5, but it’s very uncommon (because we don’t do for loops in R, by the way)
- character: a.k.a. strings
- Date or POSIXct
- logical: a.k.a. booleans (TRUE or T, and FALSE or F)
Note, regarding the T and F shorthands for the TRUE and FALSE logical values: R for Everyone has an interesting statement about this:
R provides T and F as shortcuts for TRUE and FALSE, respectively, but it is best practice not to use them, as they are simply variables storing the values TRUE and FALSE and can be overwritten (!!), which can cause a great deal of frustration… (!! added by me)
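A minimal sketch of what “can be overwritten” means in practice. The variable names shadowed and restored are only for illustration.

```r
T <- 0                  # legal: T is an ordinary variable, so this shadows the built-in T
shadowed <- isTRUE(T)   # FALSE, because T is now the number 0

rm(T)                   # delete our variable; T refers to the built-in value again
restored <- isTRUE(T)   # TRUE

# TRUE itself is a reserved word; uncommenting the next line produces an error
# TRUE <- 0

print(c(shadowed = shadowed, restored = restored))
```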
You can query the type of a variable/value:
> class(5)
[1] "numeric"
> typeof(5)
[1] "double"
> class(5L)
[1] "integer"
> typeof(5L)
[1] "integer"
> class("foo")
[1] "character"
> typeof("foo")
[1] "character"
> class(TRUE)
[1] "logical"
> typeof(TRUE)
[1] "logical"
Vectors
Vectors contain values of the same type, much like arrays in other languages. Here are some vectors:
v1 <- c(3, 7, 12)
v2 <- c("foo", "bar", "baz", "quux")
v3 <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
R is a “vector language.” Most functions can be applied to an entire vector at once (and the function operates on each element). This is how we avoid explicit loops.
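To make the “no explicit loops” point concrete, here is a small sketch that computes square roots twice, once with a for loop and once by calling sqrt on the whole vector. The names roots_loop and roots_vec are just for this example.

```r
v1 <- c(3, 7, 12)

# Explicit loop: fill a result vector one element at a time
roots_loop <- numeric(length(v1))
for (i in seq_along(v1)) {
  roots_loop[i] <- sqrt(v1[i])
}

# Vectorized: sqrt() is applied to every element in a single call
roots_vec <- sqrt(v1)

# Both approaches give the same values
print(all.equal(roots_loop, roots_vec))
```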
R has a list type, as well, which can hold values of different types. Most functions expect vectors, so we don’t often, if ever, use lists outright.
Here are some vector operations.
v1 <- c(3, 7, 12)
v1 + 5 # +5 is applied to each element
v1 * 5
v1 / 5
sqrt(v1)
v4 <- c(8, 1, 2) # same length as v1
v1 + v4 # works element-by-element
v1 * v4
v1 / v4
v1 ^ v4
v1 < v4
length(v1)
all(v1 < v4)
any(v1 < v4)
max(v1)
min(v1)
mean(v1)
summary(v1)
# get individual elements, or ranges
v1[1] # oh no! indexing starts at 1 !!!
v1[1:3] # a range of elements
v1[c(1, 3)] # get two specific elements
What is c()?
From ?c inside R:
This is a generic function which combines its arguments.
The default method combines its arguments to form a vector. All arguments are coerced to a common type which is the type of the returned value, and all attributes except names are removed.
So it’s a simple way to make a vector.
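The coercion to a common type mentioned in the help text can be seen directly. This sketch mixes types inside c() and, for contrast, puts the same values in a list; the names mixed and kept are illustrative.

```r
# Mixing types in c() forces every element to one common type
print(class(c(1, 2L)))       # integer joins numeric
print(class(c(1, TRUE)))     # logical joins numeric

mixed <- c(1, "a", TRUE)     # everything becomes character
print(mixed)                 # "1" "a" "TRUE"

# A list keeps each value's own type
kept <- list(1, "a", TRUE)
print(sapply(kept, class))
```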
More about vectors
Almost everything in R is a vector. Even a simple number is a vector. See how it’s printed?
> 5
[1] 5
That [1] is giving index values for the vector. This vector has one thing in it. You can also index into it:
> 5[1] # whoa!
[1] 5
And you can ask for its length:
> length(5)
[1] 1
Even the output of length is a vector!
Anyway, those index numbers in the left column are more intuitive when you have many elements. We can produce many elements easily with the : operator:
> 1:10
[1] 1 2 3 4 5 6 7 8 9 10
> 50:-50
[1] 50 49 48 47 46 45 44 43 42 41 40 39 38 37 36 35 34 33
[19] 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15
[37] 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 -1 -2 -3
[55] -4 -5 -6 -7 -8 -9 -10 -11 -12 -13 -14 -15 -16 -17 -18 -19 -20 -21
[73] -22 -23 -24 -25 -26 -27 -28 -29 -30 -31 -32 -33 -34 -35 -36 -37 -38 -39
[91] -40 -41 -42 -43 -44 -45 -46 -47 -48 -49 -50
The point about “everything is a vector” is that if you ask the type of a vector, it will give the type of its values.
# we'll use seq(), a generalization of :
> seq(2.2, 5.5, 0.071)
[1] 2.200 2.271 2.342 2.413 2.484 2.555 2.626 2.697 2.768 2.839 2.910 2.981
[13] 3.052 3.123 3.194 3.265 3.336 3.407 3.478 3.549 3.620 3.691 3.762 3.833
[25] 3.904 3.975 4.046 4.117 4.188 4.259 4.330 4.401 4.472 4.543 4.614 4.685
[37] 4.756 4.827 4.898 4.969 5.040 5.111 5.182 5.253 5.324 5.395 5.466
> class(seq(2.2, 5.5, 0.071))
[1] "numeric"
> typeof(seq(2.2, 5.5, 0.071))
[1] "double"
NA
A vector can contain NA values, which are like “null” but not the same.
# add some NAs to a vector
> v5 <- c(5, 1, NA, 7, NA, NA, 8)
# using a bad index gives NAs back
> v5[3:17]
[1] NA 7 NA NA 8 NA NA NA NA NA NA NA NA NA NA
Use the is.na function to ask if each value in a vector is NA or not. anyNA gives you one answer about the whole vector. na.omit returns the vector without NAs. It also includes metadata about which indices were omitted.
> v6 <- na.omit(v5)
> v6
[1] 5 1 7 8
attr(,"na.action")
[1] 3 5 6
attr(,"class")
[1] "omit"
NA is the appropriate value (or non-value) for missing data.
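A short sketch of the NA helpers on the v5 vector from above. The na.rm argument of mean is not covered in these notes; it is shown here as one common way to skip missing values.

```r
v5 <- c(5, 1, NA, 7, NA, NA, 8)

print(is.na(v5))               # one TRUE/FALSE per element
print(anyNA(v5))               # a single answer for the whole vector
print(sum(is.na(v5)))          # how many values are missing

print(mean(v5))                # NA: one missing value makes the mean missing
print(mean(v5, na.rm = TRUE))  # 5.25, the mean of 5, 1, 7, 8
```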
NULL
NULL cannot be put in a vector (c() simply drops it). It’s like void, except that a variable can equal NULL. Sometimes, you can give a function NULL for one of its arguments, and a function can return NULL. That’s all it’s good for.
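The difference between NULL and NA shows up as soon as you try to store them. A small sketch; the function returns_nothing is a made-up example.

```r
with_null <- c(1, NULL, 2)   # NULL is dropped
with_na   <- c(1, NA, 2)     # NA occupies a slot

print(length(with_null))     # 2
print(length(with_na))       # 3
print(length(NULL))          # 0

# A hypothetical function that returns NULL
returns_nothing <- function() NULL
print(is.null(returns_nothing()))
```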
Factors
Factors are vectors with the distinct values stored as metadata called “levels.” They are like an array of “enum” types in Java. The vector itself actually only contains integers, indicating which of the unique levels that value represents.
You can convert a vector into a factor like so:
> as.factor(c(1, 2, 3, 3, 2, 1))
[1] 1 2 3 3 2 1
Levels: 1 2 3
Here is a factor of character types:
> as.factor(c("foo", "bar", "foo", "baz", "quux", "foo", "baz"))
[1] foo bar foo baz quux foo baz
Levels: bar baz foo quux
The factor prints as if it was the original vector, but if you use as.integer you can see the internal numeric IDs, and levels shows the levels that correspond to those IDs:
> as.integer(as.factor(c("foo", "bar", "foo", "baz", "quux", "foo", "baz")))
[1] 3 1 3 2 4 3 2
> levels(as.factor(c("foo", "bar", "foo", "baz", "quux", "foo", "baz")))
[1] "bar" "baz" "foo" "quux"
Factors are most useful to us when we are creating plots with ggplot.
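Because a factor stores integer IDs rather than the original values, converting a factor of numbers back with as.integer returns the IDs, not the numbers. A sketch of that pitfall and one common workaround (going through as.character first); the names f, codes, and values are illustrative.

```r
f <- as.factor(c(10, 20, 10, 30))

codes <- as.integer(f)                  # internal IDs: 1 2 1 3
values <- as.numeric(as.character(f))   # original numbers: 10 20 10 30

print(levels(f))   # "10" "20" "30"
print(codes)
print(values)
```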
Data frames
A data frame is like a spreadsheet. Rows and columns can be named. Typically, just columns are named. Internally, each column is a vector (of row values), and of course they are all the same length (the number of rows). Since each column is a different vector, each column can hold a different type of data (but only one type of data per column).
Let’s build a data frame:
> data.frame(c(10, 20, 30, 80), c(40, 50, 60, 70), c(70, 80, 90, 100))
c.10..20..30..80. c.40..50..60..70. c.70..80..90..100.
1 10 40 70
2 20 50 80
3 30 60 90
4 80 70 100
Notice it gave us silly names for the columns. Let’s name them ourselves:
> data.frame(Foo = c(10, 20, 30, 80), Bar = c(40, 50, 60, 70), Baz = c(70, 80, 90, 100))
Foo Bar Baz
1 10 40 70
2 20 50 80
3 30 60 90
4 80 70 100
Also notice the output shows row numbers, which can be used as indices.
You can ask about the number of rows and columns using nrow and ncol. Note that length of this data frame would give 3 because it has three columns (a data frame is a list of column vectors). The dim function gives the rows and columns. The names function gives the column names (as a character vector).
> d <- data.frame(Foo = c(10, 20, 30, 80), Bar = c(40, 50, 60, 70), Baz = c(70, 80, 90, 100))
> nrow(d)
[1] 4
> ncol(d)
[1] 3
> length(d)
[1] 3
> typeof(d)
[1] "list"
> class(d)
[1] "data.frame"
> dim(d)
[1] 4 3
> names(d)
[1] "Foo" "Bar" "Baz"
You can attach more columns with cbind:
> cbind(d, c(3, 4, 2, 1))
Foo Bar Baz c(3, 4, 2, 1)
1 10 40 70 3
2 20 50 80 4
3 30 60 90 2
4 80 70 100 1
# that gave the new column a silly name; let's try again
> cbind(d, Quux = c(3, 4, 2, 1))
Foo Bar Baz Quux
1 10 40 70 3
2 20 50 80 4
3 30 60 90 2
4 80 70 100 1
The rbind function can add rows:
> rbind(d, c(11, 12, 13), c(15, 16, 17))
Foo Bar Baz
1 10 40 70
2 20 50 80
3 30 60 90
4 80 70 100
5 11 12 13
6 15 16 17
You can cbind or rbind two data frames as well:
> cbind(d, d)
Foo Bar Baz Foo Bar Baz
1 10 40 70 10 40 70
2 20 50 80 20 50 80
3 30 60 90 30 60 90
4 80 70 100 80 70 100
> rbind(d, d)
Foo Bar Baz
1 10 40 70
2 20 50 80
3 30 60 90
4 80 70 100
5 10 40 70
6 20 50 80
7 30 60 90
8 80 70 100
Subsetting and filtering
- Use dframe[row,] to access a particular row in dframe (a data frame)
- Use dframe[,col] or dframe[[col]] or dframe$col to access a particular column
- Use dframe[row,col] to access a particular cell; note, a 1-element vector will result
- Use dframe[row,col,drop=FALSE] to get a particular cell as a data frame
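The access forms listed above, applied to the small data frame d from earlier (rebuilt here so the sketch runs on its own):

```r
d <- data.frame(Foo = c(10, 20, 30, 80),
                Bar = c(40, 50, 60, 70),
                Baz = c(70, 80, 90, 100))

print(d[2, ])                      # row 2, still a data frame
print(d[, "Bar"])                  # column as a vector
print(d[["Bar"]])                  # same column
print(d$Bar)                       # same column
print(d[2, "Bar"])                 # single cell as a 1-element vector: 50
print(d[2, "Bar", drop = FALSE])   # single cell as a 1x1 data frame
```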
You can also subset a data frame by complex boolean expressions:
> d
Foo Bar Baz
1 10 40 70
2 20 50 80
3 30 60 90
4 80 70 100
> subset(d, Foo >= 10 & Bar <= 50)
Foo Bar Baz
1 10 40 70
2 20 50 80
> subset(d, Foo >= 10 & Bar <= 50, c("Foo", "Baz"))
Foo Baz
1 10 70
2 20 80
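The same filter can also be written with plain bracket indexing, using a logical vector for the rows and a character vector for the columns. This is a sketch of the equivalence on the same d; the names keep, via_brackets, and via_subset are illustrative.

```r
d <- data.frame(Foo = c(10, 20, 30, 80),
                Bar = c(40, 50, 60, 70),
                Baz = c(70, 80, 90, 100))

# One TRUE/FALSE per row
keep <- d$Foo >= 10 & d$Bar <= 50

via_brackets <- d[keep, c("Foo", "Baz")]
via_subset   <- subset(d, Foo >= 10 & Bar <= 50, c("Foo", "Baz"))

print(via_brackets)
print(all.equal(via_brackets, via_subset))
```

Inside subset you can write column names bare (Foo, Bar); with brackets you must spell out d$Foo and d$Bar.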
Reshaping data frames (melt and dcast)
The reshape2 package provides some powerful functions for dramatic transforms of data frames. These transformations come in two forms (which are inverses of each other): melting and casting.
Both melting and casting assume that your data frames consist only of “identifier” and “measured” variables or columns:
- Identifier (id) variables are those that identify cases that have been measured. For example, id variables may be the person’s first name and last name plus date of birth.
- Measured variables are those that are measured per case. A person’s height or weight or GPA would be measure variables since they do not identify the case (the person) but are measures of that person.
You can also think about id variables as those you might put on the x-axis, and measure variables as those you might put on the y-axis.
Often we have data frames that look like the following predefined data frame, USArrests. We use the head function here to look at the first few rows.
> head(USArrests)
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
The first thing we’re going to do (which has nothing to do with melting/casting) is put the row names (the states) into their own column, and then remove row names.
> d <- cbind(State=rownames(USArrests), USArrests)
> head(d)
State Murder Assault UrbanPop Rape
Alabama Alabama 13.2 236 58 21.2
Alaska Alaska 10.0 263 48 44.5
Arizona Arizona 8.1 294 80 31.0
Arkansas Arkansas 8.8 190 50 19.5
California California 9.0 276 91 40.6
Colorado Colorado 7.9 204 78 38.7
> rownames(d) <- NULL
> head(d)
State Murder Assault UrbanPop Rape
1 Alabama 13.2 236 58 21.2
2 Alaska 10.0 263 48 44.5
3 Arizona 8.1 294 80 31.0
4 Arkansas 8.8 190 50 19.5
5 California 9.0 276 91 40.6
6 Colorado 7.9 204 78 38.7
Next, we’ll “melt” the data frame. The id column is “State”, the measured columns are “Murder”, “Assault”, “UrbanPop”, and “Rape”.
> library(reshape2)
> dmelt <- melt(d, c("State"), c("Murder", "Assault", "UrbanPop", "Rape"))
> head(dmelt)
State variable value
1 Alabama Murder 13.2
2 Alaska Murder 10.0
3 Arizona Murder 8.1
4 Arkansas Murder 8.8
5 California Murder 9.0
6 Colorado Murder 7.9
Notice how each of the measure variables is on a row of its own, and we have new columns “variable” and “value”.
This format is easier to use with ggplot, which we’ll see later.
Casting
After “melting”, we can “cast” the melted data frame to all kinds of different forms. The dcast function (d for data frame) works as follows:
> dcast(dataframe, formula)
Formulas are written like this (a few variations listed):
id-column-1 ~ measure-column-1
id-column-1 + id-column-2 ~ measure-column-1 + measure-column-2
id-column-1 + id-column-2 ~ ...
. ~ measure-column-1 + measure-column-2
etc.
The parts before the ~ become id columns in the resulting data frame, and the parts after the ~ become the measure columns. A + means make 2+ columns, just like a “truth table” where A + B means make columns A and B and list the rows so that for each value of A, go through all values of B.
The special syntax ... means “all variables not already listed” and . means “no variable”.
If a formula results in multiple values for each row (because you didn’t mention all id variables, for example), then you need to provide an “aggregating” function, e.g., mean to average the multiple values. If you do not provide such a function, length will be used, meaning it will count how many values match the formula.
Note, dcast assumes the values are found in the value column, as produced by melt.
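As a sketch of a formula with two id columns joined by +, the following melts the built-in ChickWeight data frame and casts it with Diet + Time on the left. The aggregating function and the value column are passed by name here (fun.aggregate and value.var are dcast's argument names); the variable names cmelt and by_diet_time are illustrative.

```r
library(reshape2)

# Melt with named arguments: three id columns, one measured column
cmelt <- melt(ChickWeight,
              id.vars = c("Chick", "Diet", "Time"),
              measure.vars = "weight")

# One row per Diet/Time combination; chicks are averaged away
by_diet_time <- dcast(cmelt, Diet + Time ~ variable,
                      fun.aggregate = mean,
                      value.var = "value")

print(head(by_diet_time))
```

Since Chick is left out of the formula, each Diet/Time combination has several values, which is why an aggregating function is supplied.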
# get the original data frame back
> head(dcast(dmelt, State ~ variable))
State Murder Assault UrbanPop Rape
1 Alabama 13.2 236 58 21.2
2 Alaska 10.0 263 48 44.5
3 Arizona 8.1 294 80 31.0
4 Arkansas 8.8 190 50 19.5
5 California 9.0 276 91 40.6
6 Colorado 7.9 204 78 38.7
# flip the data frame (states as columns)
> head(dcast(dmelt, variable ~ State))
variable Alabama Alaska Arizona Arkansas California Colorado ...
1 Murder 13.2 10.0 8.1 8.8 9.0 7.9
2 Assault 236.0 263.0 294.0 190.0 276.0 204.0
3 UrbanPop 58.0 48.0 80.0 50.0 91.0 78.0
4 Rape 21.2 44.5 31.0 19.5 40.6 38.7
# we can supply an aggregation function;
# the . means no id variable, i.e., all states combined
> head(dcast(dmelt, . ~ variable, mean))
. Murder Assault UrbanPop Rape
1 . 7.788 170.76 65.54 21.232
Here is another built-in data frame:
> head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
Let’s melt it on id variables “Chick”, “Diet”, and “Time”:
> cmelt <- melt(ChickWeight, c("Chick", "Diet", "Time"), c("weight"))
> head(cmelt)
Chick Diet Time variable value
1 1 1 0 weight 42
2 1 1 2 weight 51
3 1 1 4 weight 59
4 1 1 6 weight 64
5 1 1 8 weight 76
6 1 1 10 weight 93
This cast gives the mean value for each “Time”, with one column per “variable” (which is only “weight”):
> head(dcast(cmelt, Time ~ variable, mean))
Time weight
1 0 41.06000
2 2 49.22000
3 4 59.95918
4 6 74.30612
5 8 91.24490
6 10 107.83673
Here we have “Time” vs. “Diet”. The “Diet” unique values become columns.
> head(dcast(cmelt, Time ~ Diet, mean))
Time 1 2 3 4
1 0 41.40000 40.7 40.8 41.0
2 2 47.25000 49.4 50.4 51.8
3 4 56.47368 59.8 62.2 64.5
4 6 66.78947 75.4 77.9 83.9
5 8 79.68421 91.7 98.4 105.6
6 10 93.05263 108.5 117.1 126.0
If we don’t provide mean as the aggregator, we’ll get a warning and it will default to length. This is because for each “Diet” value (1-4), there are 10-20 chicks and therefore 10-20 weight measures.
> head(dcast(cmelt, Time ~ Diet))
Aggregation function missing: defaulting to length
Time 1 2 3 4
1 0 20 10 10 10
2 2 20 10 10 10
3 4 19 10 10 10
4 6 19 10 10 10
5 8 19 10 10 10
6 10 19 10 10 10
If you use library(plyr), you can also do subsets. Notice the subset = .(Time < 10) part.
> library(plyr)
> head(dcast(cmelt, Time ~ Diet, mean, subset = .(Time < 10)))
Time 1 2 3 4
1 0 41.40000 40.7 40.8 41.0
2 2 47.25000 49.4 50.4 51.8
3 4 56.47368 59.8 62.2 64.5
4 6 66.78947 75.4 77.9 83.9
5 8 79.68421 91.7 98.4 105.6
Aggregation on data frames
A different way to produce column means or sums or whatever, without using melt and dcast, is to use aggregate.
We’ll use the ChickWeight data frame again.
> head(ChickWeight)
weight Time Chick Diet
1 42 0 1 1
2 51 2 1 1
3 59 4 1 1
4 64 6 1 1
5 76 8 1 1
6 93 10 1 1
aggregate uses “formulas”, too, like dcast, but aggregate’s formulas are written the other way:
measure-column-1 ~ id-column-1
cbind(measure-column-1, measure-column-2) ~ id-column-1 + id-column-2
Here we go:
# find average weight per diet
> aggregate(weight ~ Diet, ChickWeight, mean)
Diet weight
1 1 102.6455
2 2 122.6167
3 3 142.9500
4 4 135.2627
# find average weight per diet per time
> head(aggregate(weight ~ Diet + Time, ChickWeight, mean))
Diet Time weight
1 1 0 41.40
2 2 0 40.70
3 3 0 40.80
4 4 0 41.00
5 1 2 47.25
6 2 2 49.40
Switching data frames to diamonds inside the ggplot2 library:
> library(ggplot2)
> head(diamonds)
carat cut color clarity depth table price x y z
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Let’s find the maximum carat per cut:
> aggregate(carat ~ cut, diamonds, max)
cut carat
1 Fair 5.01
2 Good 3.01
3 Very Good 4.00
4 Premium 4.01
5 Ideal 3.50
If you want two measured columns, use cbind():
> aggregate(cbind(carat, depth) ~ cut, diamonds, max)
cut carat depth
1 Fair 5.01 79.0
2 Good 3.01 67.0
3 Very Good 4.00 64.9
4 Premium 4.01 63.0
5 Ideal 3.50 66.7
Next we’ll find the count of diamonds in the data frame with various clarities. We’ll use a bogus column cut just to do the aggregation, but use length to count how many cut values there are for each clarity. We could have used any column that’s not clarity to count up the same way.
> aggregate(cut ~ clarity, diamonds, length)
clarity cut
1 I1 741
2 SI2 9194
3 SI1 13065
4 VS2 12258
5 VS1 8171
6 VVS2 5066
7 VVS1 3655
8 IF 1790
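If counting is all you need, base R's table function is an alternative that avoids the bogus column. This sketch uses the built-in ChickWeight data frame (so it runs without ggplot2) and checks that both approaches agree; table is not otherwise covered in these notes.

```r
# Count rows per Diet using aggregate with a bogus measured column
by_aggregate <- aggregate(weight ~ Diet, ChickWeight, length)

# Count rows per Diet directly
by_table <- table(ChickWeight$Diet)

print(by_aggregate)
print(by_table)
print(all(by_aggregate$weight == as.vector(by_table)))
```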
Practice
Take the tips data frame (it is included with the reshape2 package):
> head(tips)
total_bill tip sex smoker day time size
1 16.99 1.01 Female No Sun Dinner 2
2 10.34 1.66 Male No Sun Dinner 3
3 21.01 3.50 Male No Sun Dinner 3
4 23.68 3.31 Male No Sun Dinner 2
5 24.59 3.61 Female No Sun Dinner 4
6 25.29 4.71 Male No Sun Dinner 4
And produce these data frames with melt and dcast:
# Data frame 1 (melt + dcast)
sex day total_bill tip
1 Female Fri 14.14556 2.781111 <--- total_bill & tip are means
2 Female Sat 19.68036 2.801786
3 Female Sun 19.87222 3.367222
4 Female Thur 16.71531 2.575625
5 Male Fri 19.85700 2.693000
6 Male Sat 20.80254 3.083898
7 Male Sun 21.88724 3.220345
8 Male Thur 18.71467 2.980333
# Data frame 2 (melt + dcast)
sex day total_bill tip
1 Female Fri 127.31 25.03 <--- total_bill & tip are sums
2 Female Sat 551.05 78.45
3 Female Sun 357.70 60.61
4 Female Thur 534.89 82.42
5 Male Fri 198.57 26.93
6 Male Sat 1227.35 181.95
7 Male Sun 1269.46 186.78
8 Male Thur 561.44 89.41
# Data frame 3 (melt + dcast)
sex Fri_tip Sat_tip Sun_tip Thur_tip
1 Female 2.781111 2.801786 3.367222 2.575625 <--- these are means
2 Male 2.693000 3.083898 3.220345 2.980333
# Data frame 4 (melt + dcast)
# (note: this double-counts the records since there are two measured variables per record)
day No Yes <--- No = non smoker, Yes = smoker
1 Fri 8 30 <--- these are counts (length)
2 Sat 90 84
3 Sun 114 38
4 Thur 90 34
# Data frame 5 (melt + dcast)
day time No Yes <--- No = non smoker, Yes = smoker
1 Fri Dinner 6 18 <--- these are counts (length)
2 Fri Lunch 2 12
3 Sat Dinner 90 84
4 Sun Dinner 114 38
5 Thur Dinner 2 0
6 Thur Lunch 88 34
Now do the following with aggregate:
# Data frame 1 (aggregate)
sex day total_bill tip
1 Female Fri 14.14556 2.781111 <--- these are means
2 Male Fri 19.85700 2.693000
3 Female Sat 19.68036 2.801786
4 Male Sat 20.80254 3.083898
5 Female Sun 19.87222 3.367222
6 Male Sun 21.88724 3.220345
7 Female Thur 16.71531 2.575625
8 Male Thur 18.71467 2.980333
# Data frame 2 (aggregate)
sex day tip <--- bogus column name, just counting male/female per day
1 Female Fri 9
2 Male Fri 10
3 Female Sat 28
4 Male Sat 59
5 Female Sun 18
6 Male Sun 58
7 Female Thur 32
8 Male Thur 30
Merging data frames
We can combine or “merge” two data frames in a way similar to a relational database. You must specify a column (or columns) in both data frames that acts as the “key”.
Suppose we have these two data frames (from the examples in the ?merge documentation):
> authors <- data.frame(
+ surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
+ nationality = c("US", "Australia", "US", "UK", "Australia"),
+ deceased = c("yes", rep("no", 4)))
> books <- data.frame(
+ name = I(c("Tukey", "Venables", "Tierney",
+ "Ripley", "Ripley", "McNeil", "R Core")),
+ title = c("Exploratory Data Analysis",
+ "Modern Applied Statistics ...",
+ "LISP-STAT",
+ "Spatial Statistics", "Stochastic Simulation",
+ "Interactive Data Analysis",
+ "An Introduction to R"),
+ other.author = c(NA, "Ripley", NA, NA, NA, NA,
+ "Venables & Smith"))
> authors
surname nationality deceased
1 Tukey US yes
2 Venables Australia no
3 Tierney US no
4 Ripley UK no
5 McNeil Australia no
> books
name title other.author
1 Tukey Exploratory Data Analysis <NA>
2 Venables Modern Applied Statistics ... Ripley
3 Tierney LISP-STAT <NA>
4 Ripley Spatial Statistics <NA>
5 Ripley Stochastic Simulation <NA>
6 McNeil Interactive Data Analysis <NA>
7 R Core An Introduction to R Venables & Smith
We can create a new, merged data frame by combining the two on the “surname” column in authors and the “name” column in books:
> merge(authors, books, by.x="surname", by.y="name")
surname nationality deceased title other.author
1 McNeil Australia no Interactive Data Analysis <NA>
2 Ripley UK no Spatial Statistics <NA>
3 Ripley UK no Stochastic Simulation <NA>
4 Tierney US no LISP-STAT <NA>
5 Tukey US yes Exploratory Data Analysis <NA>
6 Venables Australia no Modern Applied Statistics ... Ripley
Note that “R Core” is in books but not authors, so it’s left out of the merge. The all=TRUE option keeps it:
> merge(authors, books, by.x="surname", by.y="name", all=TRUE)
surname nationality deceased title other.author
1 McNeil Australia no Interactive Data Analysis <NA>
2 R Core <NA> <NA> An Introduction to R Venables & Smith
3 Ripley UK no Spatial Statistics <NA>
4 Ripley UK no Stochastic Simulation <NA>
5 Tierney US no LISP-STAT <NA>
6 Tukey US yes Exploratory Data Analysis <NA>
7 Venables Australia no Modern Applied Statistics ... Ripley
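Besides all=TRUE, merge also accepts all.x and all.y to keep unmatched rows from only one side, much like left and right joins in a database. A sketch with two tiny data frames; the rows (including the author “Doe”) are made up for illustration.

```r
# Illustrative data: "Doe" has no book, "R Core" has no author record
authors_small <- data.frame(surname = c("Tukey", "Ripley", "Doe"),
                            nationality = c("US", "UK", "US"))
books_small <- data.frame(name = c("Tukey", "Ripley", "R Core"),
                          title = c("Exploratory Data Analysis",
                                    "Spatial Statistics",
                                    "An Introduction to R"))

# Keep every author, even without a book (title becomes NA)
left <- merge(authors_small, books_small,
              by.x = "surname", by.y = "name", all.x = TRUE)

# Keep every book, even without an author record (nationality becomes NA)
right <- merge(authors_small, books_small,
               by.x = "surname", by.y = "name", all.y = TRUE)

print(left)
print(right)
```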
You can merge on a key composed of 2+ columns if they’re named the same in both data frames. Here are two new data frames:
> d1 <- data.frame(firstname=c("Mary", "Beth", "Beth"),
+ lastname=c("Staples", "Wrench", "Hammer"),
+ income=c(111230,27200,83000))
> d1
firstname lastname income
1 Mary Staples 111230
2 Beth Wrench 27200
3 Beth Hammer 83000
> d2 <- data.frame(firstname=c("Mary", "Beth", "Beth"),
+ lastname=c("Staples", "Wrench", "Hammer"),
+ birthdate=c(as.Date("1982-01-05"), as.Date("1960-10-25"), as.Date("1990-11-02")),
+ eyecolor=c("blue", "green", "brown"))
> d2
firstname lastname birthdate eyecolor
1 Mary Staples 1982-01-05 blue
2 Beth Wrench 1960-10-25 green
3 Beth Hammer 1990-11-02 brown
> merge(d1, d2, by=c("firstname", "lastname"))
firstname lastname income birthdate eyecolor
1 Beth Wrench 27200 1960-10-25 green
2 Beth Hammer 83000 1990-11-02 brown
3 Mary Staples 111230 1982-01-05 blue
String operations
library(stringr)
str_sub(string, start, end)
str_replace(string, pattern, replacement)
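A short sketch of these two functions on a pair of example strings. Both are vectorized, like most R functions. str_replace changes only the first match in each string; str_replace_all, also from stringr, is included for contrast. The sample strings are made up.

```r
library(stringr)

s <- c("foo bar foo", "hello world")

first_three <- str_sub(s, 1, 3)     # characters 1 to 3: "foo" "hel"
last_three  <- str_sub(s, -3, -1)   # negative positions count from the end: "foo" "rld"

first_only <- str_replace(s, "o", "0")       # "f0o bar foo" "hell0 world"
every_one  <- str_replace_all(s, "o", "0")   # "f00 bar f00" "hell0 w0rld"

print(first_three)
print(last_three)
print(first_only)
print(every_one)
```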