Data Structures and Subsetting
Aim of this worksheet
After completing this worksheet, you should have a basic familiarity with the most useful types of data structures in R (lists, data frames, and matrices). You should also be able to subset any of these data structures using the common subsetting operators ([, [[, and $).
You may find the chapters on data structures and subsetting from Hadley Wickham’s Advanced R book to be helpful.
Subsetting vectors
In the previous worksheet you learned about vectors. Vectors hold a one-dimensional set of homogeneous values. By homogeneous we mean that the values all have to be the same type (integer, numeric, logical, and so on); you can’t have both, say, numeric and logical values in the same vector. By one-dimensional we mean that each element is accessed by a single index. For example, R includes a character vector of the letters in the English alphabet, helpfully called letters.
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q"
## [18] "r" "s" "t" "u" "v" "w" "x" "y" "z"
The letters object is a character vector because every element inside of it is a character string.
class(letters)
## [1] "character"
And it has twenty-six elements:
length(letters)
## [1] 26
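To see what "homogeneous" means in practice, the sketch below mixes types inside c(). The variable names are made up for illustration. R does not raise an error here; it converts every element to a single common type.

```r
# Hypothetical example vectors (not part of the worksheet data).
# Logical values combined with numbers are converted to numbers.
mixed_num_lgl <- c(1.5, TRUE, FALSE)
class(mixed_num_lgl)
## [1] "numeric"

# Adding a single string converts every element to character.
mixed_all <- c(1, TRUE, "a")
class(mixed_all)
## [1] "character"
```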
We can subset a vector using the [ operator (actually, it’s a function). For instance, we can get the first element like this:
letters[1]
## [1] "a"
- How would you get the twenty-fifth element?
We can also get a range of values. Notice that the : operator gives us all the numbers from one to five:
1:5
## [1] 1 2 3 4 5
So, we can get the first five letters like this:
letters[1:5]
## [1] "a" "b" "c" "d" "e"
- Can you get the tenth through twelfth letters?
You can get an arbitrary subset by creating a numeric (or integer) vector. Here we get the first, tenth, and twelfth letters.
letters[c(1, 10, 12)]
## [1] "a" "j" "l"
We can also do this by creating a variable and using it to do the subsetting.
what_i_want <- c(1, 3, 5, 7)
letters[what_i_want]
## [1] "a" "c" "e" "g"
- Create a variable with the even numbers and then subset the letters variable to get the even letters (e.g., the second, fourth, etc.).
- Bonus: can you use the seq() function (look it up with ?seq) to get the even letters in a more clever way?
In addition to values, vectors can also have names. For example, let’s create a variable with the numbers 1 to 5, then give those values some names.
myvar <- 1:5
names(myvar) <- letters[1:5]
myvar
## a b c d e
## 1 2 3 4 5
Now we can also subset the vector based on the names:
myvar["c"]
## c
## 3
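Name-based subsetting also accepts a character vector of several names, just as numeric subsetting accepts several positions. A minimal sketch that rebuilds myvar so it runs on its own; first_and_last is a name invented for this example.

```r
myvar <- 1:5
names(myvar) <- letters[1:5]

# Ask for two elements by name; the names are kept in the result.
first_and_last <- myvar[c("a", "e")]
first_and_last
## a e
## 1 5
```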
- Below is a vector of rankings for songs. Give the numeric vector names, which should be the titles of songs. Finally, subset the vector by the title of one of the songs to retrieve its ranking.
song_rankings <- c(10, 8.4, 6, 8.2, 4)
Matrices
Vectors are one-dimensional and homogeneous. Matrices are two-dimensional and homogeneous, so they hold a single kind of value, but they have rows and columns as well. A matrix can be used for all kinds of problems in digital history. For now, let’s imagine we have four cities, A, B, C, and D, and have measured the distances between them. For instance, the distance from A to B is 2. We can represent those distances as a matrix with 4 columns and 4 rows, where the names of the rows and columns are the cities.
city_distances <- matrix(c(0, 2, 8, 3, 2, 0, 6, 1, 8, 6, 0, 4, 3, 1, 4, 0),
nrow = 4, ncol = 4)
rownames(city_distances) <- LETTERS[1:4]
colnames(city_distances) <- LETTERS[1:4]
city_distances
## A B C D
## A 0 2 8 3
## B 2 0 6 1
## C 8 6 0 4
## D 3 1 4 0
A matrix can be subsetted in the same way that a vector can be subsetted. (Because it is a vector—just a vector with two dimensions.) For instance, we can get the third element of the matrix.
city_distances[3]
## [1] 8
- Now get the fifth element of the matrix.
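Single-index subsetting works on a matrix because R stores the values column by column. Since city_distances is symmetric, that ordering is hard to see there, so this sketch uses a small, made-up matrix m whose values equal their storage positions.

```r
# Hypothetical 2 x 3 matrix; matrix() fills columns first by default.
m <- matrix(1:6, nrow = 2, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# The third stored value sits in row 1, column 2.
m[3]
## [1] 3
m[1, 2]
## [1] 3
```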
But matrices are more useful when we subset them by row and column. For instance, here is the value contained in the cell for the first row and third column.
city_distances[1, 3]
## [1] 8
- Now get the value for the third row and the first column.
- What cities are we getting the distances for when we look for the third row and first column?
If a matrix has row and column names, we can subset the matrix by those names. For instance, here is the distance between cities B and D.
city_distances["B", "D"]
## [1] 1
- What is the distance between cities D and C?
- What is the distance between city A and cities B, C, and D? (Hint: think back to the kinds of subsetting that we did with vectors.)
Data frames
The most useful data structure in R is the data frame. Think of a data frame like a spreadsheet that holds any kind of tabular data. It is two-dimensional, like a matrix; unlike a matrix, it is heterogeneous in that the columns can hold different kinds of data (though each individual column is still a homogeneous vector). While other languages have add-on libraries that allow for data structures like this, in R data frames are a first-class citizen.
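To make the spreadsheet analogy concrete, here is a small data frame built by hand with base R's data.frame(). This is only a sketch: the ships data frame and all of its values are invented for illustration.

```r
# Invented example data; each column is a vector of a different type.
ships <- data.frame(
  name = c("Ship A", "Ship B", "Ship C"),
  guns = c(44L, 38L, 32L),
  in_service = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Check the type of each column.
sapply(ships, class)
##        name        guns  in_service
## "character"   "integer"   "logical"
```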
Let’s get a data frame from the historydata package. (If we also load dplyr, we will get nicer printing.)
library(historydata)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
naval_promotions
## Source: local data frame [5,705 x 5]
##
## id name generation rank date
## (int) (chr) (int) (fctr) (chr)
## 1 1 Abbot, Joel 3 midshipman 1812-06-18
## 2 2 Abbot, Trevett 4 midshipman 1848-10-13
## 3 3 Abbott, Isaac 4 midshipman 1820-05-10
## 4 4 Abbott, J. Francis 4 midshipman 1837-12-27
## 5 5 Abbott, James W. 4 midshipman 1822-05-01
## 6 6 Abbott, Thomas C. 3 midshipman 1814-12-06
## 7 7 Abbott, Walter 3 midshipman 1812-01-01
## 8 8 Abbott, William A. 4 midshipman 1848-10-13
## 9 9 Abercrombie, Alexander R. 4 midshipman 1841-10-19
## 10 10 Abercrombie, J.B. 4 midshipman 1817-01-01
## .. ... ... ... ... ...
We can use str() to get a different look at the data.
str(naval_promotions)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5705 obs. of 5 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ name : chr "Abbot, Joel" "Abbot, Trevett" "Abbott, Isaac" "Abbott, J. Francis" ...
## $ generation: int 3 4 4 4 4 3 3 4 4 4 ...
## $ rank : Factor w/ 5 levels "midshipman","lieutenant",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ date : chr "1812-06-18" "1848-10-13" "1820-05-10" "1837-12-27" ...
- How many rows and columns does the data frame have? What are the different types of vectors contained in it?
We can use the [ subset function to get access to rows and columns, just like we did with matrices. For instance, here we ask for just the first row and the column named "name":
naval_promotions[1, "name"]
## Source: local data frame [1 x 1]
##
## name
## (chr)
## 1 Abbot, Joel
We can also ask for the entire first row (note the comma):
naval_promotions[1, ]
## Source: local data frame [1 x 5]
##
## id name generation rank date
## (int) (chr) (int) (fctr) (chr)
## 1 1 Abbot, Joel 3 midshipman 1812-06-18
- What is the name of the person in the tenth row? What was the date he became that rank?
Because data frames are organized by column, it is possible to extract an entire column using the $ function. Here, let’s get just the names of the people. (We’ll limit it using head() so we don’t print out all of them.)
head(naval_promotions$name, 10)
## [1] "Abbot, Joel" "Abbot, Trevett"
## [3] "Abbott, Isaac" "Abbott, J. Francis"
## [5] "Abbott, James W." "Abbott, Thomas C."
## [7] "Abbott, Walter" "Abbott, William A."
## [9] "Abercrombie, Alexander R." "Abercrombie, J.B."
The function unique() gives you, well, all the unique values in a vector. For instance:
unique(c(1, 1, 1, 2, 3))
## [1] 1 2 3
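Because unique() returns a vector, it can be wrapped in length() to count how many distinct values there are. A minimal sketch on a made-up vector (ranks_seen is invented for this example):

```r
# Invented values with repeats.
ranks_seen <- c("midshipman", "lieutenant", "midshipman", "captain", "lieutenant")

unique(ranks_seen)
## [1] "midshipman" "lieutenant" "captain"

# Number of distinct values versus number of elements.
length(unique(ranks_seen))
## [1] 3
length(ranks_seen)
## [1] 5
```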
- What are all the different ranks contained in the naval_promotions dataset? (Hint: use unique().)
- How many different people are contained in the naval_promotions dataset?
- Why is that number different than the number of rows?
- What are the earliest and latest dates in the dataset? (Hint: you may find some combination of sort(), head(), tail(), range(), and as.Date() to be useful. Don’t forget about na.rm = TRUE as appropriate.)
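The date column is stored as character strings, so date functions need a conversion first. The sketch below uses three invented dates and one missing value to show how as.Date(), range(), and na.rm = TRUE fit together. It is an illustration on toy data, not the answer for naval_promotions.

```r
# Invented dates as character strings, with one missing value.
date_strings <- c("1812-06-18", NA, "1798-03-09", "1848-10-13")
dates <- as.Date(date_strings)
class(dates)
## [1] "Date"

# Without na.rm = TRUE the missing value propagates.
range(dates)
## [1] NA NA

# With na.rm = TRUE we get the earliest and latest dates.
range(dates, na.rm = TRUE)
## [1] "1798-03-09" "1848-10-13"
```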
We will work much, much more with data frames.
Lists
Another very useful kind of data structure is the list. A list can hold values of any type, including vectors, data frames, and even other lists. For instance, we can create a list that holds several different kinds of information:
our_class <- list(
title = "Intro to R",
year = 2016,
books = c("Basics of R", "Get Awesome at R"),
students = c("Adam", "Betsy", "Cynthia", "David")
)
str(our_class)
## List of 4
## $ title : chr "Intro to R"
## $ year : num 2016
## $ books : chr [1:2] "Basics of R" "Get Awesome at R"
## $ students: chr [1:4] "Adam" "Betsy" "Cynthia" "David"
We can get just part of the list using the [ function that we’ve become used to. For instance, to get just the title:
our_class["title"]
## $title
## [1] "Intro to R"
But notice that the returned value is a list, not a character vector:
is.list(our_class["title"])
## [1] TRUE
is.character(our_class["title"])
## [1] FALSE
R has another subset operator, [[. The single bracket ([) gives us what we asked for inside a list; the double bracket ([[) simplifies the list to give us the vector (or whatever) we asked for.
our_class[["title"]]
## [1] "Intro to R"
is.list(our_class[["title"]])
## [1] FALSE
is.character(our_class[["title"]])
## [1] TRUE
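Because [[ returns the element itself, ordinary vector subsetting can be applied directly to the result. A short sketch, repeating the our_class definition so it runs on its own:

```r
our_class <- list(
  title = "Intro to R",
  year = 2016,
  books = c("Basics of R", "Get Awesome at R"),
  students = c("Adam", "Betsy", "Cynthia", "David")
)

# [[ returns the character vector of students; [2] then picks the second one.
our_class[["students"]][2]
## [1] "Betsy"

# [[ also accepts a position instead of a name.
our_class[[3]]
## [1] "Basics of R"      "Get Awesome at R"
```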
- Using the our_class list, get just the year (as a list). Get it as a numeric vector.
- Using the our_class list, get the class title, the book list, and the year (as a list). (Hint: remember the different kinds of subsetting we did with [.)
R also lets us use the $ operator we used to get columns of a data frame.
our_class$title
## [1] "Intro to R"
- Get the students vector from our_class.
- Is the $ equivalent to [ or [[?
- Create a list that models a historic event. What parts of the event are worth keeping track of? Also extract certain parts of that list.
- You can use $, [, and [[ on both data frames and lists. What is the relationship between a data frame and a list? (Hint: use is.list() and is.data.frame() on a list and a data frame.)
Subsetting with logical vectors
We have done subsetting above using numeric and character vectors. We can also do subsetting using logical vectors. Let’s create a sample dataset of the heights of soldiers.
set.seed(3929)
heights <- rnorm(20, mean = 69)
names(heights) <- letters[1:20]
heights
## a b c d e f g h
## 67.82345 69.04801 69.59076 68.86947 68.54834 68.13089 68.82323 67.90888
## i j k l m n o p
## 71.88773 66.51703 68.58429 68.23702 70.36570 68.33046 69.48776 69.86949
## q r s t
## 69.78694 68.90278 68.82545 69.47962
We can find all the soldiers who are taller than average like this. First let’s compare all the heights to the mean height.
heights > mean(heights)
## a b c d e f g h i j k l
## FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## m n o p q r s t
## TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
Notice that we get a logical vector as a result. We can use that within the [ operator to get just the soldiers who are taller than average.
heights[heights > mean(heights)]
## b c i m o p q t
## 69.04801 69.59076 71.88773 70.36570 69.48776 69.86949 69.78694 69.47962
- Which soldiers are taller than 70 inches?
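A logical vector is also easy to count and to combine. The sketch below is an illustration on a small invented vector (toy_heights is hypothetical, not the heights data above): sum() counts the TRUE values, and & keeps only the elements that satisfy both conditions.

```r
# Invented heights in inches, named for easy reading.
toy_heights <- c(a = 67.8, b = 69.0, c = 71.9, d = 66.5)

# TRUE counts as 1, so sum() gives the number of matches.
sum(toy_heights > 68)
## [1] 2

# Combine two comparisons with & to select a band of values.
toy_heights[toy_heights > 68 & toy_heights < 70]
##  b
## 69
```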
This kind of subsetting also works for data frames. Here we get all the officers from the first generation. (Notice the comma.)
first_gen <- naval_promotions[naval_promotions$generation == 1, ]
head(first_gen, 10)
## Source: local data frame [10 x 5]
##
## id name generation rank date
## (int) (chr) (int) (fctr) (chr)
## 1 76 Archer, John 1 lieutenant 1798-11-08
## 2 117 Bainbridge, William 1 lieutenant 1798-08-03
## 3 127 Baker, Thomas 1 lieutenant 1798-05-25
## 4 141 Ballard, John 1 lieutenant 1798-10-02
## 5 168 Barron, James 1 lieutenant 1798-03-09
## 6 180 Barton, Jeremiah 1 lieutenant 1798-06-08
## 7 259 Blair, George 1 lieutenant 1799-03-13
## 8 434 Burns, James 1 lieutenant 1798-10-29
## 9 453 Byrne, Gerald 1 lieutenant 1799-06-17
## 10 474 Campbell, James 1 lieutenant 1799-09-20
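The same pattern can be tried on any data frame, and the position after the comma can pick columns at the same time. A minimal sketch with an invented data frame (officers and is_first_gen are hypothetical names, unrelated to naval_promotions):

```r
# Invented example data built with base R.
officers <- data.frame(
  name = c("Officer A", "Officer B", "Officer C", "Officer D"),
  generation = c(1L, 2L, 1L, 3L),
  stringsAsFactors = FALSE
)

# The comparison yields one TRUE/FALSE per row.
is_first_gen <- officers$generation == 1
is_first_gen
## [1]  TRUE FALSE  TRUE FALSE

# Rows go before the comma, columns after it.
officers[is_first_gen, "name"]
## [1] "Officer A" "Officer C"
```

Note that with a plain data.frame, selecting a single column this way returns a vector, whereas the tbl_df printed earlier in this worksheet kept a one-column data frame.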
- Can you get just the promotions to captain?
- Can you get the promotions from 1800? (Hint: you will first have to make the date column a date object.)