More data manipulation with dplyr and tidy

Home Getting Familiar with R Data Structures Functions Basics of ggplot2 Data Manipulation More Data Manipulation

Aims of this worksheet
Data joining with two table verbs (left_join() et al.)
- Data lookups
Data reshaping (spread() and gather())
Window functions

Aims of this worksheet

In an earlier worksheet, you learned the basic data manipulation verbs from the dplyr package: select(), filter(), mutate(), arrange(), group_by(), and summarize(). In this worksheet you will learn additional data verbs from the dplyr and tidyr packages. These data verbs relate to window functions (lead() and lag()), data table joins (left_join() et al.), and data reshaping (spread() and gather())

To begin, we will load the necessary packages, as well as the Methodist data

library(dplyr)
library(tidyr)

## Warning: package 'tidyr' was built under R version 3.2.3

library(readr)
library(historydata)
library(ggplot2)
methodists <- read_csv("http://lincolnmullen.com/projects/worksheets/data/methodists.csv")

Data joining with two table verbs (`left_join()` et al.)

Data lookups

It is often the case that we want to use some variable in our data to create a new variable. Consider the Methodist data for the year 1800. Perhaps we are interested in the racial composition of the churches. Do they tend to be all white and all black, or do some churches have both white and black members in varying proportions? The simplest way to get a look at that question is to create a scatter plot of the figures for white and black membership.

methodists_1800 <- methodists %>% 
  filter(minutes_year == 1800) %>% 
  select(meeting, state, members_white, members_colored)

ggplot(methodists_1800, aes(x = members_white, y = members_colored)) +
  geom_point(shape = 1)

That scatterplot is interesting as far as it goes, but we might reasonably suspect that the racial composition of methodist meetings varies by region. We could use the state variable to facet the plot by state. However, this has two problems. There are 20 states represented in that year.

methodists_1800$state %>% unique() %>% length()

## [1] 20

Our faceted plot would have 20 panels, which is too many. But more important, by looking at individual states we might be getting too fine grained a look at the data. We have good reason to think that it is regions that matter more than states.

It is easy enough to describe what we would do to translate states into a new column with regions. We would look at each state name and assign it to a region. Connecticut would be in the Northeast, New York would be in the Mid-Atlantic, and so on. So far when we have created new columns in a data frame we have done so with mutate().¹ Another way to think of this problem, though, is to think of looking up the state names in a table where they associated with regions. We can create such a data frame with the code below. In many cases, though, it would make more sense to create a CSV file with the data and read it in as a data frame.

regions <- data_frame(
  states = c("Connecticut", "Delaware", "Georgia", "Kentucky", "Maine", 
             "Maryland", "Massachusetts", "Mississippi", "New Hampshire", 
             "New Jersey", "New York", "North Carolina",
             "Northwestern Territory", "Pennsylvania", "Rhode Island",
             "South Carolina", "Tennessee", "Upper Canada", "Vermont",
             "Virginia"),
  region = c("Northeast", "Atlantic South", "Atlantic South", "West",
             "Northeast", "Atlantic South", "Northeast", "Deep South", 
             "Northeast", "Mid-Atlantic", "Mid-Atlantic", "Atlantic South",
             "West", "Mid-Atlantic", "Northeast", "Atlantic South", "West",
             "Canada", "Northeast", "Atlantic South")
)

And now we can inspect the table.

regions

## Source: local data frame [20 x 2]
## 
##                    states         region
##                     (chr)          (chr)
## 1             Connecticut      Northeast
## 2                Delaware Atlantic South
## 3                 Georgia Atlantic South
## 4                Kentucky           West
## 5                   Maine      Northeast
## 6                Maryland Atlantic South
## 7           Massachusetts      Northeast
## 8             Mississippi     Deep South
## 9           New Hampshire      Northeast
## 10             New Jersey   Mid-Atlantic
## 11               New York   Mid-Atlantic
## 12         North Carolina Atlantic South
## 13 Northwestern Territory           West
## 14           Pennsylvania   Mid-Atlantic
## 15           Rhode Island      Northeast
## 16         South Carolina Atlantic South
## 17              Tennessee           West
## 18           Upper Canada         Canada
## 19                Vermont      Northeast
## 20               Virginia Atlantic South

We can do a look up where we take the state column in the methodists_1800 data frame and associate it with the states column in our regions data frame. The result will be a new column region. Notice how we use the by = argument to specify which column in the left hand table matches which column in the right hand table.

methodists_region <- methodists_1800 %>% 
  left_join(regions, by = c("state" = "states"))

methodists_region

## Source: local data frame [169 x 5]
## 
##        meeting          state members_white members_colored         region
##          (chr)          (chr)         (int)           (int)          (chr)
## 1      Augusta        Georgia            61               9 Atlantic South
## 2        Burke        Georgia           297              36 Atlantic South
## 3     Richmond        Georgia           548             115 Atlantic South
## 4   Washington        Georgia           497              92 Atlantic South
## 5  Broad River South Carolina           604              62 Atlantic South
## 6   Bush River South Carolina           328              31 Atlantic South
## 7   Charleston South Carolina            60             440 Atlantic South
## 8     Cherokee South Carolina            79               0 Atlantic South
## 9       Edisto South Carolina           572             126 Atlantic South
## 10  Georgetown South Carolina            10             223 Atlantic South
## ..         ...            ...           ...             ...            ...

Then we can plot the results. As we suspected, there is a huge regional variation.

ggplot(methodists_region, aes(x = members_white, y = members_colored)) +
  geom_point(shape = 1) +
  facet_wrap(~ region)

Beginning in 1802, the Methodist minutes no longer kept track of the state in which a meeting was held. Instead it classified them into different conferences. For the years 1802 and after, create a look up table assigning the different conferences to regions of your choice, and join that table back to the Methodists data.
Can you summarize the racial composition of the different regions by year (i.e., a region had \(x\) percent white and black members for a given year) and create a plot of the changing racial composition in each region over time?
In the europop package there are two data frames, europop with the historical populations of European cities, and city_coords which has the latitudes and longitudes of those cities. Load that package and join the two tables together. Can you get the populations of cities north of 48° of latitude?
In the historydata package there are two tables, judges_people and judges_appointments. Join them together. What are the names of black judges who were appointed to the Supreme Court?
What courts did those justices serve on before the Supreme Court?

Data reshaping (`spread()` and `gather()`)

It can be helpful to think of tabular data as coming in two forms: wide data, and long data. Let’s load in a table of data. This data contains total membership figures for the Virginia conference of the Methodist Episcopal Church for the years 1812 to 1830.

va_methodists_wide <- read_csv("http://lincolnmullen.com/projects/worksheets/data/va-methodists-wide.csv")
va_methodists_wide

## Source: local data frame [10 x 21]
## 
##    conference    district  1812  1813  1814  1815  1816  1817  1818  1819
##         (chr)       (chr) (int) (int) (int) (int) (int) (int) (int) (int)
## 1    Virginia James River  5348  4691  4520  4209  4118  3888  3713  3580
## 2    Virginia    Meherren  4882  4486  4771  4687  4702    NA    NA    NA
## 3    Virginia    Meherrin    NA    NA    NA    NA    NA  4435  3964  3860
## 4    Virginia       Neuse    NA    NA  3474  3475  3448  2702  3340  4667
## 5    Virginia     Newbern  3511  3558    NA    NA    NA    NA    NA    NA
## 6    Virginia     Norfolk  4686  6196  6127  6001  5661  6495  6471    NA
## 7    Virginia     Raleigh  3822  4018    NA    NA    NA    NA    NA    NA
## 8    Virginia     Roanoke    NA    NA    NA    NA  3049    NA  1507    NA
## 9    Virginia   Tar River    NA    NA  3834  3466    NA    NA    NA    NA
## 10   Virginia      Yadkin  3174  3216  3528  3323  3374  3323  4689  4547
## Variables not shown: 1820 (int), 1821 (int), 1822 (int), 1823 (int), 1824
##   (int), 1825 (int), 1826 (int), 1827 (int), 1828 (int), 1829 (int), 1830
##   (int)

The first thing we can notice about this data frame is that it is very wide because it has a column for each of the years. The data is also suitable for reading because it like a table in a publication. We can read from left to right and see when certain districts begin and end and get the values for each year. The difficulties of computing on or plotting the data will also become quickly apparent. How would you make a plot of the change over time in the number of members in each district? Or how would you filter by year, or summarize by year? For that matter, what do the numbers in the table represent, since they are not given an explicit variable name?

The problem with the table is that it is not tidy data, because the variables are not in columns and observations in rows. One of the variables is the year, but its values are in the column headers. And another of the variables is total membership, but its values are spread across rows and columns and it is not explicitly named.

The gather() function from the tidyr package lets us turn wide data into long data. We need to tell the function two kinds of information. First we need to tell it the name of the column to create from the column headers and the name of the implicit variable in the rows. In the example below, we create to new columns minutes_year and total_membership. Then we also have to tell the function if there are any columns which should remain unchanged. In this case, the conference and district variables should remain the same, so we remove them from the gathering using the same syntax as the select() function.

va_methodists_wide %>% 
  gather(minutes_year, total_membership, -conference, -district)

## Source: local data frame [190 x 4]
## 
##    conference    district minutes_year total_membership
##         (chr)       (chr)        (chr)            (int)
## 1    Virginia James River         1812             5348
## 2    Virginia    Meherren         1812             4882
## 3    Virginia    Meherrin         1812               NA
## 4    Virginia       Neuse         1812               NA
## 5    Virginia     Newbern         1812             3511
## 6    Virginia     Norfolk         1812             4686
## 7    Virginia     Raleigh         1812             3822
## 8    Virginia     Roanoke         1812               NA
## 9    Virginia   Tar River         1812               NA
## 10   Virginia      Yadkin         1812             3174
## ..        ...         ...          ...              ...

We can see the results above. There are two ways that this result is not quite what we want. Because the years were column headers they are treated as character vectors rather than integers. We can manually convert them in a later step, but we can also let gather() do the right thing with the convert = argument. Then we have a lot of NA values which were explicit in the wide table but which can be removed from the long table with na.rm =.

va_methodists_long <- va_methodists_wide %>% 
  gather(minutes_year, total_membership, -conference, -district, 
         convert = TRUE, na.rm = TRUE)

va_methodists_long

## Source: local data frame [100 x 4]
## 
##    conference    district minutes_year total_membership
##         (chr)       (chr)        (int)            (int)
## 1    Virginia James River         1812             5348
## 2    Virginia    Meherren         1812             4882
## 3    Virginia     Newbern         1812             3511
## 4    Virginia     Norfolk         1812             4686
## 5    Virginia     Raleigh         1812             3822
## 6    Virginia      Yadkin         1812             3174
## 7    Virginia James River         1813             4691
## 8    Virginia    Meherren         1813             4486
## 9    Virginia     Newbern         1813             3558
## 10   Virginia     Norfolk         1813             6196
## ..        ...         ...          ...              ...

Notice that now we can use the data in ggplot2 without any problem.

ggplot(va_methodists_long, 
       aes(x = minutes_year, y = total_membership, color = district)) +
  geom_line() +
  ggtitle("Membership of districts in the Virginia conference")

The inverse operation of gather() is spread(). With spread() we specify the name of the column which should become the new column headers (in this case minutes_year), and then the name of the column to fill in underneath those new column headers (in this case, total_membership). We can see the results below.

va_methodists_wide2 <- va_methodists_long %>% 
  spread(minutes_year, total_membership)

va_methodists_wide2

## Source: local data frame [10 x 21]
## 
##    conference    district  1812  1813  1814  1815  1816  1817  1818  1819
##         (chr)       (chr) (int) (int) (int) (int) (int) (int) (int) (int)
## 1    Virginia James River  5348  4691  4520  4209  4118  3888  3713  3580
## 2    Virginia    Meherren  4882  4486  4771  4687  4702    NA    NA    NA
## 3    Virginia    Meherrin    NA    NA    NA    NA    NA  4435  3964  3860
## 4    Virginia       Neuse    NA    NA  3474  3475  3448  2702  3340  4667
## 5    Virginia     Newbern  3511  3558    NA    NA    NA    NA    NA    NA
## 6    Virginia     Norfolk  4686  6196  6127  6001  5661  6495  6471    NA
## 7    Virginia     Raleigh  3822  4018    NA    NA    NA    NA    NA    NA
## 8    Virginia     Roanoke    NA    NA    NA    NA  3049    NA  1507    NA
## 9    Virginia   Tar River    NA    NA  3834  3466    NA    NA    NA    NA
## 10   Virginia      Yadkin  3174  3216  3528  3323  3374  3323  4689  4547
## Variables not shown: 1820 (int), 1821 (int), 1822 (int), 1823 (int), 1824
##   (int), 1825 (int), 1826 (int), 1827 (int), 1828 (int), 1829 (int), 1830
##   (int)

Just by looking at the data we can see that we got back to where we started, but we can also verify that programmatically.

identical(va_methodists_wide, va_methodists_wide2)

## [1] TRUE

Turning long data into wide is often useful when you want to create a tabular representation of data. (And once you have a data frame that can be a table, the knitr::kable() function is quite nice.) And some algorithms, such as clustering algorithms, expect wide data rather than tidy data.

For the exercise, we will use summary statistics of the number of white and black members in the Methodists by year.

methodists_by_year_race <- methodists %>% 
  filter(minutes_year >= 1786) %>% 
  group_by(minutes_year) %>% 
  summarize(white = sum(members_white, na.rm = TRUE),
            black = sum(members_colored, na.rm = TRUE),
            indian = sum(as.integer(members_indian), na.rm = TRUE))
methodists_by_year_race

## Source: local data frame [45 x 4]
## 
##    minutes_year white black indian
##           (int) (int) (int)  (int)
## 1          1786 18291  2890      0
## 2          1787 21949  3883      0
## 3          1788 30557  7991      0
## 4          1789 34425  8840      0
## 5          1790 45983 11682      0
## 6          1791 50580 13098      0
## 7          1792 52079 13871      0
## 8          1793 51486 14420      0
## 9          1794 52794 13906      0
## 10         1795 48121 12171      0
## ..          ...   ...   ...    ...

The data in methodists_by_year_race could be tidier still. While white, black, and indian are variables, it is perhaps better to think of them as two different variables. One variable would be race, containing the racial descriptions that the Methodists used, and another would be members, containing the number of members. Using the gather() function, create that data frame.
Use the data frame you created in the previous step to create a line plot of membership over time, mapping the race column to the color aesthetic.
Now use that newly tidied data frame to create a wide data frame, where the years are the column headers and the racial descriptions are the rows.
Now use the same tidied data to create a wide data frame where the racial descriptions are column headers and the years are rows.

Window functions

There are a number of different kinds of window functions in R. We are going to look at two window functions, lead() and lag() which help us look for change over time. (For a fuller explanation of window functions, see the related dplyr vignette.)

To understand what a window function does, it is helpful to compare it to a transformation function and an aggregation function. Suppose we have a vector with five numeric values.

original <- c(1.1, 2.2, 3.3, 4.4, 5.5)

A transformation function changes each element in the vector and returns a new value for each. In the example below, we round each element in the vector. We have a different result, but it still has five elements.

round(original, 0)

## [1] 1 2 3 4 6

In an aggregation function, we pass in a vector of numbers and get back a single value. In this case, we get the sum of the numbers.

sum(original)

## [1] 16.5

Aggregation functions work well with summarize(); transformation functions works well with mutate().

A window function gives back a vector of numbers, but a vector which has fewer useable elements than the original. It is like sliding a window over the vector. Consider the case below.

lead(original)

## [1] 2.2 3.3 4.4 5.5  NA

lag(original)

## [1]  NA 1.1 2.2 3.3 4.4

The function lead() returns the next element of a vector in place of the original value. At the end of the vector we get an NA because there are no more elements left. The function lag() does the opposite, giving us the previous element in the vector. In that case, the first element of the returned vector is NA.

The lead() and lag() functions are useful for comparing one value to its previous or successor value. Suppose, for instance, that we have a vector of membership figures for each year. We can calculate the number of new members each year by subtracting the current value from its previous value.

membership <- c(100, 150, 250, 400, 600)
membership - lag(membership)

## [1]  NA  50 100 150 200

Now that we understand those basics, we can apply that to the Methodist annual minutes data that we worked with in a previous lesson. Let’s start by getting just the membership data from Fairfax, Virginia. We will also calculate the members_general value for the years it is missing, and select only the columns we absolutely need.

fairfax <- methodists %>% 
  filter(meeting == "Fairfax") %>%
  mutate(members_general = ifelse(is.na(members_general),
                                  members_white + members_colored,
                                  members_general)) %>% 
  select(minutes_year, meeting, starts_with("members"), -members_indian)
fairfax

## Source: local data frame [54 x 5]
## 
##    minutes_year meeting members_general members_white members_colored
##           (int)   (chr)           (int)         (int)           (int)
## 1          1775 Fairfax              30            NA              NA
## 2          1776 Fairfax             350            NA              NA
## 3          1777 Fairfax             330            NA              NA
## 4          1779 Fairfax             309            NA              NA
## 5          1780 Fairfax             361            NA              NA
## 6          1781 Fairfax             301            NA              NA
## 7          1782 Fairfax             362            NA              NA
## 8          1783 Fairfax             310            NA              NA
## 9          1784 Fairfax             317            NA              NA
## 10         1786 Fairfax             260           260               0
## ..          ...     ...             ...           ...             ...

Now that we have the data, we can add a column for the number of new members added each year.

fairfax %>% 
  mutate(growth = members_general - lag(members_general))

## Source: local data frame [54 x 6]
## 
##    minutes_year meeting members_general members_white members_colored
##           (int)   (chr)           (int)         (int)           (int)
## 1          1775 Fairfax              30            NA              NA
## 2          1776 Fairfax             350            NA              NA
## 3          1777 Fairfax             330            NA              NA
## 4          1779 Fairfax             309            NA              NA
## 5          1780 Fairfax             361            NA              NA
## 6          1781 Fairfax             301            NA              NA
## 7          1782 Fairfax             362            NA              NA
## 8          1783 Fairfax             310            NA              NA
## 9          1784 Fairfax             317            NA              NA
## 10         1786 Fairfax             260           260               0
## ..          ...     ...             ...           ...             ...
## Variables not shown: growth (int)

In what years did the Methodists grow in Fairfax? In what years did they shrink? Can you plot the growth?
How might you use the growth figures to detect errors in the data? Keep in mind that they might have shunk or grown because of the way that they were counted.

Find the growth in the number of white and black members in Fairfax, and plot them on the same chart. What is the pattern?
Return back to the original methodists data. Beginning in 1802, the Methodists organized the data by conference, district, and meeting. For 1802 and following, calculate the growth for each conference. Which conferences were growing the most in absolute terms? Which were growing the most in relative terms (i.e., growth percentage)? Do you get a clearer picture by looking at the growth in districts? Feel free to plot the data if you wish and to add explanatory text.

And indeed, we could write a function that translates state names into regions.↩