library(tidyverse)
library(here) # for path management
library(rlang)
Functional Programming in R
Summer School for Women in Political Methodology
Course Description
This course offers an introduction to functional programming in R, with applications in political science. Through hands-on exercises, participants will learn how to streamline repetitive tasks, implement modular workflows, and document code for replicability. Prior experience with the fundamentals of R and RStudio, as well as data management in R, is recommended.
Setup
Before we get started, let’s make sure the following dependencies are installed/loaded for walking through the examples below.
Functions 101
Anatomy of a function
R functions are made up of three parts:
- Formals (a.k.a. inputs): the parameters or arguments defined by the function
- Body: the code that defines what the function does with its input(s)
- Environment: where the function stores variables and values it uses during execution
You can access the inputs of any function using formals() or args():
formals(tidytext::get_stopwords)
$language
[1] "en"
$source
[1] "snowball"
args(tidytext::get_stopwords)
function (language = "en", source = "snowball")
NULL
You can do the same with the body() or environment() of a function.
body(tidytext::get_stopwords)
{
rlang::check_installed("stopwords", "to use this function.")
tibble(word = stopwords::stopwords(language = language, source = source),
lexicon = source)
}
environment(tidytext::get_stopwords)
<environment: namespace:tidytext>
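The same three parts exist in any function you write yourself. A minimal sketch, using a made-up toy function (add_noise is our own example, not from tidytext):

```r
# A toy function: add normally distributed noise to a numeric vector
add_noise <- function(x, sd = 1) {
  x + rnorm(length(x), mean = 0, sd = sd)
}

formals(add_noise)     # the arguments: x, and sd with default 1
body(add_noise)        # the expression x + rnorm(length(x), mean = 0, sd = sd)
environment(add_noise) # the global environment, when defined interactively
```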
Discussion Q1 What do we learn about the get_stopwords function from accessing its inputs, body, and environment?
Writing functions
In practice, we consider the following when writing functions:
- Name: An intuitive identifier for what the function does
- Argument(s): A list of values passed onto the function
- Body: The code that runs every time a function is called
- Return value: The output that will be sent into the global environment once the function is finished running
You can follow these steps to write functions:
- Write a script that pertains to a specific task that will form the body of your function.
- Come up with a good name:
- Be explicit about exactly what the function does.
- Keep it concise.
- When in doubt, start with a verb that is meaningful to what the function does.
- Add argument(s) to the function’s input.
- Arguments should be based on the parts from the original script that will be removed/abstracted.
- Make sure you return the output of the function to the global environment.
- You can save this to the global environment by assigning the output of the function to an object.
Let’s write our first function! We’ll begin with a relatively simple example: generating synthetic data. This sort of function can be useful for developing research designs and forming pre-analysis plans.
We will first write a function that generates a vector of ages for individuals. First, starting with the script that will turn into the body of the function:
age <- rnorm(n = 25, mean = 40, sd = 15) %>%
  round() %>%
  pmax(0) %>%  # ensures values are not less than 0
  pmin(100)    # ensures that values are not greater than 100
Now let’s turn this into a function:
- Let's name it generate_age_vec.
- Following the code in our original script, our arguments will match those for the rnorm function:
  - n for the number of individuals in the vector
  - mean to establish the average age represented
  - sd for the standard deviation that will inform the distribution of the vector
generate_age_vec <- function(n_people, mean_age, sd_age){
  age <- rnorm(n = n_people, mean = mean_age, sd = sd_age) %>%
    round() %>%
    pmax(0) %>%
    pmin(100)
  return(age)
}
Note: Saving age to an object is not necessary for putting together this function, but assigning parts of the function's body to locally defined objects will be key to writing more complicated functions.
Now let’s try out our function! Let’s say we want to generate synthetic data for a study targeting university students:
generate_age_vec(n_people = 100, mean_age = 20, sd_age = 1)
[1] 20 17 20 19 20 20 21 20 19 21 22 22 21 20 20 20 19 21 21 22 18 20 21 21 18
[26] 21 18 20 17 21 21 20 20 20 18 22 20 20 20 20 19 21 19 21 21 20 20 20 18 19
[51] 19 18 19 19 20 19 20 22 19 20 20 20 20 19 19 20 19 20 20 21 19 17 19 20 20
[76] 21 21 19 21 21 19 20 19 20 19 19 20 20 20 21 20 20 21 20 21 20 20 19 21 21
Looks good! Let’s say we want the function to default to generating cohorts of synthetic information on 100 college-age individuals. We can add values to the arguments in our original function to do this.
generate_age_vec <- function(n_people = 100, mean_age = 20, sd_age = 1){
  age <- rnorm(n = n_people, mean = mean_age, sd = sd_age) %>%
    round() %>%
    pmax(0) %>%
    pmin(100)
  return(age)
}
We can then just call the function without specifying any arguments/values:
generate_age_vec()
[1] 20 21 20 20 19 18 20 18 21 19 20 19 19 19 19 21 20 22 18 20 20 20 21 19 20
[26] 22 21 19 20 18 19 21 17 20 19 22 20 19 20 20 18 20 18 20 21 22 19 20 19 23
[51] 20 22 20 21 21 21 20 19 20 20 19 21 20 21 20 21 20 20 20 20 20 19 19 20 18
[76] 18 20 21 19 21 19 20 20 20 21 19 21 20 20 20 22 20 18 19 20 20 19 21 22 20
And if we want to focus on a different age range/cohort, we can override these default values:
generate_age_vec(n_people = 25, mean_age = 35, sd_age = 15)
[1] 33 36 1 50 29 44 39 3 40 20 48 33 33 21 31 47 42 11 50 0 39 48 44 37 24
Building on this, let's say you want to create a new variable based on this vector that specifies which generation an individual belongs to:
- genX: 43-58 years old
- genY: 27-42 years old
- genZ: 11-26 years old
First, using the function for generating synthetic age data above, let’s create a vector of numbers to work with.
set.seed(77976)
age <- generate_age_vec(n_people = 50, mean_age = 40, sd_age = 10)
age
[1] 33 42 45 35 56 19 19 27 24 32 19 35 56 41 42 36 37 30 57 25 43 39 24 45 29
[26] 40 29 55 46 44 40 35 33 51 50 37 49 30 29 36 44 31 36 52 20 11 35 40 37 37
How would we do it for this specific age vector?
generation <- cut(
  age,
  breaks = c(10.5, 26.5, 42.5, 58.5),
  labels = c("genZ", "genY", "genX")
)
Exercise 1 Write a function that codes generations based on a vector of ages, then apply it to our age vector:
code_generation <- function(age){
  # ADD FUNCTION BODY HERE
}
code_generation(age)
[1] genY genY genX genY genX genZ genZ genY genZ genY genZ genY genX genY genY
[16] genY genY genY genX genZ genX genY genZ genX genY genY genY genX genX genX
[31] genY genY genY genX genX genY genX genY genY genY genX genY genY genX genZ
[46] genZ genY genY genY genY
Levels: genZ genY genX
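One possible solution, reusing the cut() call from above as the function body:

```r
# Code generations from a numeric vector of ages
code_generation <- function(age){
  cut(
    age,
    breaks = c(10.5, 26.5, 42.5, 58.5),
    labels = c("genZ", "genY", "genX")
  )
}
```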
Applying functions to data
For applied examples, we will use the Crowd Counting Consortium's U.S. Protest Event Data (Ulfelder 2025) as our basis for writing data frame functions. This dataset provides event data on contentious politics in the United States. Events of interest comprise "any type of activity that…is carried out with the explicit purpose of articulating a grievance against a [political] target, or expressing support of a [political] target" and meet the following inclusion criteria:
- Open to the public and free of charge
- Nonviolent: not primarily organized to cause direct harm to any persons
- Occur in the United States, including Puerto Rico, Guam, and the U.S. Virgin Islands
The data comprise three phases. We will be working with data from Phase 2 of the project, which covers events from 2021 to 2024.
If you were to use this data in your work IRL, writing functions that apply to this dataset would make a lot of sense since there are two other “waves” of event data that are similarly structured.
Let’s start by importing the dataset.
ccc_data <- read_csv(here("data", "ccc_compiled_20212024.csv"), show_col_types = FALSE)
Let’s take an initial peek at the data.
str(ccc_data)
spc_tbl_ [139,823 × 72] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ date : Date[1:139823], format: "2021-01-01" "2021-01-01" ...
$ locality : chr [1:139823] "Montgomery" "Tucson" "Lafayette" "Palo Alto" ...
$ state : chr [1:139823] "AL" "AZ" "CA" "CA" ...
$ location_detail : chr [1:139823] "statewide" "E Speedway Blvd and N Euclid Ave" "El Curtola Blvd & Highway 24" "El Camino Real and Embarcadero Rd" ...
$ online : num [1:139823] 0 0 0 0 0 0 0 0 0 0 ...
$ type : chr [1:139823] "strike; boycott" "vigil" "demonstration" "vigil" ...
$ title : chr [1:139823] NA NA NA NA ...
$ macroevent : chr [1:139823] NA NA NA NA ...
$ organizations : chr [1:139823] "Free Alabama Movement" "Women in Black" "Contra Costa County Patriots" "WILPF" ...
$ participants : chr [1:139823] "prisoners" NA NA NA ...
$ claims : chr [1:139823] "against prison labor, for safer conditions in Alabama prisons during coronavirus pandemic" "for ending Israel's occupation of Palestine, for Palestinian liberation" "for President Trump" "for peace, against war, for banning nuclear weapons" ...
$ claims_summary : chr [1:139823] "against prison labor;for safer conditions in Alabama prisons during coronavirus pandemic" "for ending Israel's occupation of Palestine;for Palestinian liberation" "for President Trump" "for peace;against war;for banning nuclear weapons" ...
$ claims_verbatim : chr [1:139823] NA NA NA NA ...
$ issue_tags_summary : chr [1:139823] "covid; criminal justice; labor" "foreign affairs" "presidency" "military" ...
$ issue_tags_verbatim : chr [1:139823] NA NA NA NA ...
$ issue_tags : chr [1:139823] "covid;criminal justice;labor" "foreign affairs" "presidency" "military" ...
$ valence : num [1:139823] 1 0 2 0 2 1 0 0 0 1 ...
$ size_text : chr [1:139823] NA NA NA NA ...
$ size_low : num [1:139823] NA NA NA NA 1000 NA NA NA 21 NA ...
$ size_high : num [1:139823] NA NA NA NA 1000 NA NA NA 21 NA ...
$ size_mean : num [1:139823] NA NA NA NA 1000 NA NA NA 21 NA ...
$ size_cat : num [1:139823] 0 0 0 0 3 0 0 0 1 0 ...
$ arrests : chr [1:139823] NA NA NA NA ...
$ arrests_any : num [1:139823] 0 0 0 0 0 0 0 0 0 0 ...
$ injuries_crowd : chr [1:139823] NA NA NA NA ...
$ injuries_crowd_any : num [1:139823] 0 0 0 0 0 0 0 0 0 0 ...
$ injuries_police : logi [1:139823] NA NA NA NA NA NA ...
$ injuries_police_any : num [1:139823] 0 0 0 0 0 0 0 0 0 0 ...
$ property_damage : chr [1:139823] NA NA NA NA ...
$ property_damage_any : num [1:139823] 0 0 0 0 0 1 0 0 0 0 ...
$ chemical_agents : num [1:139823] 0 0 0 0 0 0 0 0 0 0 ...
$ participant_measures: chr [1:139823] NA NA NA NA ...
$ police_measures : chr [1:139823] NA NA NA NA ...
$ participant_deaths : logi [1:139823] NA NA NA NA NA NA ...
$ police_deaths : logi [1:139823] NA NA NA NA NA NA ...
$ source_1 : chr [1:139823] "https://sfbayview.com/2020/12/fam-launches-30-day-economic-blackout/" "https://www.tucsonpeacecalendar.org/events/" "https://www.sfgate.com/news/bayarea/article/Council-To-Ask-State-To-Remove-Highway-Overpass-16006373.php" "https://www.facebook.com/PPAWILPF/" ...
$ source_2 : chr [1:139823] "https://twitter.com/ShutDownRacism/status/1346193901313531906" "https://www.youtube.com/watch?v=8y4kFtLSPPs" NA "correspondence with organizer" ...
$ source_3 : chr [1:139823] NA NA NA NA ...
$ source_4 : chr [1:139823] NA NA NA NA ...
$ source_5 : chr [1:139823] NA NA NA NA ...
$ source_6 : chr [1:139823] NA NA NA NA ...
$ source_7 : chr [1:139823] NA NA NA NA ...
$ source_8 : chr [1:139823] NA NA NA NA ...
$ source_9 : chr [1:139823] NA NA NA NA ...
$ source_10 : chr [1:139823] NA NA NA NA ...
$ source_11 : chr [1:139823] NA NA NA NA ...
$ source_12 : chr [1:139823] NA NA NA NA ...
$ source_13 : chr [1:139823] NA NA NA NA ...
$ source_14 : chr [1:139823] NA NA NA NA ...
$ source_15 : chr [1:139823] NA NA NA NA ...
$ source_16 : chr [1:139823] NA NA NA NA ...
$ source_17 : chr [1:139823] NA NA NA NA ...
$ source_18 : chr [1:139823] NA NA NA NA ...
$ source_19 : chr [1:139823] NA NA NA NA ...
$ source_20 : chr [1:139823] NA NA NA NA ...
$ source_21 : chr [1:139823] NA NA NA NA ...
$ source_22 : chr [1:139823] NA NA NA NA ...
$ source_23 : chr [1:139823] NA NA NA NA ...
$ source_24 : chr [1:139823] NA NA NA NA ...
$ source_25 : chr [1:139823] NA NA NA NA ...
$ source_26 : chr [1:139823] NA NA NA NA ...
$ source_27 : chr [1:139823] NA NA NA NA ...
$ source_28 : logi [1:139823] NA NA NA NA NA NA ...
$ source_29 : logi [1:139823] NA NA NA NA NA NA ...
$ source_30 : logi [1:139823] NA NA NA NA NA NA ...
$ notes : chr [1:139823] "Scheduled to run 30 days." "every Friday since at least 2001; organizers clarified in Jan 2024 email that focus has always been on occupation of Palestine" NA "every Friday; started in 1922, some gaps since" ...
$ lat : num [1:139823] 32.4 32.3 37.9 37.4 34.1 ...
$ lon : num [1:139823] -86.3 -111 -122.1 -122.1 -118.1 ...
$ resolved_locality : chr [1:139823] "Montgomery" "Tucson" "Lafayette" "Palo Alto" ...
$ resolved_county : chr [1:139823] "Montgomery County" "Pima County" "Contra Costa County" "Santa Clara County" ...
$ resolved_state : chr [1:139823] "AL" "AZ" "CA" "CA" ...
$ fips_code : chr [1:139823] "01101" "04019" "06013" "06085" ...
- attr(*, "spec")=
.. cols(
.. date = col_date(format = ""),
.. locality = col_character(),
.. state = col_character(),
.. location_detail = col_character(),
.. online = col_double(),
.. type = col_character(),
.. title = col_character(),
.. macroevent = col_character(),
.. organizations = col_character(),
.. participants = col_character(),
.. claims = col_character(),
.. claims_summary = col_character(),
.. claims_verbatim = col_character(),
.. issue_tags_summary = col_character(),
.. issue_tags_verbatim = col_character(),
.. issue_tags = col_character(),
.. valence = col_double(),
.. size_text = col_character(),
.. size_low = col_double(),
.. size_high = col_double(),
.. size_mean = col_double(),
.. size_cat = col_double(),
.. arrests = col_character(),
.. arrests_any = col_double(),
.. injuries_crowd = col_character(),
.. injuries_crowd_any = col_double(),
.. injuries_police = col_logical(),
.. injuries_police_any = col_double(),
.. property_damage = col_character(),
.. property_damage_any = col_double(),
.. chemical_agents = col_double(),
.. participant_measures = col_character(),
.. police_measures = col_character(),
.. participant_deaths = col_logical(),
.. police_deaths = col_logical(),
.. source_1 = col_character(),
.. source_2 = col_character(),
.. source_3 = col_character(),
.. source_4 = col_character(),
.. source_5 = col_character(),
.. source_6 = col_character(),
.. source_7 = col_character(),
.. source_8 = col_character(),
.. source_9 = col_character(),
.. source_10 = col_character(),
.. source_11 = col_character(),
.. source_12 = col_character(),
.. source_13 = col_character(),
.. source_14 = col_character(),
.. source_15 = col_character(),
.. source_16 = col_character(),
.. source_17 = col_character(),
.. source_18 = col_character(),
.. source_19 = col_character(),
.. source_20 = col_character(),
.. source_21 = col_character(),
.. source_22 = col_character(),
.. source_23 = col_character(),
.. source_24 = col_character(),
.. source_25 = col_character(),
.. source_26 = col_character(),
.. source_27 = col_character(),
.. source_28 = col_logical(),
.. source_29 = col_logical(),
.. source_30 = col_logical(),
.. notes = col_character(),
.. lat = col_double(),
.. lon = col_double(),
.. resolved_locality = col_character(),
.. resolved_county = col_character(),
.. resolved_state = col_character(),
.. fips_code = col_character()
.. )
- attr(*, "problems")=<externalptr>
head(ccc_data)
# A tibble: 6 × 72
date locality state location_detail online type title macroevent
<date> <chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 2021-01-01 Montgomery AL statewide 0 stri… <NA> <NA>
2 2021-01-01 Tucson AZ E Speedway Blvd … 0 vigil <NA> <NA>
3 2021-01-01 Lafayette CA El Curtola Blvd … 0 demo… <NA> <NA>
4 2021-01-01 Palo Alto CA El Camino Real a… 0 vigil <NA> <NA>
5 2021-01-01 Pasadena CA Rose Bowl Stadium 0 rall… Patr… <NA>
6 2021-01-01 San Francisco CA home of U.S. Hou… 0 dire… <NA> <NA>
# ℹ 64 more variables: organizations <chr>, participants <chr>, claims <chr>,
# claims_summary <chr>, claims_verbatim <chr>, issue_tags_summary <chr>,
# issue_tags_verbatim <chr>, issue_tags <chr>, valence <dbl>,
# size_text <chr>, size_low <dbl>, size_high <dbl>, size_mean <dbl>,
# size_cat <dbl>, arrests <chr>, arrests_any <dbl>, injuries_crowd <chr>,
# injuries_crowd_any <dbl>, injuries_police <lgl>, injuries_police_any <dbl>,
# property_damage <chr>, property_damage_any <dbl>, chemical_agents <dbl>, …
Inspecting data
There’s still a lot of noise here. Let’s write some functions that can help us inspect the data in a way that will be more informative for analyzing event data on contentious politics.
We can see from an initial inspection of the data that there are a lot of missing values. We are likely to repeatedly address this issue across other phases of this data project, and potentially across sub-analyses of the same dataset. Let’s write a function to make this less redundant.
First, let’s write a script that inspects the issue of missing values.
na_counts <- ccc_data %>%
  summarise_all(~ sum(is.na(.))) %>%
  pivot_longer(cols = everything(), names_to = "var", values_to = "n_missing") %>%
  filter(n_missing > 0) %>%
  arrange(desc(n_missing))
na_counts
# A tibble: 65 × 2
var n_missing
<chr> <int>
1 source_28 139823
2 source_29 139823
3 source_30 139823
4 police_deaths 139822
5 participant_deaths 139809
6 injuries_police 139803
7 source_27 139786
8 source_26 139780
9 source_25 139775
10 source_24 139762
# ℹ 55 more rows
Great, let’s turn the above script into a function.
count_missing <- function(data){
  na_counts <- data %>%
    summarise_all(~ sum(is.na(.))) %>%
    pivot_longer(cols = everything(), names_to = "var", values_to = "n_missing") %>%
    filter(n_missing > 0) %>%
    arrange(desc(n_missing))
  return(na_counts)
}
count_missing(ccc_data)
# A tibble: 65 × 2
var n_missing
<chr> <int>
1 source_28 139823
2 source_29 139823
3 source_30 139823
4 police_deaths 139822
5 participant_deaths 139809
6 injuries_police 139803
7 source_27 139786
8 source_26 139780
9 source_25 139775
10 source_24 139762
# ℹ 55 more rows
Manipulating data
Now we know what we’re working with. Let’s write a function that allows us to manipulate the data for our purposes:
- We are not interested in columns that start with "source_".
- We are not interested in keeping any variables with a high number of missings (>500).
- Among the selected columns with <500 missing values, we are not interested in keeping any rows with missing values.
na_counts_subset <- na_counts %>%
  filter(
    !grepl("^source_", var),
    n_missing < 500
  )
na_counts_subset
# A tibble: 10 × 2
var n_missing
<chr> <int>
1 fips_code 126
2 resolved_state 93
3 online 68
4 lat 61
5 lon 61
6 locality 53
7 type 49
8 valence 34
9 state 16
10 claims 9
Now we have a list of variables with limited missing values. Let's select() for variables with limited missing values and then filter() out any remaining rows with missing values.
This is what the script for that would look like, building off the code we’ve already written:
# Vectors that specify which variables to keep/drop
missings_vars_all <- na_counts$var
missings_vars_keep <- na_counts_subset$var
missings_vars_drop <- setdiff(missings_vars_all, missings_vars_keep)

missings_vars_drop
[1] "source_28" "source_29" "source_30"
[4] "police_deaths" "participant_deaths" "injuries_police"
[7] "source_27" "source_26" "source_25"
[10] "source_24" "source_23" "source_22"
[13] "source_21" "source_20" "source_19"
[16] "source_18" "source_17" "source_16"
[19] "source_15" "source_14" "injuries_crowd"
[22] "source_13" "source_12" "property_damage"
[25] "source_11" "source_10" "source_9"
[28] "arrests" "source_8" "source_7"
[31] "source_6" "macroevent" "source_5"
[34] "police_measures" "source_4" "source_3"
[37] "issue_tags_verbatim" "participants" "participant_measures"
[40] "claims_verbatim" "size_text" "size_high"
[43] "size_low" "size_mean" "title"
[46] "source_2" "notes" "organizations"
[49] "resolved_county" "location_detail" "issue_tags_summary"
[52] "issue_tags" "resolved_locality" "claims_summary"
[55] "source_1"
# Use `select()` to drop variables with too many missings
ccc_clean <- ccc_data %>%
  select(-one_of(missings_vars_drop))
ncol(ccc_data)
[1] 72
ncol(ccc_clean)
[1] 17
# And now filtering out all remaining rows with missing values
ccc_clean <- ccc_clean %>%
  drop_na()
nrow(ccc_data)
[1] 139823
nrow(ccc_clean)
[1] 139580
Exercise 2a Write a function that handles missing values. It should select() for variables with limited missing values and filter() out any remaining rows with missing values.
Hint: we can integrate the other function we just wrote into this function.
handle_missing_values <- function(data){
  na_counts <- count_missing(data)

  # Subset to variables we want to keep, dropping those that:
  # - have too many missings
  # - contain the link to a `source`
  na_counts_subset <- na_counts %>%
    filter(
      !grepl("^source_", var),
      n_missing < 500
    )

  missings_vars_all <- na_counts$var
  missings_vars_keep <- na_counts_subset$var
  missings_vars_drop <- setdiff(missings_vars_all, missings_vars_keep)

  # Use `select()` to drop variables with too many missings
  data_subset <- data %>%
    select(-one_of(missings_vars_drop))

  # And now filter out all remaining rows with missing values
  data_subset <- data_subset %>%
    drop_na()

  return(data_subset)
}
Exercise 2b Apply this function to create a dataset called ccc_clean.
ccc_clean <- handle_missing_values(ccc_data)
nrow(ccc_clean)
[1] 139580
ncol(ccc_clean)
[1] 17
Great, we have a complete dataset to work with! The 17 remaining variables are as follows:
- date: Date of event in YYYY-MM-DD format; multi-day events are recorded as one-day-per-row
- locality: Name of the locality (i.e., city or town) in which the event took place
- state: Two-letter U.S. postal abbreviation for the state or U.S. territory in which the event took place
- online: Binary indicator for online-only events. 1 = online, 0 = in-person. Generated from location and event type information in the source data
- type: Type(s) of protest action (e.g. march, protest, demonstration, strike, counter-protest, sit-in), separated with semicolons or commas when more than one
- claims: Comma-separated text phrases describing what the event was about
- valence: Political valence of the event (2 = right wing, 1 = left wing, 0 = other or neither)
- size_cat: Categorical indicator of crowd size
  - 0 = unknown
  - 1 = 1-99
  - 2 = 100-999
  - 3 = 1,000-9,999
  - 4 = 10,000+
- arrests_any: Binary indicator for whether or not any arrests occurred. 1 = yes, 0 = no.
- injuries_crowd_any: Binary indicator for whether or not any protesters were reportedly injured. 1 = yes, 0 = no.
- injuries_police_any: Binary indicator for whether or not any police officers were reportedly injured. 1 = yes, 0 = no.
- property_damage_any: Binary indicator for whether or not protesters reportedly caused any property damage. 1 = yes, 0 = no.
- chemical_agents: Binary indicator for whether police or other state security forces used tear gas or other chemical irritants, such as pepper spray or pepper balls, during the protest event. 1 = yes, 0 = no.
- lat and lon: Latitude and longitude of the locality in which the event took place, as resolved by the Google Maps Geocoding API
- resolved_state: Postal abbreviation of the state or territory in which the event occurred, as resolved by the Google Maps Geocoding API
- fips_code: Five-digit FIPS code for the county in which the event took place
Now let's say we're interested in the type of events represented in the dataset. Let's have a look at the first few values in the type variable:
ccc_clean$type[1:5]
[1] "strike; boycott" "vigil" "demonstration" "vigil"
[5] "rally; parade"
As shown here, some events have more than one type indicated. Let’s write up some code that expands the data so each row represents an event-type pairing:
ccc_types <- ccc_clean %>%
  separate_rows(type, sep = ";") %>%
  mutate(type = str_trim(type)) # remove leading/trailing whitespace left by the split
nrow(ccc_clean)
[1] 139580
nrow(ccc_types)
[1] 165730
We'll take the increased number of rows in ccc_types as an indication that the code is doing what we want it to do.
Now let’s turn it into a function that separates rows by type:
separate_rows_by_type <- function(data){
  data_expanded <- data %>%
    separate_rows(type, sep = ";") %>%
    mutate(type = str_trim(type)) # remove leading/trailing whitespace left by the split
  return(data_expanded)
}
Even if we do not end up using this function for other datasets, it could be useful as a helper function so we can use some shorthand in other functions that work with this expanded version of the dataset.
Visualizing data
Functions can also be helpful for streamlining data visualization. For example, functions that visualize user-specified subsets of data can be especially useful for creating dashboards.
Let's say we're interested in generating plots that illustrate event types over time for a given date range. Let's write a script for illustrating six event types over time: demonstration, rally, vigil, protest, strike, and picket.
event_types <- c("demonstration", "rally", "vigil", "protest", "strike", "picket")

ccc_types_over_time <- separate_rows_by_type(ccc_clean) %>%
  filter(type %in% event_types) %>%
  count(date, type)

ccc_types_over_time %>%
  ggplot(aes(x = date, y = n, color = type)) +
  geom_line() +
  facet_wrap(~type)
Now let’s write a function that takes a vector of events and a data frame as input, then outputs this data visualization on types of events.
Here’s what the body of the function will be based on:
event_types <- c("demonstration", "rally", "vigil", "protest", "strike", "picket")

ccc_types_over_time <- separate_rows_by_type(ccc_clean) %>%
  filter(type %in% event_types) %>%
  count(date, type) %>%
  arrange(desc(n))

ccc_types_over_time %>%
  ggplot(aes(x = date, y = n, color = type)) +
  geom_line() +
  facet_wrap(~type)
visualize_types_over_time <- function(df, event_types = c("demonstration", "rally", "vigil", "protest", "strike", "picket")){

  df_types_over_time <- separate_rows_by_type(df) %>%
    filter(type %in% event_types) %>%
    count(date, type)

  df_types_over_time %>%
    ggplot(aes(x = date, y = n, color = type)) +
    geom_line() +
    facet_wrap(~type)
}
visualize_types_over_time(ccc_clean)
Since we include a vector of event types as input for the function, we can use this function to visualize other event types:
visualize_types_over_time(ccc_clean, event_types = c("caravan", "bicycle ride", "walkout"))
Discussion Q2 What other arguments could we add to this function?
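One direction worth discussing: optional start_date and end_date arguments would let users plot a specific window of time. A sketch of how that could look (the argument names and NULL defaults here are our own choices, not part of the original function):

```r
# Variant of the plotting function with an optional date window.
# NULL defaults mean "no restriction" on that side of the range.
visualize_types_over_time_ranged <- function(df,
                                             event_types = c("demonstration", "rally", "vigil"),
                                             start_date = NULL,
                                             end_date = NULL) {
  df_types_over_time <- separate_rows_by_type(df) %>%
    filter(type %in% event_types) %>%
    # Keep only events inside the (optional) date window
    filter(
      if (!is.null(start_date)) date >= as.Date(start_date) else TRUE,
      if (!is.null(end_date)) date <= as.Date(end_date) else TRUE
    ) %>%
    count(date, type)

  df_types_over_time %>%
    ggplot(aes(x = date, y = n, color = type)) +
    geom_line() +
    facet_wrap(~type)
}
```

Other candidates include arguments controlling the count unit (daily vs. monthly) or the faceting variable.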
Iteration
purrr::map()
Now that we understand how functions work, we can start applying them repeatedly across multiple elements. This process is known as iteration.
In R, iteration refers to the application of the same operation to every item in a vector, list, or column. This includes:
- Applying a function to every row/column of a dataset
- Calculating summary statistics for every country in a panel dataset
- Cleaning a batch of interview transcripts
The classic way of iterating involves for loops. Let’s say we want to set every word in the following vector to lowercase:
regime_types <- c("Democratic", "Authoritarian", "Hybrid")

for (word in regime_types) {
  print(tolower(word))
}
[1] "democratic"
[1] "authoritarian"
[1] "hybrid"
While this technically would work, it can get verbose when applying code across lists of models, multiple text documents, grouped datasets, and the like. At this scale, for loops can be really hard to read, debug, or reuse, which is the opposite of what we want!
The purrr package offers a tidier toolkit for iteration. Instead of writing out the for loops, purrr's functions will do the iteration for you.
The most common function used from the purrr package is map().
Let’s revisit our text cleaning example from before:
map(regime_types, tolower)
[[1]]
[1] "democratic"
[[2]]
[1] "authoritarian"
[[3]]
[1] "hybrid"
So much easier! It more or less does the same thing, but returns a list.
If you wanted a character vector instead of a list, you can use the map_chr() function:
map_chr(regime_types, tolower)
[1] "democratic" "authoritarian" "hybrid"
purrr also has other variants of map() that produce other types of vectors:
- map_int(): integer vector
- map_dbl(): double vector
- map_chr(): character vector
- map_lgl(): logical vector
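A quick sanity check of these variants on a toy vector (the values here are made up for illustration):

```r
x <- c(1.5, 2.5, 3.5)

map_dbl(x, ~ .x * 2)                 # double vector: 3, 5, 7
map_lgl(x, ~ .x > 2)                 # logical vector: FALSE, TRUE, TRUE
map_chr(x, ~ paste0("value: ", .x))  # character vector
map_int(x, ~ as.integer(floor(.x)))  # integer vector: 1, 2, 3
```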
Let’s try some more examples with the CCC data.
Let’s start with a tibble containing information on whether arrests, injuries to protesters, or use of chemical agents took place:
incident_df <- ccc_clean %>%
  select(
    arrests_any,
    injuries_crowd_any,
    chemical_agents
  )
str(incident_df)
tibble [139,580 × 3] (S3: tbl_df/tbl/data.frame)
$ arrests_any : num [1:139580] 0 0 0 0 0 0 0 0 0 0 ...
$ injuries_crowd_any: num [1:139580] 0 0 0 0 0 0 0 0 0 0 ...
$ chemical_agents : num [1:139580] 0 0 0 0 0 0 0 0 0 0 ...
All of these variables are binary indicators of whether one of these events took place.
Let’s say that we want to generate a vector that counts how many incidents of each type occurred:
map_int(incident_df, sum)
arrests_any injuries_crowd_any chemical_agents
1776 441 124
The map variants will throw an error if your function doesn’t output the correct type of data.
map_lgl(incident_df, sum)
Running this throws an error, since the column sums are numbers rather than logicals. If you're getting this sort of "Can't coerce" error, it means that you are using the wrong variant of the map function.
Anonymous functions
Until now, we have only been working with named functions. Naming a function is useful when you need to run something more than once. If you only need to apply a function one time, you may want to consider using an anonymous function instead: we can just write the function body inside map().
Remember the count_missing function we wrote earlier? Let's find a more streamlined way to count the number of missings across variables using map() and anonymous functions:
ccc_data %>% map_int(~ sum(is.na(.)))
date locality state
0 53 16
location_detail online type
9804 68 49
title macroevent organizations
84363 132123 38419
participants claims claims_summary
104366 9 671
claims_verbatim issue_tags_summary issue_tags_verbatim
97272 3525 105490
issue_tags valence size_text
2371 34 92885
size_low size_high size_mean
92783 92793 92783
size_cat arrests arrests_any
0 138045 0
injuries_crowd injuries_crowd_any injuries_police
139382 0 139803
injuries_police_any property_damage property_damage_any
0 139035 0
chemical_agents participant_measures police_measures
0 102520 129013
participant_deaths police_deaths source_1
139809 139822 80
source_2 source_3 source_4
78516 110953 125379
source_5 source_6 source_7
131851 134983 136451
source_8 source_9 source_10
137725 138421 138769
source_11 source_12 source_13
139009 139183 139309
source_14 source_15 source_16
139403 139463 139530
source_17 source_18 source_19
139577 139617 139647
source_20 source_21 source_22
139685 139709 139735
source_23 source_24 source_25
139754 139762 139775
source_26 source_27 source_28
139780 139786 139823
source_29 source_30 notes
139823 139823 59756
lat lon resolved_locality
61 61 812
resolved_county resolved_state fips_code
13546 93 126
If we are trying to calculate missings across multiple datasets, writing a named function may be helpful. But if we just want to count missings in one dataset, this is the way to go.
dplyr::across()
In addition to the map functions, dplyr::across() also offers options for iteration. It is typically used for data wrangling and applying the same function(s) across columns in a dataset.
Let’s have a look at the documentation for this function:
?across
The three arguments that we should account for are:
- .cols: This tells R which columns in the data frame the function(s) should be applied to. You can pass a character vector, tidyselect helpers (e.g. starts_with("foo")), or column positions.
- .fns: The function (or list of functions) to apply to each selected column.
- .names (optional): Controls how the names of newly constructed columns will be constructed.
Building on our example from before, let's say we are interested in inspecting the number of events where arrests occurred, protesters/civilians were injured, or chemical agents were deployed.
Let's first specify the three columns within the ccc_clean dataset that this would apply to, then apply the sum function across these columns:
incident_cols <- c("arrests_any", "injuries_crowd_any", "chemical_agents")

incidents_by_state <- ccc_clean %>%
  group_by(state) %>%
  summarize(
    n_events = n(),
    across(all_of(incident_cols), sum)
  )
head(incidents_by_state)
# A tibble: 6 × 5
state n_events arrests_any injuries_crowd_any chemical_agents
<chr> <int> <dbl> <dbl> <dbl>
1 AK 267 1 0 0
2 AL 1698 7 4 0
3 AR 485 4 2 0
4 AZ 2445 35 11 4
5 CA 21554 209 96 16
6 CO 2838 17 5 2
Let’s build a custom function to apply to this state-level dataset. Let’s say we are interested in the percentage of events for each type of incident relative to the total number of events in each state.
calculate_percentage_col <- function(x, n_events) {
  x / n_events * 100
}
Now let’s apply it to our incidents_by_state dataset:

incidents_tmp <- incidents_by_state %>%
  mutate(across(
    .cols = all_of(incident_cols),
    .fns = ~ calculate_percentage_col(.x, n_events),
    .names = "perc_{.col}"
  ))
Finally, let’s check the structure of this dataset to double check that it’s working:
str(incidents_tmp)
tibble [53 × 8] (S3: tbl_df/tbl/data.frame)
$ state : chr [1:53] "AK" "AL" "AR" "AZ" ...
$ n_events : int [1:53] 267 1698 485 2445 21554 2838 1657 5867 172 4104 ...
$ arrests_any : num [1:53] 1 7 4 35 209 17 21 188 3 59 ...
$ injuries_crowd_any : num [1:53] 0 4 2 11 96 5 4 24 0 12 ...
$ chemical_agents : num [1:53] 0 0 0 4 16 2 0 7 0 2 ...
$ perc_arrests_any : num [1:53] 0.375 0.412 0.825 1.431 0.97 ...
$ perc_injuries_crowd_any: num [1:53] 0 0.236 0.412 0.45 0.445 ...
$ perc_chemical_agents : num [1:53] 0 0 0 0.1636 0.0742 ...
Exercise 3a Based on ccc_data, use across() and an anonymous function to convert all of the text in the claims, organizations, and title columns to lowercase.
text_vars <- c("claims", "organizations", "title")

ccc_data_lower <- ccc_data %>%
  mutate(across(all_of(text_vars), ~ tolower(.x)))

ccc_data_lower %>%
  select(claims, organizations, title)
# A tibble: 139,823 × 3
claims organizations title
<chr> <chr> <chr>
1 against prison labor, for safer conditions in alabama pr… free alabama… <NA>
2 for ending israel's occupation of palestine, for palesti… women in bla… <NA>
3 for president trump contra costa… <NA>
4 for peace, against war, for banning nuclear weapons wilpf <NA>
5 for president trump, against election fraud, against gov… trump unity … patr…
6 for canceling rent and providing financial relief during… <NA> <NA>
7 for safer streets for bicyclists, for community-building san francisc… <NA>
8 for peace <NA> frid…
9 for free phone calls and video visitations and increased… <NA> <NA>
10 black lives matter, against police brutality the valley o… <NA>
# ℹ 139,813 more rows
Exercise 3b Use map_int() to create a new column in the ccc_clean data frame that contains the number of characters in each entry of the claims column.
Hint: You can use the nchar() function to count characters in a string.
ccc_clean2 <- ccc_clean %>%
  mutate(n_chars = map_int(claims, ~ nchar(.x)))
If the function was written correctly, the following code should work:
ccc_clean2$n_chars[1:10]
[1] 89 71 19 51 104 77 56 9 112 44
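As an aside, nchar() is already vectorized over a character vector, so the same column could also be built without map_int(); the map version is useful practice, but both give identical results. A toy sketch:

```r
library(purrr)

claims <- c("for peace", "against war")

# Element by element with map_int()
map_int(claims, ~ nchar(.x))
#> [1]  9 11

# Vectorized equivalent
nchar(claims)
#> [1]  9 11
```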
Tidy evaluation
If you mainly use tidyverse functions in your R workflow, functional programming will be a little more complicated. This is because, when it comes to working with variables programmatically, the tidyverse uses a system called tidy evaluation.
Writing functions that accept variables as arguments introduces a new challenge. This is because variables are treated differently between base R and the tidyverse:
- In base R, variable names are typically treated as placeholders that refer to objects in the global environment.
- In contrast, tidyverse functions treat variable names as expressions that are evaluated inside the context of the data frame itself. This is what makes the tidyverse more user-friendly and dynamic; if a variable is treated as an expression, you can chain multiple functions together to manipulate data at scale.
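The contrast fits in a two-line sketch (df and x are illustrative names):

```r
library(dplyr)

df <- data.frame(x = 1:3, y = c("a", "b", "c"))

# Base R: x must be qualified as df$x; otherwise R looks for an
# object named x in the calling environment
df[df$x > 1, ]

# tidyverse: the bare name x is evaluated inside df
df %>% filter(x > 1)
```

Both lines return the same two rows, but only the tidyverse version lets you refer to the column by its bare name.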
This difference introduces a tension when we try to write our own functions. Inside our function, we want to let the user pass in column names and use those names in tidyverse functions such as filter(), ggplot(), and group_by(). But R doesn’t automatically know that those names refer to columns inside a data frame, rather than variables in the function environment.
This is where tidy evaluation comes in. Tidy evaluation is the programming system that powers tidyverse functions, bridging the gap between how variables are interpreted in the global environment and how they are interpreted inside a tidy data frame.
The core tools of tidy evaluation are:
- {{ }} (a.k.a. embracing): Tells R to treat the argument as a column reference inside a data frame, e.g.:

group_and_summarize <- function(data, group_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(n_events = n())
}
- ...: Flexible arguments. Use when you want to pass several expressions through a function (e.g. filter() conditions or group_by() variables), e.g.:

df_filter <- function(df, ...) {
  df %>% filter(...)
}
- := (a.k.a. the “walrus operator”): Allows the output name to be created from the input of a function, e.g.:

lowercase_column <- function(data, input_col, output_col) {
  data %>%
    mutate({{ output_col }} := tolower({{ input_col }}))
}
- !!!: Unpacks and passes multiple arguments in as if they were typed out manually, e.g.:

valence_map <- c(
  "0" = "none/other",
  "1" = "left-wing",
  "2" = "right-wing"
)

ccc_recoded <- ccc_data %>%
  mutate(valence_label = recode(as.character(valence), !!!valence_map))
Here’s a table that can serve as a small cheat sheet for these four tools:

| Tool | What it does | When to use it |
|---|---|---|
| {{ }} | Treats argument as a column name | Always, for tidy column inputs |
| ... | Forwards user expressions | For flexible filters, groupings |
| := | Assigns dynamic column names | When user names the output |
| !!! | Unpacks a list of named arguments | When arguments are stored in a list |
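These tools also compose. As a sketch (function and column names here are illustrative, using the built-in mtcars data), one helper can embrace a grouping column, forward filter() conditions through ..., and name its output with :=:

```r
library(dplyr)

# {{ }} embraces the grouping column, ... forwards filter() conditions,
# and := builds the output column name from user input
count_by <- function(data, group_var, out_name, ...) {
  data %>%
    filter(...) %>%
    group_by({{ group_var }}) %>%
    summarise({{ out_name }} := n())
}

# Count heavy cars (wt > 3) per cylinder group
heavy_by_cyl <- count_by(mtcars, cyl, n_heavy, wt > 3)
heavy_by_cyl
```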
Let’s apply some tidy evaluation to data. There are two variables in the CCC data that include text strings separated by semi-colons: type and claims. Let’s say we are interested in tabulating the types of events or claims over time.
Let’s revisit the code we wrote for doing this based on type:
ccc_types <- ccc_clean %>%
  separate_rows(type, sep = ";") %>%
  mutate(type = gsub("\\s+", "", type))
Now let’s turn it into a function:
separate_rows_by_var <- function(data, var) {
  data_expanded <- data %>%
    separate_rows({{ var }}, sep = ";") %>%
    mutate({{ var }} := gsub("\\s+", "", {{ var }}))
  return(data_expanded)
}
Great! Now we can separate the rows based on type or claims.
separate_rows_by_var(ccc_clean, type)
separate_rows_by_var(ccc_clean, claims)
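Because the helper is generic, it works on any data frame with a semicolon-delimited column. A self-contained toy sketch (redefining the function so the example runs on its own):

```r
library(dplyr)
library(tidyr)

separate_rows_by_var <- function(data, var) {
  data %>%
    separate_rows({{ var }}, sep = ";") %>%
    mutate({{ var }} := gsub("\\s+", "", {{ var }}))
}

toy <- tibble(id = 1:2, type = c("march; rally", "vigil"))

# One row per event type, whitespace stripped
cleaned <- separate_rows_by_var(toy, type) %>% count(type)
cleaned
```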
Exercise 4 Using a dataset of your choice, write a function that applies tidy evaluation.
Bonus exercise (if time)
Exercise 5 Practice writing functions based on an R
script/dataset from your own work.
Additional resources
While we cover a lot of ground through applied examples in this short course, we will not be able to cover everything there is to know about functions. Here are some resources that you can consult on your own time in the event that you want to build on what we learned today!
- Debugging, condition handling, and defensive programming by Hadley Wickham (Advanced R)
- Error Handling in R by Alboukadel Kassambara, PhD
- Functional programming by Sara Altman, Bill Behrman, Hadley Wickham
- Function Documentation by Hadley Wickham and Jennifer Bryan
References
Footnotes
We won’t get to documentation today, but I have added a link on how to do it in the “Additional resources” section.↩︎