R

How-to Guides and Useful Links

base

Preventing scientific notation: https://stackoverflow.com/questions/25946047/how-to-prevent-scientific-notation-in-r
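A minimal sketch of the two usual base R approaches discussed in that thread: raising the scipen option for the whole session, or formatting a single value. The example number is arbitrary.

```r
# Option 1: penalize scientific notation for the whole session
old_opts <- options(scipen = 999)
big_number <- 1e10
print(big_number)

# Option 2: format one value explicitly, regardless of global options
formatted <- format(big_number, scientific = FALSE)

# restore the previous setting when done
options(old_opts)
```

Option 2 is handy inside plot labels or tables where you do not want to change session-wide behavior.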

package management

My current preferred package management workflow involves creating virtual environments with renv.

My previously preferred package for package management is pacman. Before loading dependencies, put this at the top of the script:

if(!require("pacman")) install.packages("pacman")
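As a rough sketch of what usually follows that line: p_load() installs any missing packages and then attaches them. The package names here (dplyr, ggplot2) are only examples, not a recommendation.

```r
# install pacman itself only if it is missing
if (!require("pacman")) install.packages("pacman")

# p_load() installs any missing packages, then attaches them all
pacman::p_load(dplyr, ggplot2)
```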

file/path management

I use the here package for file/path management.

To reset the project root (the here counterpart to setwd()): set_here(). It’s a superseded function, but I don’t really like the replacement.
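For everyday use, a small sketch of building paths with here(), assuming the script runs somewhere inside a project that here can detect. The folder and file names ("data", "raw.csv") are hypothetical.

```r
library(here)

# here() with no arguments returns the detected project root
project_root <- here()

# build a path relative to the root instead of calling setwd();
# "data" and "raw.csv" are hypothetical names
csv_path <- here("data", "raw.csv")
```

The resulting path stays valid no matter which subfolder the script is run from.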

dplyr

Combine with purrr::map() to read multiple CSVs into one data frame: https://www.mjandrews.org/blog/readmultifile/
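A self-contained sketch of that pattern, not the linked post's exact code: it writes two tiny example files to a temporary folder, reads each with map(), and stacks them with bind_rows(). The file names and the source_file column are assumptions made for illustration.

```r
library(dplyr)
library(purrr)

# write two small example files to a temporary directory (made-up data)
tmp_dir <- file.path(tempdir(), "multi_csv_example")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, value = c(10, 20)),
          file.path(tmp_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c(30, 40)),
          file.path(tmp_dir, "b.csv"), row.names = FALSE)

files <- list.files(tmp_dir, pattern = "\\.csv$", full.names = TRUE)

combined <- files %>%
  set_names(basename(files)) %>%   # names become the .id values below
  map(read.csv) %>%                # one data frame per file
  bind_rows(.id = "source_file")   # stack them, recording the origin
```

Keeping the file name in a column makes it easy to trace a row back to where it came from.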

Useful functions that I am constantly forgetting: na_if() and rowwise() (like group_by(), but for rows)

NOTE: Don’t get stuck in the trap of doing row-wise operations if pivoting makes more sense!
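A small sketch of both functions on made-up survey scores, where -99 is assumed to be a missing-value code. It also shows the pivoting route from the note, which gives the same result here.

```r
library(dplyr)
library(tidyr)

# made-up data; -99 stands in for "missing"
scores <- tibble(id = 1:3, q1 = c(5, -99, 3), q2 = c(4, 2, -99))

# na_if() turns the sentinel value into NA
scores_clean <- scores %>%
  mutate(across(c(q1, q2), ~ na_if(.x, -99)))

# rowwise(): compute one mean per row
row_means <- scores_clean %>%
  rowwise() %>%
  mutate(row_mean = mean(c_across(c(q1, q2)), na.rm = TRUE)) %>%
  ungroup()

# pivoting alternative: reshape to long, then summarize by id
pivot_means <- scores_clean %>%
  pivot_longer(c(q1, q2), names_to = "question", values_to = "score") %>%
  group_by(id) %>%
  summarize(row_mean = mean(score, na.rm = TRUE))
```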

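slice(1L) after sorting, for getting the row with the max value in each group

grouped_data <- data %>%
  group_by(variable, group_vars) %>%
  summarize(values = sum(values)) %>% # result is still grouped by `variable`
  mutate(grp = cur_group_id()) %>%
  arrange(desc(values)) %>% # sort so the largest total comes first
  slice(1L) # first row of each group is now its max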
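For comparison, a runnable sketch on made-up sales data showing the sort-then-slice pattern next to slice_max(), a dplyr function that does the same job in one step. The column names are hypothetical.

```r
library(dplyr)

# made-up data
sales <- tibble(
  region  = c("north", "north", "south", "south"),
  product = c("a", "b", "a", "b"),
  revenue = c(10, 25, 40, 5)
)

# sort within groups, then keep the first row of each group
top_by_arrange <- sales %>%
  group_by(region) %>%
  arrange(desc(revenue), .by_group = TRUE) %>%
  slice(1L) %>%
  ungroup()

# same result without the explicit sort
top_by_slice_max <- sales %>%
  group_by(region) %>%
  slice_max(revenue, n = 1, with_ties = FALSE) %>%
  ungroup()
```

with_ties = FALSE keeps exactly one row per group even when several rows share the maximum.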

recode() values in variables

replace_na() for recoding NA values in variables
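A quick sketch of the two together on a made-up answer column; the labels ("yes", "no", "missing") are arbitrary.

```r
library(dplyr)
library(tidyr)

# made-up data
survey <- tibble(answer = c("y", "n", NA, "y"))

survey_recoded <- survey %>%
  mutate(
    answer = recode(answer, y = "yes", n = "no"), # old value = new value
    answer = replace_na(answer, "missing")        # fill the remaining NAs
  )
```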

Do you want counts of variables in groups without deleting all the other variables? Use mutate() after group_by() instead of summarize. Then subset accordingly. e.g.:

df <- df %>% # assign the result so the new columns are available below
  group_by(country_person) %>%
  mutate(
    n_articles_total = n(),
    n_articles_before = sum(before_appoint == 1),
    n_articles_after = n_articles_total - n_articles_before,
    n_lang_en = sum(lang_en == 1),
    n_lang_other = n_articles_total - n_lang_en,
    av_text_length = mean(length)
  ) %>%
  ungroup()

df_tidy_subset <- df %>%
  select(
    id, country_person, n_articles_total, n_articles_before,
    n_articles_after, n_lang_en, n_lang_other, av_text_length
  ) %>%
  distinct() # rm duplicates

id numbers within groups

df %>% group_by(cat) %>% mutate(id = row_number())

String/character vector manipulation

Remove all characters that are non-numeric: STRING <- str_remove_all(STRING, "\\D+")

Extract substring between two strings: qdapRegex::ex_between()
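A short sketch of both helpers on made-up strings. The input text and the choice of parentheses as markers are just for illustration.

```r
library(stringr)

# keep only the digits
phone <- "Tel: (555) 010-0199"
digits_only <- str_remove_all(phone, "\\D+")
digits_only
# "5550100199"

# pull out whatever sits between a left and a right marker;
# ex_between() returns a list with one element per input string
notes <- "see (first note) and (second note)"
between <- qdapRegex::ex_between(notes, "(", ")")
```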

purrr

map

On reading in multiple files and combining result of a function into a data frame: https://clauswilke.com/blog/2016/06/13/reading-and-combining-many-tidy-data-files-in-r/

lubridate (/working with dates in general)

Create date object from year and month columns with ym() function (goes for a bunch of different ymd combinations as well). e.g.:

df %>%
  mutate(
    date = ym(paste(Year, Month))
  )
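A self-contained version of the same idea on made-up data. The Day column and the make_date() variant are additions for illustration, not part of the snippet above.

```r
library(dplyr)
library(lubridate)

# made-up data
monthly <- tibble(Year = c(2021, 2021), Month = c(3, 11), Day = c(5, 20))

dated <- monthly %>%
  mutate(
    month_start = ym(paste(Year, Month)),       # first day of each month
    full_date   = ymd(paste(Year, Month, Day)), # same idea with a day column
    via_make    = make_date(Year, Month, Day)   # numeric route, no string parsing
  )
```

make_date() can be a little more robust when the columns are already numeric, since nothing has to be parsed from text.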

Manipulating Twitter Data

(NOTE: as of July 2023, there is extremely limited researcher access to Twitter data)

I used to collect my data using the twarc Python package, but work with my data in R. See code below as an example for wrangling the JSON strings from entities variables.

You might have to make things more complex if you want to also add where tweets came from, but hopefully the snippet below provides a good starting point!

tweets_entities <- tweets %>%
  # drop_na() has no effect if missing values are stored as "" rather than NA
  filter(entities.annotations != "") %>%
  # collapse doubled quotes so the JSON parses
  mutate(entities.annotations = gsub("\"\"", "\"", entities.annotations))

entities <- map(tweets_entities$entities.annotations, jsonlite::fromJSON) %>%
  bind_rows() %>%
  select("type", "normalized_text") %>%
  distinct()

people <- entities %>% 
  filter(type == "Person")
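If you do want to keep track of which tweet each entity came from, one possible sketch is to parse the JSON into a list-column and unnest it. This assumes the tweets have an id column and uses two made-up tweets; real twarc output may need more cleaning first.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(jsonlite)

# made-up stand-in for the real data; the `id` column is an assumption
tweets <- tibble(
  id = c("1", "2"),
  entities.annotations = c(
    '[{"type":"Person","normalized_text":"Ada Lovelace"}]',
    '[{"type":"Place","normalized_text":"Berlin"},{"type":"Person","normalized_text":"Alan Turing"}]'
  )
)

entities_by_tweet <- tweets %>%
  mutate(parsed = map(entities.annotations, fromJSON)) %>% # one data frame per tweet
  select(id, parsed) %>%
  unnest(parsed)                                           # one row per annotation

people_by_tweet <- entities_by_tweet %>%
  filter(type == "Person")
```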

ggplot2

If you’re a Python user, stick to the Grammar of Graphics and use the plotnine library for visualization :D

Heatmap snippet

fig_df |> 
  ggplot(aes(x = country, y = account_type, fill = n)) + 
  geom_tile(color = "white") +
  geom_text(aes(label = n), color = "white", size = 15) + 
  coord_fixed() + 
  scale_fill_viridis(end = 0.7) # from the viridis package

Faceting

Different categorical x-axes: https://stackoverflow.com/questions/45019839/ggplot2-different-facet-width-for-categorical-x-axis

Themes

This is the theme_set() that I might use for now.

# add fonts (this might not be a necessary step)
# font_add_google() is exported by sysfonts, which library(showtext) attaches
sysfonts::font_add_google(name = "Fira Sans", family = "fira")
sysfonts::font_add_google(name = "Roboto", family = "roboto")

# themes and text defaults
theme_set(
  theme_minimal() +
    theme(
      legend.position = "bottom",
      plot.title = element_text(family = "fira"),
      text = element_text(family = "roboto")
    )
)

Labels

Use str_wrap() around different graphic elements to automatically wrap captions/text/legend labels. Sample code below:

top_df %>%
  ggplot(aes(x = date, y = as.numeric(rank), color = str_wrap(game, 20))) +
  geom_point() +
  geom_bump() +
  scale_y_reverse(limits = c(10, 1), n.breaks = 10) +
  labs(
    title = "Top Games Streamed on Twitch",
    subtitle = str_wrap("Games shown are a subset of data with the top 200 ranked games over time. Each of these games has consistently ranked in the top 200, but not necessarily top 10 throughout the years.")
  ) +
  guides(col = guide_legend(ncol = 3))

If you want to wrap legend labels but keep factor levels, use the following helper function (thanks Hadley Wickham!)

# for wrapping legend labels while keeping original factor levels
# https://github.com/tidyverse/stringr/issues/107
str_wrap_factor <- function(x, ...) {
  levels(x) <- str_wrap(levels(x), ...)
  x
}
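A quick usage sketch for that helper on a made-up factor, to show that only the level labels change while the level order stays put. The helper is repeated so the snippet runs on its own.

```r
library(stringr)

# same helper as above, repeated so this runs standalone
str_wrap_factor <- function(x, ...) {
  levels(x) <- str_wrap(levels(x), ...)
  x
}

# made-up factor with a deliberate, non-alphabetical level order
game <- factor(
  c("A very long game title", "Short"),
  levels = c("Short", "A very long game title")
)

wrapped <- str_wrap_factor(game, width = 10)
levels(wrapped) # "Short" is still the first level; the long label now has line breaks
```

In a plot you would use it inside aes(), for example color = str_wrap_factor(game, 20).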

How to customize which legends are shown based on aesthetic: guides(). Example:

data %>%
  ggplot(aes(x = type, y = fct_rev(abb), size = n, color = n)) +
  geom_point() +
  labs(
    title = "TITLE",
    x = "",
    y = "",
    color = "",
    caption = "Data Source: DATA_SOURCE\nVisualization: Allison Koh"
  ) +
  guides(size = "none")

Fonts

{extrafont} and {showtext} are useful for adding different fonts to viz. The former loads existing fonts from your system; the latter makes sure your text shows up in all graphics (and loads fonts from Google and other places).

LIFE HACK (or more likely, common sense thing that I often forget): Make sure to include font families in theme_set() at the top of a script instead of in individual graphics.

Useful lines of code for {extrafont} are as follows:

library(extrafont)

# load in system fonts
# (run font_import() once beforehand to build the font database)
loadfonts()

# show font names
fonts()

# show a data frame of all fonts available
fonttable()

Useful lines of code for {showtext} are as follows:

showtext_auto() # put at the beginning of a script to automatically show text in new graphics devices

geom_bernie()

Add Bernie Sanders to your plots :D because why not

Install

remotes::install_github("R-CoderDotCom/ggbernie@main")

Geom

geom_bernie(aes(x = 1930, y = 20100), bernie = "sitting")

Alt Text

Helper Function

# helper function for writing alt text
# https://twitter.com/thomas_mock/status/1375853258145734660
write_alt_text <- function(
  chart_type,
  type_of_data,
  reason,
  misc,
  source
){
  glue::glue(
    "{chart_type} of {type_of_data} where {reason}. \n\n{misc}\n\nData source from {source}"
  )
}
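A usage sketch for the helper, with made-up argument values. The function is repeated in compact form so the snippet runs on its own.

```r
library(glue)

# same helper as above, repeated so this runs standalone
write_alt_text <- function(chart_type, type_of_data, reason, misc, source) {
  glue(
    "{chart_type} of {type_of_data} where {reason}. \n\n{misc}\n\nData source from {source}"
  )
}

# all argument values below are hypothetical
alt <- write_alt_text(
  chart_type   = "Bump chart",
  type_of_data = "game rankings over time",
  reason       = "a few titles stay in the top 10 throughout",
  misc         = "Each line is one game; lower rank numbers are better.",
  source       = "a hypothetical rankings dataset"
)
```

The result is a single string that can be passed to wherever your output format accepts alt text.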

Examples

The {TidyTuesdayAltText} package contains examples of AltText from #TidyTuesday posts between 2019 and 2021.

A future version of this package will include an annotated dataset of alt text + ratings according to feature: https://twitter.com/spcanelon/status/1405488036989870080. Until it is integrated into the package, the data can be found here: https://github.com/spcanelon/csvConf2021/blob/main/data/annotatedRubric1.csv

devtools::install_github("spcanelon/TidyTuesdayAltText")

Palette from picture with {paletteR}

https://datascienceplus.com/how-to-use-paletter-to-automagically-build-palettes-from-pictures/

devtools::install_github("andreacirilloac/paletter")
library(paletter)

# store the palette so it can be reused below
pal <- create_palette(image_path = "~/Desktop/410px-Piero_della_Francesca_046.jpg",
                      number_of_colors = 20,
                      type_of_variable = "categorical")

Test palette with pie() function

pie(rep(1, length(pal)), col = pal)

Combining PDFs

https://stackoverflow.com/questions/17552917/merging-existing-pdf-files-using-r

install.packages("qpdf")
qpdf::pdf_combine(input = c("file.pdf", "file2.pdf", "file3.pdf"),
                  output = "output.pdf")

RSelenium

Reset port (for error message: Selenium server signals port = 4444 is already in use.): https://stackoverflow.com/questions/74708282/rselenium-is-not-working-when-creating-servers

library(qdapRegex) # rm_white()
library(stringr)   # word()

# clear busy port in Windows
port <- 4444L
tintern <- system("netstat -a -n -o", intern = TRUE)
irow1 <- grep(as.character(port), tintern)
if (length(irow1) > 0) {
  trow <- tintern[irow1[1]]      # first netstat line that mentions the port
  trow <- trimws(rm_white(trow)) # collapse repeated whitespace
  tpid <- word(trow, -1, -1)     # the PID is the last field on the line
  system(paste0("taskkill /pid ", tpid, " /F"))
}

Allison Koh 2025 • Made with Quarto