How to Recode Factor and Character Variables in R
Recoding factor and character variable values is a common task in data analysis. Although common, it isn’t as easy as you might expect in R, especially compared to Stata and SAS. I’ve found that people often get the most frustrated with these basic tasks when learning R, so in this post my goal is to take away that frustration.
Often we want to use factor variables that can be dummy coded and easily used in regression models. Other times we want to rename categorical variables so they look better in visualizations or when presenting data.
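To make "dummy coded" concrete, here is a minimal sketch using base R's model.matrix. The stand-alone roast vector is a hypothetical example, not part of the practice data set built later in this post.

```r
# Hypothetical stand-alone factor with an explicit level order
roast <- factor(c("light", "medium", "dark"),
                levels = c("light", "medium", "dark"))

# model.matrix() expands the factor into 0/1 dummy columns,
# using the first level ("light") as the reference category
dummies <- model.matrix(~ roast)
print(dummies)
```

Regression functions such as lm() do this expansion for you when a predictor is a factor, which is why getting the factor coding right matters.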
Note that although I say “character” variables, origin is technically a “string” variable: a string is a sequence of characters. R calls this type character (it prints as <chr> in a tibble), and the terms are often used interchangeably, but I wanted you to be aware of the difference.
You might also be interested in my other post: Renaming Variables in R.
# You can copy/paste into the RStudio console or into an R script
# and it will run perfectly!
# Load the needed libraries
library(dplyr)
library(forcats)
# First, let's create a practice data set in R, called
# “gimmeCaffeine.” It has 2 variables (roast and origin).
# We will use dplyr::tribble (re-exported from the tibble package)
# to create a tidy data set.
gimmeCaffeine <- tribble(
  ~roast,   ~origin,
  "light",  "colombia",
  "medium", "ethiopia",
  "dark",   "peru")
print(gimmeCaffeine)
# Let's say we want "roast" to be a factor variable, with higher
# values indicating a darker roast. We can do this all in one
# block of code.
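# This calls dplyr::mutate and forcats::fct_recode.
# factor() sorts levels alphabetically by default (dark, light, medium),
# so we set the order explicitly to keep light < medium < dark.
gimmeCaffeine <- gimmeCaffeine %>%
  mutate(roastFactor = factor(roast, levels = c("light", "medium", "dark")) %>%
    fct_recode(
      "1" = "light",
      "2" = "medium",
      "3" = "dark"))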
# Here is an explanation of the code, line by line. var = variable.
# apply change in gimmeCaffeine dataset <- use gimmeCaffeine dataset
# make new factor var using old var
# recode instructions
# new value = old value
# new value = old value
# new value = old value
# Notice how "roastFactor" is now recoded as <fct> or factor
print(gimmeCaffeine)
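# A quick check, as a sketch: factor() orders levels alphabetically unless
# levels = is given, and fct_recode only renames levels without reordering
# them. The stand-alone roast vector below is a hypothetical copy of the
# roast column, so this block can be run separately.

```r
library(forcats)

# Hypothetical stand-alone vector mirroring the roast column
roast <- c("light", "medium", "dark")

# Default behavior: levels come out sorted alphabetically
alphabetical <- factor(roast)
print(levels(alphabetical))

# Explicit levels: light, then medium, then dark
ordered_by_roast <- factor(roast, levels = c("light", "medium", "dark"))

# fct_recode renames the levels but keeps their positions
recoded <- fct_recode(ordered_by_roast,
                      "1" = "light",
                      "2" = "medium",
                      "3" = "dark")
print(levels(recoded))

# The integer codes behind the factor follow the level order
print(as.integer(recoded))
```

# If the levels do not print in the order you intended, set levels =
# in factor() before recoding.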
# This is how we can change the factor levels.
gimmeCaffeine <- gimmeCaffeine %>%
  mutate(roastFactor = fct_recode(roastFactor,
    "0" = "1",
    "1" = "2",
    "2" = "3"))
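# One thing to keep in mind, shown here as a sketch: factor labels such as
# "0", "1", "2" are still text. The roast_factor vector below is a
# hypothetical stand-in for a factor with digit labels.

```r
# Hypothetical factor whose labels are the digits "0", "1", "2"
roast_factor <- factor(c("0", "1", "2"), levels = c("0", "1", "2"))

# as.integer() returns the internal codes, which always start at 1
codes <- as.integer(roast_factor)
print(codes)

# Converting through character first recovers the labels as numbers
values <- as.numeric(as.character(roast_factor))
print(values)
```

# If you need the labels as real numbers, convert through as.character()
# first rather than calling as.numeric() on the factor directly.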
# Let's say we want to recode "origin" to look more polished. For example,
# we are creating a graphic for presentation and want the capitalized
# adjective form of each country name.
# Using dplyr::mutate and dplyr::recode
# Note that recode uses the reverse of fct_recode. With
# recode the guide is old value = new value. This inconsistency was
# never changed in recode itself; dplyr 1.1.0 superseded recode in
# favor of case_match, but recode still works.
gimmeCaffeine <- gimmeCaffeine %>%
  mutate(origin = recode(origin,
    "colombia" = "Colombian",
    "ethiopia" = "Ethiopian",
    "peru" = "Peruvian"))
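# To see the two argument orders side by side, here is a small sketch on a
# hypothetical stand-alone origin vector. The base R lookup vector at the
# end is an alternative I am adding for comparison.

```r
library(dplyr)
library(forcats)

# Hypothetical stand-alone vector mirroring the original origin column
origin <- c("colombia", "ethiopia", "peru")

# dplyr::recode: old value = new value (returns a character vector)
with_recode <- recode(origin,
                      "colombia" = "Colombian",
                      "ethiopia" = "Ethiopian",
                      "peru" = "Peruvian")

# forcats::fct_recode: new value = old value (returns a factor)
with_fct_recode <- fct_recode(factor(origin),
                              "Colombian" = "colombia",
                              "Ethiopian" = "ethiopia",
                              "Peruvian" = "peru")

# Base R alternative: a named lookup vector, where names are the old values
lookup <- c(colombia = "Colombian", ethiopia = "Ethiopian", peru = "Peruvian")
with_lookup <- unname(lookup[origin])

print(with_recode)
print(as.character(with_fct_recode))
print(with_lookup)
```

# All three approaches give the same labels; they differ only in which
# side of the = sign holds the old value and in the type they return.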