How to do fuzzy matching in R

Fuzzy Wuzzy was a bear,
Fuzzy Wuzzy had no hair,
Fuzzy Wuzzy wasn’t very Fuzzy,
Was he?
— Extremely Relevant Children’s Rhyme
Fuzzy matching links two or more non-identical character strings together. Ideally, when linking data sets together, there would be a unique variable that identifies each row (or rows) in each data set. We do not, however, live in an ideal world. Often times when getting data from sources or systems that are not explicitly linked, we won’t have a perfect unique identifier. A lot of times though, we can 1) create a unique identifier with existing information, or 2) match based on similarity between strings. In this post, I will focus on the latter.
So why fuzzy schools? Well, the Virginia Department of Education does a great job of posting data about schools in the state. Most of the data are stored in CSV or XLSX files, which can be easily downloaded and read into R. The problem is that these files do not always have a unique identifier for each school. So linking the files is not that easy. The files almost always include the school name though and a Local Education Agency (LEA, kinda like a school district) identifier. The problem is that the school names are not perfectly formatted or spelled in each data set. For example, one file might have “Mt Vernon High” and another file might have “Mount Vernon High School.” Obviously these are the same school and we could easily match them manually. But when you have 2,000 schools, this would be time consuming. Or imagine doing this across the whole US, where we would have over 100,000 schools. Not fun.
To save a lot of time and reduce errors, we will use R to match the vast majority of rows based on the similarity of school names between two data sets. Here is some basic information before we jump into the code:
- The first file we will use is a report of LEAs and “Schools Eligible and Near Eligible to Participate in the Community Eligibility Provision (CEP) (as of April 1, 2018).” There is a link in the R script of where it can be downloaded, but you can also download it from StylizedData here (XLSX).
- The second file we will use is a list of schools with principal and other contact information. There is a link in the R script of where it can be downloaded, but you can also download it from StylizedData here (CSV).
We want to merge these two files so we have information on CEP eligibility and contact information for each school. Looking through the two files, it is clear that many of the school names match exactly, but a fair amount do not. Our goal is to do a fuzzy match to reduce the number of manual matches we have to do.
################################################################################ # Packages # ################################################################################ library(tidyverse) library(readxl) library(fuzzyjoin) ################################################################################ # Importing files # ################################################################################ # Set your working directory. Even better: set up an R Project! setwd("INSERT PATH TO WEAR YOU WANT FILES SAVED") #------------------------------------------------------------------------------- # List of schools in VA #------------------------------------------------------------------------------- # URL location of file. This is a download link. schoolListUrl <- "http://www.va-doeapp.com/Download.aspx?n=School%20Listing" # Location where R should save the file. Default is your working directory. schoolListFile <- "schoolList.csv" # Actually download the file. "wb" ensures it is binary and can easily be opened. download.file(schoolListUrl, schoolListFile, mode = "wb") # Open and save as R data set. Skipping rows that aren't needed. R knows # to skip the blank rows. schoolList <- read.csv("schoolList.csv", skip = 2, header = TRUE) save(schoolList, file = "schoolList.RData") # Remove for tidy environment rm(schoolListUrl, schoolListFile) #------------------------------------------------------------------------------- # CEP status of schools in VA #------------------------------------------------------------------------------- # URL location of csv. This is a download link. cepSchoolsUrl <- "http://www.doetest.virginia.gov/support/nutrition/statistics/cep/2017-18/2018-4-1-community-eligibility.xlsx" # Location where R should save the file. cepSchoolsFile <- "cepSchools.xlsx" # Actually download the file. "wb" ensures it is binary and can easily be opened. download.file(cepSchoolsUrl, cepSchoolsFile, mode = "wb") # Open and save as R data set. I am specifying how to import each column. It is # annoying, but readxl doesn't allow you to specify which column, you have to # do them all if you want to do one. cepSchools <- read_excel( "cepSchools.xlsx", sheet = 2, skip = 4, col_types = c("numeric", "text", "numeric", "text", "numeric", "text", "text", "text", "text", "text", "text", "numeric", "numeric", "text", "guess")) # Save the file save(cepSchools, file = "cepSchools.RData") # Remove for tidy environment rm(cepSchoolsUrl, cepSchoolsFile) ################################################################################ # Data Cleaning # ################################################################################ # Load data sets load("schoolList.RData") # Renaming some of the columns. The numbers in the bracket indicate the # column position. names(schoolList)[1] <- "leaId" names(schoolList)[5] <- "schoolName" names(schoolList)[9] <- "schoolDesc" load("cepSchools.RData") # Renaming the column names. Note: some were formatted really weirdly. This # renaming takes care of that. names(cepSchools)[1] <- "leaId" names(cepSchools)[4] <- "schoolName" names(cepSchools)[5] <- "isp" names(cepSchools)[6] <- "eligible" names(cepSchools)[7] <- "nearEligible" names(cepSchools)[8] <- "participating" names(cepSchools)[12] <- "identifiedStudents" names(cepSchools)[13] <- "totalStudents" # Subsetting the data sets to keep only the variables I want. schoolSubset <- subset( schoolList, select = c("leaId", "schoolName", "schoolDesc", "Address1", "Address2", "City", "State", "Zip")) cepSubset <- subset( cepSchools, select = c("leaId", "schoolName", "isp", "eligible", "nearEligible", "participating", "identifiedStudents", "totalStudents")) # I am cleaning up some of the names so they better match with using regex. # We are normalizing the strings for a better match. # Note: \\b ensures that only, for example, ELEM is replaced. If you don't do # this, then the "ELEM" in "ELEMENTARY" will also be replaced (ie, ELEMELEMENTARY) cepSubset$schoolName <- gsub("ELEM\\b","ELEMENTARY", cepSubset$schoolName) %>% gsub("CTR","CENTER", .) %>% gsub("ACAD\\b","ACADEMY", .) %>% gsub("PRGM","PROGRAM", .) %>% gsub("MID\\b","MIDDLE", .) %>% gsub("\\.","", .) %>% gsub("\\bCO\\b","COUNTY", .) %>% gsub("\\bED\\b","EDUCATION", .) %>% gsub("\\bMT\\b","Mount", .) %>% gsub("\\bFT\\b","FORT", .) %>% gsub("\\bALT\\b","ALTERNATIVE", .) %>% gsub("\\bINT\\b","INTERMEDIATE", .) %>% gsub("\\bSP\\b","SPECIAL", .) %>% gsub("\\bRD\\b","ROAD", .) %>% gsub("\\bSR\\b","SENIOR", .) %>% gsub("\\bHS\\b","High School", .) %>% gsub("\\-"," ", .) schoolSubset$schoolName <- gsub("\\.","", schoolSubset$schoolName) %>% gsub("\\-"," ", .) %>% gsub("\\bSR\\b","SENIOR", .) # Removes white spaces around single two single letters (A B to AB) cepSubset$schoolName <- gsub("(?<=\\b\\w)\\s(?=\\w\\b)", "", perl = TRUE, cepSubset$schoolName) # Convert "leaId" to a numeric. Note it is critical to use "as.character" here. # Otherwise, R doesn't convert it properly. schoolSubset$leaId <- as.numeric(as.character(schoolSubset$leaId)) # Combining the LEA ID with the school name to better match schools with the # same name in different districts. If I don't do this, it will match two # schools in different districts that have the same name. schoolSubset$schoolId <- as.character(paste(schoolSubset$leaId, schoolSubset$schoolName, sep = " ")) cepSubset$schoolId <- as.character(paste(cepSubset$leaId, cepSubset$schoolName, sep = " ")) # Let's filter out some of the school types we know won't have CEP or can't # be matched. schoolSubset <- filter(schoolSubset, schoolDesc != "Local or Regional Jail") schoolSubset <- filter(schoolSubset, schoolDesc != "Unknown") schoolSubset <- filter(schoolSubset, schoolDesc != "") # This is the actual matching command (fuzzyjoin::stringdist_join). # Notes: # # "mode = left" means we only want to keep rows that match using "cepSubset" # as the reference. # # "ignore_case" means ignore the letter case (e.g. capitals) # # "method = jw" is the Jaro-Winkler matching algorithm. fuzzyjoin has quite a # to choose from, and it is good to experiment to see which one works best. # # max dist is related to the jw algorithm, which ranges from 0-1, with 0 # indicating a perfect match. In this scenario, I only want to keep matches # that have a jw distance of 0.08737 or less. # # distance_col is just naming the distance column. Put whatever you want. # # group_by and top_n allow us to only keep those with values less than 0.08737. matchDistance <- stringdist_join( cepSubset, schoolSubset, by = "schoolId", mode = "left", ignore_case = TRUE, method = "jw", max_dist = 0.087375887, distance_col = "dist") %>% group_by(schoolId.x) %>% top_n(1, -dist) # This is an "anti-join" that allows us to see only those rows in "cepSubset" # that didn't match. noMatch <- stringdist_join(cepSubset, schoolSubset, by = "schoolId", mode = "anti", ignore_case = TRUE, method = "jw", max_dist = 0.087375887, distance_col = "dist")
So we correctly matched 1,731 out of 1,894 schools in the CEP data set. That is a match rate of 91 percent! We saved so much time. And looking through the schools that did not match, it is apparent most of them could easily be linked manually. Also, if you do not normalize the school names between the data sets, only 1,465 match (77 percent). So we gain a lot by using regular expressions (regex) to normalize that data sets.
2 thoughts on “How to do fuzzy matching in R”
Regarding this:
# Note: \\b ensures that only, for example, ELEM is replaced. If you don’t do
# this, then the “ELEM” in “ELEMENTARY” will also be replaced (ie, ELEMELEMENTARY)
If we don’t include \\b, wouldn’t the result be ELEMENTARYENTARY, and not ELEMELEMENTARY? Or am I misunderstanding what we are doing with the regex?
Hi Katy. You are correct! That was on oversight on my part. Thank you for pointing it out. Best, Mike