# Stata to R: How to Tabulate a Categorical Variable

When working with a data set, one of the first things I do is look at the count and relative frequency of categorical variables of interest. In Stata, this is relatively straightforward with the `tab` command. In R, however, it isn't quite as straightforward, but it is still possible. The equivalent code in R is:

    dataset %>%
      group_by(var) %>%
      summarise(n = n()) %>%
      mutate(
        totalN = cumsum(n),
        percent = round(n / sum(n), 3),
        cumpercent = round(cumsum(n / sum(n)), 3)
      )

Where `dataset` is the name of your dataset and `var` is whichever variable you want to tabulate. To use this code, you need to have the R package dplyr installed.

Here is a reproducible example. First, download and install the needed package:

    install.packages("dplyr")  # the package name must be quoted
    library(dplyr)

Next we will create a new data set in R, called `dataset`. It has 5 variables (i.e., var1, var2, etc.). `tribble` is just a command to quickly and easily enter an example data set into R.

    dataset <- tribble(
      ~var1, ~var2, ~var3, ~var4, ~var5,
      "1", "1", "1", "a", "d",
      "2", "2", "2", "b", "e",
      "3", "3", "3", "c", "f"
    )

If we want to "tabulate" var1 in R, then we run the following code:

    dataset %>%
      group_by(var1) %>%
      summarise(n = n()) %>%
      mutate(
        totalN = cumsum(n),
        percent = round(n / sum(n), 3),
        cumpercent = round(cumsum(n / sum(n)), 3)
      )

This gives the following result:

    # A tibble: 3 x 5
      var1      n totalN percent cumpercent
      1         1      1   0.333      0.333
      2         1      2   0.333      0.667
      3         1      3   0.333      1.000

This gives us something very similar to what Stata would produce, albeit with more lines of code to write. We have: 1) each level of var1, 2) the number of observations within each level, 3) the cumulative number of observations, 4) the percent contribution of each level (expressed as a proportion), and 5) the cumulative percent contribution. There is a way to turn this block of code into a function so you don't have to write it out each time, but we will save that for another lesson. For now, it is easy enough to just put variables one-by-one into `group_by`.
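As a small aside, the `group_by()` plus `summarise(n = n())` pair used in the tabulation code can probably be shortened with dplyr's `count()`, which does both steps in one call. The sketch below assumes the same example data as before; the object name `tab_var1` is just an illustrative name and is not part of the original code.

```r
library(dplyr)

# Same example data set used in the post
dataset <- tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5,
  "1", "1", "1", "a", "d",
  "2", "2", "2", "b", "e",
  "3", "3", "3", "c", "f"
)

# count(var1) is shorthand for group_by(var1) %>% summarise(n = n())
# tab_var1 is a hypothetical name chosen for this sketch
tab_var1 <- dataset %>%
  count(var1) %>%
  mutate(
    totalN = cumsum(n),                        # running count
    percent = round(n / sum(n), 3),            # share of each level
    cumpercent = round(cumsum(n / sum(n)), 3)  # running share
  )

tab_var1
```

This should return the same five columns as the longer version, so which form to use is mostly a matter of taste.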