Stata to R: How to Tabulate a Categorical Variable

When working with a data set, one of the first things I do is look at the count and relative frequency of the categorical variables of interest. In Stata, this is straightforward with the tab command. In R, it takes a few more lines of code, but it is still possible. The equivalent code in R is:

dataset %>%
      group_by(var) %>%
      summarise(n = n()) %>%                      # count of rows in each level
      mutate(totalN = cumsum(n),                  # running count
             percent = round(n / sum(n), 3),      # share of all rows (a proportion)
             cumpercent = round(cumsum(n / sum(n)), 3))  # running share

Here, “dataset” is the name of your data set and “var” is whichever variable you want to tabulate.
To use this code, you need to have the R package dplyr installed. Here is a reproducible example.
First, install and load the package:

install.packages("dplyr")

library(dplyr)

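Next we will create a new data set in R, called “dataset.” It has 5 variables (i.e., var1, var2, etc.), and every value is entered in quotes, so all five are stored as character strings.
tribble() is a function, made available by dplyr, for quickly entering a small example data set row by row.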

dataset <- tribble(
           ~var1, ~var2, ~var3, ~var4, ~var5,
           "1",   "1",   "1",   "a",   "d",
           "2",   "2",   "2",   "b",   "e",
           "3",   "3",   "3",   "c",   "f")
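Before tabulating, it can help to confirm how R stored each column. The following is a minimal sketch, assuming dplyr is loaded; it re-enters the same example data so it runs on its own, and the name col_types is only illustrative.

```r
library(dplyr)

# Same example data as above, re-entered so this snippet is self-contained
dataset <- tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5,
  "1",   "1",   "1",   "a",   "d",
  "2",   "2",   "2",   "b",   "e",
  "3",   "3",   "3",   "c",   "f")

# One line per column, showing its type and first few values
glimpse(dataset)

# Column types as a named character vector (col_types is an illustrative name)
col_types <- sapply(dataset, class)
col_types
```

Because the values were typed in quotes, every column should come back as character, including var1 through var3, even though they look numeric.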

If we want to “tabulate” var1 in R, then we run the following code:

dataset %>%
      group_by(var1) %>%
      summarise(n = n()) %>%
      mutate(totalN = cumsum(n),
             percent = round(n / sum(n), 3),
             cumpercent = round(cumsum(n / sum(n)), 3))

This gives the following result:

# A tibble: 3 x 5
   var1     n totalN percent cumpercent
  <chr> <int>  <int>   <dbl>      <dbl>
      1     1      1   0.333      0.333
      2     1      2   0.333      0.667
      3     1      3   0.333      1.000
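If the group_by() and summarise() pair feels verbose, dplyr's count() function should give the same counts in one step. This is a sketch of that alternative rather than the method used in this post; the name tab_var1 is only illustrative, and the example data is re-entered so the snippet runs on its own.

```r
library(dplyr)

# Same example data as above
dataset <- tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5,
  "1",   "1",   "1",   "a",   "d",
  "2",   "2",   "2",   "b",   "e",
  "3",   "3",   "3",   "c",   "f")

# count(var1) is shorthand for group_by(var1) %>% summarise(n = n())
tab_var1 <- dataset %>%
  count(var1) %>%
  mutate(totalN = cumsum(n),
         percent = round(n / sum(n), 3),
         cumpercent = round(cumsum(n / sum(n)), 3))

tab_var1
```

The mutate() step is unchanged, so the columns keep the same names and meaning as in the longer version.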

This gives us something very similar to what Stata would produce, albeit with more lines of code to write. We have: 1) each level of var1, 2) the number of observations within each level, 3) the cumulative number of observations, 4) the share of observations in each level, and 5) the cumulative share. Note that percent and cumpercent are reported as proportions, so 0.333 means 33.3%. There is a way to turn this block of code into a function so you don’t have to write it out each time, but we will save that for another lesson. For now, it is easy enough to just put variables one-by-one into group_by.
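As a brief preview of the function idea, here is a minimal sketch. It assumes a reasonably recent dplyr (1.0 or later) that supports the {{ }} operator for passing a column name into a function. The name tab_one and its arguments are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical helper: tab_one() and its argument names are illustrative only.
# {{ var }} passes the unquoted column name through to group_by().
tab_one <- function(data, var) {
  data %>%
    group_by({{ var }}) %>%
    summarise(n = n()) %>%
    mutate(totalN = cumsum(n),
           percent = round(n / sum(n), 3),
           cumpercent = round(cumsum(n / sum(n)), 3))
}

# Same example data as above
dataset <- tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5,
  "1",   "1",   "1",   "a",   "d",
  "2",   "2",   "2",   "b",   "e",
  "3",   "3",   "3",   "c",   "f")

# Tabulate a different variable without rewriting the pipeline
result <- tab_one(dataset, var4)
result
```

With a helper like this, tabulating another variable is a one-line call such as tab_one(dataset, var5).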

Analysis done using R and RStudio.
