# Stata to R: How to Tabulate a Categorical Variable

When working with a data set, one of the first things I do is look at the count and relative frequency of the categorical variables of interest. In Stata, this is straightforward with the `tab` command. In R it takes a few more steps, but it is still possible. The equivalent code in R is:

```r
dataset %>%
  group_by(var) %>%
  summarise(n = n()) %>%
  mutate(totalN = cumsum(n),
         percent = round(n / sum(n), 3),
         cumpercent = round(cumsum(n / sum(n)), 3))
```

Here, `dataset` is the name of your dataset and `var` is whichever variable you want to tabulate. To use this code, you need to have the R package dplyr installed. Here is a reproducible example.

```r
# Install dplyr once (the package name must be quoted), then load it
install.packages("dplyr")
library(dplyr)
```

Next we will create a new data set in R called `dataset`. It has five variables (`var1`, `var2`, and so on). `tribble` is just a function for quickly and easily entering an example data set into R, row by row.

```r
dataset <- tribble(
  ~var1, ~var2, ~var3, ~var4, ~var5,
  "1",   "1",   "1",   "a",   "d",
  "2",   "2",   "2",   "b",   "e",
  "3",   "3",   "3",   "c",   "f"
)
```

If we want to "tabulate" `var1` in R, we run the following code:

```r
dataset %>%
  group_by(var1) %>%                 # one group per level of var1
  summarise(n = n()) %>%             # count of observations in each level
  mutate(totalN = cumsum(n),         # cumulative count
         percent = round(n / sum(n), 3),                # share of each level
         cumpercent = round(cumsum(n / sum(n)), 3))     # cumulative share
```

This gives the following result:

```
# A tibble: 3 x 5
var1      n totalN percent cumpercent
1         1      1   0.333      0.333
2         1      2   0.333      0.667
3         1      3   0.333      1.000
```
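Because each level of `var1` occurs exactly once in the example data, every count is 1 and the cumulative columns are not very informative. The sketch below applies the same pipeline to a small made-up data set with repeated values. The names `example`, `group`, and `result` are hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical data: "a" appears three times, "b" twice, "c" once
example <- tibble(group = c("a", "b", "a", "c", "a", "b"))

result <- example %>%
  group_by(group) %>%
  summarise(n = n()) %>%
  mutate(totalN = cumsum(n),
         percent = round(n / sum(n), 3),
         cumpercent = round(cumsum(n / sum(n)), 3))

# Levels are sorted alphabetically, so the rows are a, b, c:
#   n          = 3, 2, 1
#   totalN     = 3, 5, 6
#   percent    = 0.5, 0.333, 0.167
#   cumpercent = 0.5, 0.833, 1
print(result)
```

With unequal counts, `totalN` and `cumpercent` show how the observations accumulate across levels, which is the part of Stata's output that a plain count does not give you.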

The resulting table is very similar to what Stata would produce, albeit with more lines of code to write. We have: 1) each level of var1, 2) the number of observations in each level, 3) the cumulative number of observations, 4) the share of observations in each level (the `percent` column, expressed as a proportion), and 5) the cumulative share. There is a way to turn this block of code into a function so you don't have to write it out each time, but we will save that for another lesson. For now, it is easy enough to put variables one by one into `group_by`.
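As a rough preview of the function idea, here is a minimal sketch of one way it might look. It is not the only approach: it assumes a version of dplyr recent enough to support the `{{ }}` (curly-curly) operator for passing a column name into `group_by`, and the helper name `tab_var` is hypothetical, not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: tabulate one column of a data frame.
# `data` is a data frame; `var` is an unquoted column name.
tab_var <- function(data, var) {
  data %>%
    group_by({{ var }}) %>%
    summarise(n = n()) %>%
    mutate(totalN = cumsum(n),
           percent = round(n / sum(n), 3),
           cumpercent = round(cumsum(n / sum(n)), 3))
}

# Small example data, using two of the columns from the data set above
dataset <- tribble(
  ~var1, ~var4,
  "1",   "a",
  "2",   "b",
  "3",   "c"
)

res <- tab_var(dataset, var4)
print(res)
```

Calling `tab_var(dataset, var1)` would tabulate `var1` in the same way, so the pipeline only has to be written once.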

Analysis done using R and RStudio.