Stata to R: How to Tabulate a Categorical Variable

When working with a data set, one of the first things I do is look at the count and relative frequency of the categorical variables I care about. In Stata, this is straightforward with the tab command. In R it takes a few more lines of code, but it is easy to do with the dplyr package.

You might also be interested in my other posts on getting started with R:

How to Rename Variables in R – Stylized Data

How to Recode Factor and Character Variables in R – Stylized Data

Here is how you do it.

# First, let's create a practice data set in R, called “gimmeCaffeine.”
# It has 4 variables (coffee, origin, monthProduced, rating). We will use tribble(),
# which dplyr re-exports from the tibble package, to create a tidy data set.

# Load the dplyr package

library(dplyr)

gimmeCaffeine <- tribble(
                   ~coffee,    ~origin,    ~monthProduced, ~rating,
                   "light",    "colombia", "3",            "A+",
                   "medium",   "ethiopia", "12",           NA,
                   "dark",     "peru",     "7",            "B")

# Now tabulate. Here “gimmeCaffeine” is the name of your data set and “coffee” is
# whichever variable you want to tabulate. I know this seems like a lot of code
# compared to Stata (and it is), but you can copy/paste it easily into R. This code
# just calculates by hand what Stata does behind the scenes with tab.

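gimmeCaffeine %>%
  group_by(coffee) %>%
  summarise(n = n()) %>%                      # frequency of each category
  mutate(
    totalN = cumsum(n),                       # cumulative frequency
    percent = round(n / sum(n), 3),           # proportion (multiply by 100 for a percent)
    cumuPer = round(cumsum(n / sum(n)), 3))   # cumulative proportion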
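If the group_by() and summarise() pair feels heavy, dplyr's count() does the same counting in one step. The sketch below is one possible shortcut rather than the method used in this post; the name coffeeTab is just an illustrative label for the result.

```r
library(dplyr)

# Same practice data as above, repeated so this chunk runs on its own
gimmeCaffeine <- tribble(
  ~coffee,  ~origin,    ~monthProduced, ~rating,
  "light",  "colombia", "3",            "A+",
  "medium", "ethiopia", "12",           NA,
  "dark",   "peru",     "7",            "B")

# count(coffee) is shorthand for group_by(coffee) %>% summarise(n = n())
coffeeTab <- gimmeCaffeine %>%
  count(coffee) %>%
  mutate(
    totalN  = cumsum(n),                      # cumulative frequency
    percent = round(n / sum(n), 3),           # proportion of all rows
    cumuPer = round(cumsum(n / sum(n)), 3))   # cumulative proportion

coffeeTab
```

The columns come out the same as in the longer version, so which one to use is mostly a matter of taste.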

# If your variable has some missing data that you don't want to include in the
# tab (like "rating" does), then you just "filter" it out before grouping.
# Without the filter, group_by() keeps NA as its own category.

gimmeCaffeine %>%
  filter(!is.na(rating)) %>%
  group_by(rating) %>%
  summarise(n = n()) %>%
  mutate(
    totalN = cumsum(n),
    percent = round(n / sum(n), 3),
    cumuPer = round(cumsum(n / sum(n)), 3))

Although this is a decent chunk of code for a simple task, I usually just copy/paste it to the top of my R script, so I am not actually typing it out each time I use it. It also gives you a feel for how things are done in the tidyverse. filter, group_by, summarise, and mutate are all common verbs, and once you get used to the grammar of the tidyverse, it becomes logical and even easy. You will feel like you have full control over your data set.
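If you would rather not paste the whole chunk each time, one option is to wrap it in a small function. This is only a sketch: tabVar is a made-up name (it is not part of dplyr), and it assumes a version of dplyr recent enough to support the {{ }} "embrace" syntax for passing a bare column name.

```r
library(dplyr)

# Same practice data as above, repeated so this chunk runs on its own
gimmeCaffeine <- tribble(
  ~coffee,  ~origin,    ~monthProduced, ~rating,
  "light",  "colombia", "3",            "A+",
  "medium", "ethiopia", "12",           NA,
  "dark",   "peru",     "7",            "B")

# tabVar() is a hypothetical helper, not a dplyr function.
# data: a data frame; var: the unquoted name of the column to tabulate.
tabVar <- function(data, var) {
  data %>%
    group_by({{ var }}) %>%
    summarise(n = n()) %>%
    mutate(
      totalN  = cumsum(n),                      # cumulative frequency
      percent = round(n / sum(n), 3),           # proportion of all rows
      cumuPer = round(cumsum(n / sum(n)), 3))   # cumulative proportion
}

# Tabulate coffee across all rows
coffeeTab <- tabVar(gimmeCaffeine, coffee)

# Tabulate rating, dropping the missing value first
ratingTab <- gimmeCaffeine %>%
  filter(!is.na(rating)) %>%
  tabVar(rating)
```

Because the data frame is the first argument, the helper slots into a pipe after filter() just like the built-in verbs do.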

P.S. – As a commenter pointed out, there is also the epiDisplay package, which achieves a similar result via its tab1 function. I personally like the dplyr method better, but check it out if you are interested!

Analysis done using R and RStudio.

3 thoughts on “Stata to R: How to Tabulate a Categorical Variable”

  1. The trick is to do it with the tidyverse in a consistent way/language and not to load new packages for every single step. Love it, thank you very much for this post.
