How to Use Survey Weights in R

How to Use Survey Weights in R

Survey weights are common in large-scale government-funded data collections. For example, NHIS and NHANES are two large scale surveys that track the health and well-being of Americans that have survey weights. These data collections use complex and multi-stage survey sampling to ensure that results are representative of the U.S. population. Although use of survey weights is sometimes contested in regression analyses, they are needed for simple means and proportions. The general guidance is that if analysts can control for the factors that were used to create the weights in their analyses, then using weights might not be necessary and will inflate standard errors (and therefore p-values). In some of my analyses, however, there are variables used to create weights that I do not have access to such as geographic variables or specific household demographics, and therefore, I use the weights.

In this post, I will show how to use the excellent survey package in R to “attach” the survey weights to NHANES data. I then show how to calculate a simple mean and regression that incorporate the weights. In addition, I show how to limit the analysis to a subpopulation.

################################################################################
# General Information #
################################################################################
 
# This is an RScript for showing how to use survey weights with NHANES data. You
# can find a narrative to this script at:
# https://stylizeddata.com/how-to-use-survey-weights-in-r/
 
################################################################################
# Packages #
################################################################################
 
# For reading SAS XPT file from NHANES website
# haven::read_xpt
 
library(haven)
 
# For using survey weights
# survey::svydesign, svymean, svyglm
 
library(survey)
 
# For data wrangling
#dplyr::select, mutate, select, recode
 
library(dplyr)
 
 
################################################################################
# Load the dataset #
################################################################################
 
# Import NHANES demographic data
 
nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))
 
################################################################################
# Data Wrangling #
################################################################################
 
# Copy and rename variables so they are more intuitive. "fpl" is percent of the
# of the federal poverty level. It ranges from 0 to 5.
 
nhanesDemo$fpl        <- nhanesDemo$INDFMPIR
 
nhanesDemo$age        <- nhanesDemo$RIDAGEYR
 
nhanesDemo$gender     <- nhanesDemo$RIAGENDR
 
nhanesDemo$persWeight <- nhanesDemo$WTINT2YR
 
nhanesDemo$psu        <- nhanesDemo$SDMVPSU
 
nhanesDemo$strata     <- nhanesDemo$SDMVSTRA
 
# Since there are 47 variables, we will select only the variables we will use in
# this analysis.
 
nhanesAnalysis <- nhanesDemo %>%
                    select(fpl,
                           age,
                           gender,
                           persWeight,
                           psu,
                           strata)
 
# Recode gender
 
nhanesAnalysis <- nhanesAnalysis %>%
                    mutate(gender = recode(gender, `1` = 0L,
                                                   `2` = 1L))
 
 
# Convert "gender" to a factor variable. We need to do this so it isn't treated
# as a continuous variable in our analyses
 
nhanesAnalysis$gender <- as.factor(nhanesAnalysis$gender)
 
################################################################################
# Survey Weights #
################################################################################
 
# Here we use "svydesign" to assign the weights. We will use this new design
# variable "nhanesDesign" when running our analyses.
 
nhanesDesign <- svydesign(id      = ~psu,
                          strata  = ~strata,
                          weights = ~persWeight,
                          nest    = TRUE,
                          data    = nhanesAnalysis)
 
# Here we use "subset" to tell "nhanesDesign" that we want to only look at a
# specific subpopulation (i.e., those age between 18-79 years). This is
# important to do. If you don't do this and just restrict it in a different way
# your estimates won't have correct SEs.
 
ageDesign <- subset(nhanesDesign, age > 17 &
                                  age < 80)
 
################################################################################
# Statistics #
################################################################################
 
# We will use "svymean" to calculate the population mean for age. The na.rm
# argument "TRUE" excludes missing values from the calculation. We see that
# the mean age is 45.648 and the standard error is 0.5131.
 
svymean(~age, ageDesign, na.rm = TRUE)
 
# Since gender is a factor variable, "svymean" will treat it as such and give us
# the proportion of women. We see that men are 48.601% and woman are 51.399% of
# the population in this age of 18 to 79.
 
svymean(~gender, ageDesign, na.rm = TRUE)
 
# Now we will run a general linear model (glm) with a gaussian link function.
# We tell svyglm that nhanesAnalysis is the dataset to use and to apply the
# "svydesign" object "ageDesign." I won't dive into the results here, but you
# can see that age is positively correlated with FPL and that women are
# predicted to have a lower FPL than men.
 
output <- svyglm(fpl ~ age +
                       gender,
          family = gaussian(),
          data   = nhanesAnalysis,
          design = ageDesign)

That’s it! Incorporating survey weights in R is pretty straight forward, thanks to the survey package.

Analysis done using R and RStudio.

5 thoughts on “How to Use Survey Weights in R

  1. Just a follow-up question regarding these NHANES weights. if I’m interested in a NHANES interview exposure and a NHANES laboratory value outcome, do I use both exam weights and interview weights? Or is one more appropriate than another?

    1. If you are using variables that were measured in the Mobile Examination Center (MEC), then you should use the MEC weights, even if you are also using variables from the at-home interview. Also note that some variables were only measured in a subpopulation of NHANES participants, and there are special weights for those variables that should be used over the interview and MEC weights. Check out page 17 of this NHANES guide, it is pretty helpful (https://www.cdc.gov/nchs/data/series/sr_02/sr02_177.pdf).

  2. On the following line a get an error:

    nhanesAnalysis %
    mutate(gender = recode(gender, `1` = 0L,
    `2` = 1L))
    It sais:
    Error in recode(gender, `1` = 0L, `2` = 1L) :
    unused arguments (`1` = 0, `2` = 1)

    1. At it might have to do with me having the car package loaded with a function recode() in it. If so, please delete both my comments and have a nice day!

      1. Yep, you got it. The issue is with the car package. You get the following message upon loading car when dplyr is also loaded:

        The following object is masked from ‘package:dplyr’:

        recode

        If you need to have both packages loaded, you can specify “dplyr::recode” and that will work too.

        I will leave your comment here in case someone else has the same issue!

Leave a Reply

Your email address will not be published. Required fields are marked *