Life is Like a Box of Samples, You Never Know What You’re Gonna Get

Good ol’ Forrest Gump. He is right, you never know what you’re going to get, and that is why we use resampling techniques in data science! In this post, I will review two popular resampling techniques for predictive models and give examples of how to implement them in R. But first, let’s review the basics.

What does “resample” mean?

Often we only have one dataset to examine and use to build prediction models. Let’s call this dataset our sample. Resampling just means that we are splitting our sample into bins and/or taking multiple draws from our sample. You can think of it as creating new samples from within our existing sample. We are not adding new data or new samples.

Why Resample?

Resampling allows us to see how well our model performs on “new” data. It helps us in a few key ways.

  1. It helps avoid overfitting a model to the peculiarities of our sample. It is possible to build a model that performs very well on your original sample, but performs poorly on a new sample. Think of it this way: if you get a tailored shirt that perfectly fits your body, you wouldn’t expect it to fit most other people, right? Same thing with predictive models. We want to make a shirt that fits most people, not one that only fits a few.
  2. It allows us to check model performance metrics. For example, we can get estimated accuracy, kappa, sensitivity and specificity on “new” data. Even if a model isn’t overfit, we still want to check its performance.
  3. It allows us to compare model performance between different types of models. Since we can use the same resamples with different models, we can easily check how different models perform.

Resampling Techniques

I will cover two popular resampling techniques.

  • k-fold cross-validation. Imagine you have a dataset with 1,000 observations. You randomly split the 1,000 observations into k groups (i.e., folds); common choices for k are 5, 10, 15, or 20, with 10 being the most common because it balances computing time and bias. With 10 folds, you hold out one fold, fit your model on the remaining 9 folds collectively, and evaluate it on the held-out fold. You repeat this process 10 times, holding out a different fold each time, and then average the results of the 10 prediction models. That is k-fold cross-validation! It is nice because every observation gets used for both training and testing. This is a very popular resampling technique. You can also repeat k-fold cross-validation multiple times, which is common, but I won’t cover that in this post.
  • Bootstrapping. Imagine again you have a dataset with 1,000 observations. This time you take a random sample with replacement until you have a new dataset with 1,000 observations. Since sampling is with replacement, some observations will be drawn multiple times, but that is okay. The observations that are drawn form the training dataset, and the ones that are never drawn form the testing dataset. The observations not selected are commonly referred to as the “out-of-bag” sample. (A minimal base-R sketch of both techniques follows this list.)

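Before we get to caret, here is a minimal base-R sketch of both ideas. This is my own illustration on a hypothetical 1,000-row dataset, not code from the caret workflow shown later.

n <- 1000

# k-fold: randomly assign each row to one of 10 folds

fold_id <- sample(rep(1:10, length.out = n))

table(fold_id)  # ten folds of 100 rows each

# Bootstrap: draw n row indices with replacement; rows never drawn form the
# "out-of-bag" sample

boot_idx <- sample(n, replace = TRUE)

oob_idx <- setdiff(1:n, boot_idx)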
Now I will show how each of these resampling techniques is done in R. I will only be showing the resampling techniques themselves, not how they are used for validation. I think it is helpful to see how they work on their own; once you understand them in isolation, they will make much more sense when applied to validating predictive models.

I think the best way to learn is with real data. For these examples, I will be using the 2015-2016 cycle of NHANES (the National Health and Nutrition Examination Survey).

K-fold cross-validation

R makes it pretty simple to do k-fold cross-validation via the createFolds function of the caret package. Below I present the R code for creating the folds. The script downloads the demographic NHANES data directly from the CDC website.

################################################################################
# General Information                                                          #
################################################################################

# This is an RScript for doing resampling via k-fold cross validation and
# bootstrapping with NHANES data. You can find a narrative to this script at
# https://stylizeddata.com/life-is-like-a-box-of-samples-you-never-know-what-
# youre-gonna-get/

################################################################################
# Packages                                                                     #
################################################################################

# For reading SAS XPT file from NHANES website
# haven::read_xpt

library(haven)

# For resampling
# caret::createFolds

library(caret)

################################################################################
# Load the dataset                                                             #
################################################################################

# Import NHANES demographic data

nhanesDemo <- read_xpt(url("https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT"))

################################################################################
# Resampling                                                                   #
################################################################################

# Because resampling is random, we need to set a seed so we can reproduce our
# exact results

set.seed(2018)

#-------------------------------------------------------------------------------
# k-fold cross-validation
#-------------------------------------------------------------------------------

# k-fold cross-validation. Here we are actually creating the folds. createFolds
# attempts to balance the folds on whatever variable you pass it with "$". In
# this example, I am simply using the ID variable in NHANES, "SEQN", so the
# folds are effectively random. If I were working on an issue where it is
# critical to keep a key variable balanced across folds, say, proportion of
# race/ethnicity, I could pass that variable after the "$" instead.

# "createFolds" doesn't create 10 separate datasets. Instead, it creates a list
# of each fold and indices of which row belongs to each fold.

folds <- createFolds(nhanesDemo$SEQN, k = 10)

# Now we will bring in a few of the folds to examine

fold1 <- nhanesDemo[folds[[1]], ]

fold2 <- nhanesDemo[folds[[2]], ]

fold3 <- nhanesDemo[folds[[3]], ]

# The folds look good! See how each fold has between 1,016 and 1,019
# observations? That makes sense because createFolds randomly split the
# original data set of 10,175 rows into 10 roughly equal folds.
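
# A quick way to confirm the fold sizes without pulling each fold into the
# environment (my own addition):

sapply(folds, length)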

# Remember, each row is randomly assigned to a single fold. In other words, it
# should appear in only 1 of the 10 folds. Just for learning purposes, let's
# merge a few of the folds. There should be no matches, and that is exactly
# what happens (i.e., mergeCheck1, mergeCheck2, and mergeCheck3 each have 0
# observations).

mergeCheck1 <- merge(fold1, fold2, by = "SEQN")

mergeCheck2 <- merge(fold1, fold3, by = "SEQN")

mergeCheck3 <- merge(fold2, fold3, by = "SEQN")
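
# A faster check across all 10 folds at once (my own addition): if no row index
# appears in more than one fold, this returns FALSE

any(duplicated(unlist(folds)))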

#-------------------------------------------------------------------------------
# Bootstraps
#-------------------------------------------------------------------------------

# Creating the bootstraps is very similar. We will use the same data. This will
# create 10 bootstrap samples.

# You can see that the code below creates a list of 10 bootstrap samples. Each
# has exactly the same number of observations as the original data set, just as
# expected.

bstraps <- createResample(nhanesDemo$SEQN, times = 10)
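
# A quick confirmation (my own addition) that each bootstrap sample contains as
# many row indices as the original data set has rows:

sapply(bstraps, length)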

# Just like above, let's pull a couple of the bootstrap samples into the
# global environment so we can examine them.

strap1 <- nhanesDemo[bstraps[[1]], ]

strap2 <- nhanesDemo[bstraps[[2]], ]

# Success! You can see there are 10,175 observations and 47 variables, just like
# the original data set. One thing is quite different from the k-fold cross-
# validation though. In each bootstrap sample, there are duplicate rows. This is
# by design, as bootstrapping does sampling with replacement.
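
# The rows that were never drawn form the "out-of-bag" sample mentioned
# earlier. We can recover them with negative indexing (my own addition). On
# average, a bootstrap sample leaves out about 36.8% (i.e., 1/e) of the
# original rows.

oob1 <- nhanesDemo[-unique(bstraps[[1]]), ]

nrow(oob1) / nrow(nhanesDemo)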

That’s it! This is the long way of doing these resampling techniques. The caret package pretty much automates this process when validating predictive models, which is a huge time saver. Nonetheless, it is still good to see how the techniques work on their own.
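To give a sense of that automation, here is a minimal sketch of how caret wires resampling into model training via trainControl and train. The model itself is just a placeholder I chose for illustration: predicting age (RIDAGEYR) from gender (RIAGENDR) with a linear model.

# Let caret handle 10-fold cross-validation automatically

ctrl <- trainControl(method = "cv", number = 10)

fit <- train(RIDAGEYR ~ RIAGENDR, data = nhanesDemo, method = "lm",
             trControl = ctrl)

# For bootstrapping instead, use trainControl(method = "boot", number = 10)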

Analysis done using R and RStudio.
