Not Feeling So Great? A Random Forest Analysis of Demographics and Health

Each of us has an intuition about our own health and understands when we are feeling great, or not so great. Indeed, self-perceived health status even predicts mortality. I started to wonder how well basic demographics, such as age, gender, and income, predict self-perceived health status. We know demographics strongly correlate with many other health outcomes, but I have never seen them used to predict self-perceived health status.

In this post I use a Random Forest classifier to predict self-perceived health status using only demographic variables. Random Forests are a popular machine learning classifier because they are flexible, powerful, and relatively easy to interpret. Before I go on, I wrote a little abstract to give you a sense of the analysis.

Objective: To train a model that predicts self-perceived health status using only demographic characteristics of individuals and the households in which they reside.

Methods: I used a random forest classifier to predict a binary self-perceived health status variable (excellent/very good vs. good/fair/poor) using basic demographic variables from the 2010-2016 waves of the National Health Interview Survey (NHIS). I restricted the study population to adults aged 18-62 with household incomes less than 200% of the federal poverty level (FPL). There were 55,120 total observations: 38,585 and 16,535 in the training and testing data sets, respectively.

Results: On a test set of data, the random forest classifier had a sensitivity and specificity of 0.72 and 0.62, respectively. Kappa was 0.34 and accuracy was 0.67. The no-information rate was 0.51.

Conclusion: Using a basic set of demographic variables, I was able to train a model that predicts self-perceived health status moderately better than would be expected by chance.

I chose to use NHIS because of its large sample size and decent set of demographic variables. I limited the sample to those aged 18-62 because I wanted working-age adults. In addition, I wanted to better control for income and create a more homogeneous population, so I included only low-income households.

Working with NHIS can be tricky. For this analysis, I needed the detailed information that is included in the “sample adult” file, but I also needed information from other NHIS files. I ended up merging the household, family, person, and sample adult files together for each year and appending the years to create the 2010-2016 data set. I then randomly split the data into training (70%) and testing (30%) data sets.
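To give a sense of what that looks like, here is a minimal sketch of the merge and split for one survey year. HHX, FMX, and FPX are the standard NHIS household, family, and person identifiers, but the data frame and outcome names (household, family, person, sample_adult, health) are placeholders, not my original code:

```r
library(dplyr)
library(caret)

# Merge household -> family -> person -> sample adult for one survey year.
# HHX/FMX/FPX are the standard NHIS identifiers; the data frame names
# are placeholders for the already-imported files.
nhis <- household %>%
  inner_join(family,       by = "HHX") %>%
  inner_join(person,       by = c("HHX", "FMX")) %>%
  inner_join(sample_adult, by = c("HHX", "FMX", "FPX"))

# 70/30 split, stratified on the binary outcome (here called "health")
set.seed(2016)
in_train  <- createDataPartition(nhis$health, p = 0.7, list = FALSE)
train_set <- nhis[in_train, ]
test_set  <- nhis[-in_train, ]
```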

I wanted to use only demographic variables for this analysis for two primary reasons. First, we might have access to data sets where we know a lot about the demographics of individuals, but not much else. If we are targeting an intervention at those who are likely to be in the worst health, then predicting perceived health status could be helpful. Second, I think it is interesting to consider how much basic demographics predict health. If I only know some basic information about you, how close can I get to guessing your perceived health status, an inherently personal thing?

Here is a list of the variables used in the analysis (the first is the outcome; the rest are predictors):

  • Perceived health status (Excellent/Very Good vs. Good/Fair/Poor)
  • Family size
  • Number of kids in the household
  • Marital status
  • Household income as percent of the federal poverty level
  • Home ownership status
  • Education level
  • Race/ethnicity
  • Gender
  • Age
  • Employment status
  • Region of the US in which the participant resides
  • Quarter of the year in which the survey was conducted

For this analysis, I used the caret package in R. It is great and I highly recommend it. First, here are some of the details of the model:

  • 10-fold cross-validation, repeated 3 times
  • 500 trees
  • 4 variables chosen at random at each split
  • ROC (i.e., sensitivity and specificity) as the metric, with a classification probability threshold of 0.55
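Putting those settings together, the caret call would look roughly like the sketch below. The data frame, formula, and class labels are placeholders carried over from the earlier sketch, not the original code:

```r
library(caret)

# 10-fold CV, repeated 3 times, optimizing ROC (needs class probabilities)
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 3,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# 500 trees, with mtry fixed at 4 variables per split
rf_fit <- train(health ~ .,
                data = train_set,
                method = "rf",        # randomForest under the hood
                ntree = 500,
                tuneGrid = data.frame(mtry = 4),
                metric = "ROC",
                trControl = ctrl)
```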

Let’s take a look at how the error level varies with the number of trees:

[Figure: error rate by number of trees]
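A plot like this comes straight from the underlying randomForest object. A minimal sketch, assuming rf_fit is the hypothetical caret object from the sketch above:

```r
# plot.randomForest draws OOB and per-class error against number of trees
plot(rf_fit$finalModel, main = "Error rate by number of trees")
legend("topright", legend = colnames(rf_fit$finalModel$err.rate),
       col = 1:3, lty = 1:3)
```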

Black is the overall error rate; red and green are the error rates for excellent/very good and good/fair/poor self-perceived health status, respectively. We see that after about 250 trees, the error rate flattens (meaning we don’t need to use more than about 250 trees). We also see that the error rate for excellent/very good is much lower than for good/fair/poor. Now, let’s see how well the model does on the training data:

  • Sensitivity = 0.73
  • Specificity = 0.63

This isn’t too bad given the limited number of demographic variables we have. Let’s see how well the model does with the test data set:

  • Sensitivity = 0.73
  • Specificity = 0.62
  • Kappa = 0.34
  • Accuracy = 0.67

In the test data set, 50.1 percent of individuals actually had good/fair/poor health. As expected from the graph above, the model does better at classifying people who actually have excellent/very good health. Kappa, which takes into account both observed and expected accuracy, is 0.34, which isn’t great, but again, pretty decent given our limited set of predictors.
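For reference, the test-set evaluation with the 0.55 probability threshold would look something like this sketch (the class labels ExcellentVG and GoodFairPoor are placeholders for the two factor levels):

```r
# Predicted class probabilities on the held-out data
probs <- predict(rf_fit, newdata = test_set, type = "prob")

# Apply the 0.55 threshold to the probability of excellent/very good
pred <- factor(ifelse(probs$ExcellentVG > 0.55,
                      "ExcellentVG", "GoodFairPoor"),
               levels = levels(test_set$health))

# Sensitivity, specificity, kappa, accuracy, and the no-information rate
confusionMatrix(pred, test_set$health, positive = "ExcellentVG")
```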

The 5 most important variables that reduced classification error were, in order of importance (see the sketch after this list):

  • Age
  • FPL
  • “Not Looking” for work (could be because of a health issue, retirement, or any other reason)
  • Family size
  • Number of kids in the household
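Caret makes these easy to pull out; a minimal sketch, again assuming the hypothetical rf_fit object:

```r
# Variable importance, as computed by randomForest
varImp(rf_fit)
plot(varImp(rf_fit), top = 5)   # plot the top five
```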

Overall, I think this is pretty interesting. Self-perceived health status is a personal thing. With only a limited set of demographics, my model will correctly identify people with excellent/very good health 73 percent of the time. Conversely, it will correctly rule out people who do not have excellent/very good health 62 percent of the time. Not too shabby, and certainly better than chance.

Here are some closing thoughts:

  • I could have combined people with “good” perceived health with those with excellent/very good health. I actually did try that, and the results were much worse. It seems people who rate themselves as in “good” health are more similar to those with fair/poor health. I also tried a three-class model, with “good” as its own class, but that didn’t perform well either.
  • Not surprisingly, age and income are the two most important variables for reducing the error rate.
  • I would like to include variables about the communities in which people live (e.g., crime, urbanicity, access to healthcare), but unfortunately, NHIS doesn’t release geocoded data in its public data sets.
  • I could probably reduce my set of predictors by dropping some highly correlated predictors (see the sketch after this list).
  • I think it would be interesting to restrict the data to those who are older, say 55+. My hunch is the model would do even better.
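On the correlated-predictor point, caret’s findCorrelation is one way to flag candidates to drop. It needs a numeric correlation matrix, so the factor variables would first need dummy coding; a sketch under those assumptions:

```r
# Dummy-code the predictors (drop the intercept column)
num_preds <- model.matrix(health ~ ., data = train_set)[, -1]

# Flag columns with pairwise correlations above 0.75 as candidates to drop
high_corr <- findCorrelation(cor(num_preds), cutoff = 0.75)
colnames(num_preds)[high_corr]
```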

Analysis done using R and RStudio.
