### Browsed by Category: 2018

## Visualizing a Continuous by Continuous Interaction in Linear Regression

“Did you see if there was a difference by [insert the person’s favorite population or topic of interest]?” Resigned, I reply, “No, I haven’t, but that is a good idea.” I’ve had this exchange in every research talk I’ve given. Often we are interested in understanding how the association between a predictor and an outcome differs across levels of another variable. This can be examined via an interaction term in a regression model. An interaction term…

## Renaming Variables and Character Strings in R

I want to show you how to rename variables and character strings in R. This is a basic task but one that I do frequently when working with a new dataset. Renaming variables and character strings is useful, especially when creating graphics. For example, if I were plotting these data, I would want the variable name to show as “Coffee Roast” rather than “coffee.” If I were just doing data wrangling, I wouldn’t care as much about the variable name….

## Fuzzy Wuzzy Was a…School? Adventures in Fuzzy Matching!

“Fuzzy Wuzzy was a bear, Fuzzy Wuzzy had no hair, Fuzzy Wuzzy wasn’t very fuzzy, was he?” (Extremely Relevant Children’s Rhyme)

Let’s talk about fuzzy matching. Fuzzy matching links two or more non-identical character strings together. Ideally, when linking data sets together, there would be a unique variable that identifies each row (or rows) in each data set. We do not, however, live in an ideal world. Often, when getting data from sources or systems that are not…
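To make the idea concrete, here is a minimal sketch of fuzzy matching using only Python's standard library. It is an illustration rather than the method used in the post: the school names, the `similarity` and `best_match` helpers, and the 0.6 threshold are all hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-1 similarity ratio, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def best_match(name, candidates, threshold=0.6):
    """Return (candidate, score) for the closest candidate.

    If even the closest candidate scores below `threshold`,
    return (None, score) so the row can be reviewed by hand.
    """
    scored = [(c, similarity(name, c)) for c in candidates]
    best, score = max(scored, key=lambda pair: pair[1])
    return (best, score) if score >= threshold else (None, score)


# Hypothetical school names from two sources that share no ID variable.
source_a = ["Lincoln High School", "Washington Elementary"]
source_b = ["Washington Elem.", "Lincoln HS", "Jefferson Middle School"]

matches = {name: best_match(name, source_b) for name in source_a}
for name, (match, score) in matches.items():
    print(f"{name!r} -> {match!r} (score {score:.2f})")
```

The threshold is the main design choice: set it too low and unrelated names get linked, set it too high and genuine matches with heavy abbreviation are missed, so borderline scores usually deserve a manual check.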

## Using Python to interface with PostgreSQL

A big part of doing data analyses is simply getting the data. Sometimes this is super easy and takes minutes. Other times it can be a bit more complicated. This is especially true if data are stored in multiple data sets, as is often the case in data science projects or large-scale epidemiological studies. Luckily, the Structured Query Language (SQL) makes this task much easier and less error-prone. PostgreSQL (Postgres for short) is a powerful, open-source object-relational database system…

## Life is Like a Box of Samples, You Never Know What You’re Gonna Get

Good ol’ Forrest Gump. He is right: you never know what you’re going to get, and that is why we use resampling techniques in data science! In this post, I will review two popular resampling techniques for predictive models and give examples of how to implement them in R. But first, let’s review the basics. What does “resample” mean? Often we only have one dataset to examine and use to build prediction models. Let’s call this dataset our sample. Resampling…
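As a small illustration of resampling, the sketch below draws bootstrap resamples (sampling with replacement) from a single sample and uses them to gauge the uncertainty of the mean. The excerpt does not say which techniques the post covers, and the post itself works in R. The bootstrap, the Python standard library, the made-up data, and the `bootstrap_means` helper are all assumptions chosen here to keep the example dependency-free.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample of `sample`.

    Each resample has the same size as the original sample and is
    drawn with replacement, so some values repeat and others drop out.
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    n = len(sample)
    return [statistics.mean(rng.choices(sample, k=n)) for _ in range(n_resamples)]


# Made-up measurements standing in for "our sample".
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]

means = sorted(bootstrap_means(sample))

# Rough 95% percentile interval for the mean.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]
print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"approx. 95% bootstrap interval: [{lower:.2f}, {upper:.2f}]")
```

The spread of the resampled means stands in for the variability we would see if we could collect many new samples, which is exactly what a single dataset cannot show directly.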

## How to Use Survey Weights in R

Survey weights are common in large-scale government-funded data collections. For example, NHIS and NHANES are two large-scale surveys with survey weights that track the health and well-being of Americans. These data collections use complex, multi-stage survey sampling to ensure that results are representative of the U.S. population. Although the use of survey weights in regression analyses is sometimes contested, they are needed for simple means and proportions. The general guidance is that if analysts can control for the…

## Using a Forest Plot to Display Regression Results

How many times have you sat through a presentation and stared blankly at a table of regression results? If you have been to my presentations, it has been many, many times. I was thinking about a better and more intuitive way to present regression results that also gives a sense of uncertainty. In this post, I show how to visualize OLS regression results via a forest plot. The nice thing about forest plots is that they…

## Visualizing Body Mass Index by Percent of the Federal Poverty Level

It is often assumed that low-income populations have worse health outcomes. This is correct for many health outcomes, but for one, it isn’t as clear: Body Mass Index (BMI). BMI doesn’t show a clear association with income. Given that we are in the midst of an obesity epidemic, and BMI is the primary measurement of obesity, this is quite interesting. I wanted to take a look at some recent data to visualize the association. I created…

## Not Feeling So Great? A Random Forest Analysis of Demographics and Health

Each of us has an intuition about our own health and understands when we are feeling great, or not so great. Indeed, self-perceived health status even predicts mortality. I started to wonder how well basic demographics, such as age, gender, and income, predict self-perceived health status. We know demographics strongly correlate with lots of other health outcomes, but I have never seen them used to predict self-perceived health status. In this post I use a Random Forest classifier to predict…

## Stata to R: How to Tabulate a Categorical Variable

When working with a data set, one of the first things I do is look at the count and relative frequency of categorical variables of interest. In Stata, this is relatively straightforward with the `tab` command. In R, however, it isn’t quite as straightforward, but still possible. The equivalent code in R is: `dataset %>% group_by(var) %>% summarise(n = n()) %>% mutate(totalN = cumsum(n), percent = round(n / sum(n), 3), cumpercent = round(cumsum(n / sum(n)), 3))`…