Never tell me the odds! Except when using Logistic Regression.

When you are about to travel through an asteroid field to flee a Star Destroyer, you might not want to know the odds of survival. If you are analyzing a dataset that has a dichotomous outcome, then you absolutely want to know the odds! In this post I will review when to use a logistic regression and the intuition behind it. In a future post I will provide an applied example in R. When to Use a Logistic Regression Logistic…

District on Fire: Arson in DC from 2012-2019

I was listening to the news and they mentioned a house fire that was likely arson. I wondered how often arson occurs in DC. I felt like it has to be somewhat rare, but I really had no idea. Luckily, Open Data DC has crime data, including cases of arson in the district by Ward. Arson is defined in DC as (take a deep breath): “The malicious burning or attempt to burn any dwelling, house, barn, or stable adjoining thereto,…

Visualizing a Continuous by Continuous Interaction in Linear Regression

“Did you see if there was a difference by [insert the person’s favorite population or topic of interest]?” With a resigned response I reply, “No, I haven’t, but that is a good idea.” I’ve had this exchange in every research talk I’ve given. Often we are interested in understanding how the association between one variable and an outcome varies by another variable. This can be examined via an interaction term in a regression model. An interaction term…

Renaming Variables and Character Strings in R

I want to show you how to rename variables and character strings in R. This is a basic task but one that I do frequently when working with a new dataset. Renaming variables and character strings is useful, especially when creating graphics. For example, if I were plotting these data, I would want the variable name to show as “Coffee Roast” rather than “coffee.” If I were just doing data wrangling, I wouldn’t care as much about the variable name….

Fuzzy Wuzzy Was a…School? Adventures in Fuzzy Matching!

Fuzzy Wuzzy was a bear, / Fuzzy Wuzzy had no hair, / Fuzzy Wuzzy wasn’t very Fuzzy, / Was he? (Extremely Relevant Children’s Rhyme)

Let’s talk about fuzzy matching. Fuzzy matching links two or more non-identical character strings together. Ideally, when linking data sets together, there would be a unique variable that identifies each row (or rows) in each data set. We do not, however, live in an ideal world. Often, when getting data from sources or systems that are not…

Using Python to interface with PostgreSQL

A big part of doing data analyses is simply getting the data. Sometimes this is super easy and takes minutes. Other times it can be a bit more complicated. This is especially true if data are stored in multiple data sets, as is often the case in data science projects or large-scale epidemiological studies. Luckily, the Structured Query Language (SQL) makes this task much easier and less error-prone. PostgreSQL (Postgres for short) is a powerful, open-source object-relational database system…

Life is Like a Box of Samples, You Never Know What You’re Gonna Get

Good ol’ Forrest Gump. He is right, you never know what you’re going to get, and that is why we use resampling techniques in data science! In this post, I will review two popular resampling techniques for predictive models and give examples of how to implement them in R. But first, let’s review the basics. What does “resample” mean? Often we only have one dataset to examine and use to build prediction models. Let’s call this dataset our sample. Resampling…

How to Use Survey Weights in R

Survey weights are common in large-scale government-funded data collections. For example, NHIS and NHANES are two large-scale surveys with survey weights that track the health and well-being of Americans. These data collections use complex, multi-stage survey sampling to ensure that results are representative of the U.S. population. Although the use of survey weights is sometimes contested in regression analyses, they are needed for simple means and proportions. The general guidance is that if analysts can control for the…

Using a Forest Plot to Display Regression Results

How many times have you sat through a presentation and stared blankly at a table of regression results? If you have been to my presentations, it has been many, many times. I was thinking about a better and more intuitive way to present regression results that also gives a sense of uncertainty. In this post, I show how to visualize OLS regression results via a forest plot. The nice thing about forest plots is that they…

Visualizing Body Mass Index by Percent of the Federal Poverty Level

It is often assumed that low-income populations have worse health outcomes. This is correct for many health outcomes, but for one, it isn’t as clear: Body Mass Index (BMI). BMI doesn’t show a clear association with income. Given that we are in the midst of an obesity epidemic, and BMI is the primary measurement of obesity, this is quite interesting. I wanted to take a look at some recent data to visualize the association. I created…
