Featured
## How to Use Survey Weights in R

Survey weights are common in large-scale, government-funded data collections. For example, NHIS and NHANES are two large-scale surveys with survey weights that track the health and well-being of Americans. These data collections use complex, multi-stage sampling to ensure that results are representative of the U.S. population. Although the use of survey weights is sometimes contested in regression analyses, they are needed for simple means and proportions. The general guidance is that if analysts can control for the…

## How to do fuzzy matching in R

“Fuzzy Wuzzy was a bear, / Fuzzy Wuzzy had no hair, / Fuzzy Wuzzy wasn’t very fuzzy, / Was he?” (Extremely Relevant Children’s Rhyme)

Fuzzy matching links two or more non-identical character strings together. Ideally, when linking data sets together, there would be a unique variable that identifies each row (or rows) in each data set. We do not, however, live in an ideal world. Often, when getting data from sources or systems that are not explicitly linked, we won’t have a perfect unique…

## Stata to R: How to Tabulate a Categorical Variable

When working with a data set, one of the first things I do is look at the count and relative frequency of categorical variables of interest. In Stata, this is relatively straightforward with the tab command. In R, it isn’t quite as straightforward, but it is still possible via the dplyr package. You might also be interested in my other posts on getting started with R: How to Rename Variables in R, How to Recode Factor…

## How to Recode Factor and Character Variables in R

Recoding factor and character variable values is a common task in data analysis. Although common, it isn’t as easy as you might expect in R, especially compared to Stata and SAS. I’ve found that people often get the most frustrated with these basic tasks when learning R, so in this post my goal is to take away that frustration. Often we want to use factor variables that can be dummy coded and easily used in regression models. Other times we…

## How to Rename Variables in R

I want to show you how to rename variables in R. This is a basic task but one that I do frequently when working with a new dataset. Renaming variables is useful, especially when creating graphics. For example, if I were plotting these data, I would want the variable name to show as “Coffee Roast” rather than “coffee.” If I were just doing data wrangling, I wouldn’t care as much about the variable name. But when presenting data, I want the…

## Household Food Insecurity by Region and State

Food-insecure households experience disruptions in the quality and quantity of the household food supply due to a lack of resources. They are struggling to make ends meet. I wanted to create a visualization that shows the stark differences in rates of food insecurity over time and by region and state. Typically, national-level estimates get the spotlight because we all want one number that summarizes the issue. What I think happens, though, is that variation gets put in the background…

## Never Tell Me The Odds! Except When Using Logistic Regression

When you are about to travel through an asteroid field to flee a Star Destroyer, you might not want to know the odds of survival. If you are analyzing a dataset that has a dichotomous outcome, then you absolutely want to know the odds! In this post I will review when to use a logistic regression and the intuition behind it. In a future post I will provide an applied example in R.

### When to Use a Logistic Regression

Logistic…

## District on Fire: Arson in DC from 2012-2019

I was listening to the news when they mentioned a house fire that was likely arson. I wondered how often arson occurs in DC. I felt like it had to be somewhat rare, but I really had no idea. Luckily, Open Data DC has crime data, including cases of arson in the District by ward. Arson is defined in DC as (take a deep breath): “The malicious burning or attempt to burn any dwelling, house, barn, or stable adjoining thereto,…

## Visualizing a Continuous by Continuous Interaction in Linear Regression

“Did you see if there was a difference by [insert the person’s favorite population or topic of interest]?” With a resigned response I reply, “No, I haven’t, but that is a good idea.” I’ve had this exchange in every research talk I’ve given. Often we are interested in understanding how one group varies by another when predicting an outcome. This can be examined via an interaction term in a regression model. An interaction term…

## Using Python to interface with PostgreSQL

A big part of doing data analyses is simply getting the data. Sometimes this is super easy and takes minutes. Other times it can be a bit more complicated. This is especially true if data are stored in multiple data sets, as is often the case in data science projects or large-scale epidemiological studies. Luckily, the Structured Query Language (SQL) makes this task much easier and less error-prone. PostgreSQL (Postgres for short) is a powerful, open-source object-relational database system…