When you are about to travel through an asteroid field to flee a Star Destroyer, you might not want to know the odds of survival. If you are analyzing a dataset that has a dichotomous outcome, then you absolutely want to know the odds! In this post I will review when to use a logistic regression and the intuition behind it. In a future post I will provide an applied example in R.
When to Use a Logistic Regression
Logistic regression is used when the outcome we would like to predict is dichotomous. If we were to use regular ol’ linear regression to predict a dichotomous outcome, it would be a concern because:
- Predicted values could be less than 0 or greater than 1. It does not make sense to have a probability that is less than 0 or greater than 1, yet a linear probability model can produce these.
- The variance of the outcome is not constant across values of the predictors (that is, the homoscedasticity assumption is violated). Constant variance is not possible with a dichotomous outcome because its variance depends on the predicted probability: it equals p(1 − p), which is largest when p = 0.5 and shrinks as a greater proportion of observations are predicted to be 0 (or 1).
- It is very difficult to justify a normal distribution of errors. This is because the outcome variable is either 0 or 1 and the errors will be bunched up rather than normally distributed. This is a concern for significance testing as it relies on normally distributed errors (at least approximately).
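To make the first two concerns concrete, here is a minimal sketch in plain Python. The ten data points are made up for illustration (they are not from the survey used later in this post), and the helper names `fit_line`, `lpm_predict`, and `bernoulli_variance` are hypothetical. It fits a straight line to a 0/1 outcome and shows that the fitted "probabilities" escape the 0 to 1 range at the edges, and that the variance of a 0/1 outcome changes with its probability.

```python
# Toy data (made up for illustration): x is a predictor, y is a 0/1 outcome.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]


def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept, slope = fit_line(xs, ys)


def lpm_predict(x):
    """Predicted 'probability' from the linear probability model."""
    return intercept + slope * x


def bernoulli_variance(p):
    """Variance of a 0/1 outcome whose probability of being 1 is p."""
    return p * (1 - p)


# Predictions at the two ends of the data: one falls below 0, one above 1.
print(round(lpm_predict(0), 3), round(lpm_predict(9), 3))

# The variance is not constant: it shrinks as p moves away from 0.5.
for p in (0.5, 0.8, 0.95):
    print(p, round(bernoulli_variance(p), 4))
```

The data here are deliberately extreme (a clean split between 0s and 1s), so the out-of-range predictions are easy to see. With messier real data the problem is usually confined to the tails.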
Although these are good reasons not to use linear regression for predicting dichotomous outcomes, it is relatively common to do so with a linear probability model (especially among economists). And actually, the results from a linear probability model and a logistic regression will often be quite similar. This is especially true if the modeled probabilities are between 0.20 and 0.80. When probabilities are predicted below or above these points, the logit model is the better choice.
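One way to see why the middle range is forgiving is to compare the S-shaped logistic curve with a straight line drawn through its center. The sketch below is only an illustration under that assumption: `tangent_line` is a hypothetical stand-in for a linear model (the line that touches the logistic curve at a probability of 0.5), not a fitted linear probability model.

```python
import math


def logistic(z):
    """Convert a log-odds value z into a probability."""
    return 1 / (1 + math.exp(-z))


def tangent_line(z):
    """Straight line touching the logistic curve at z = 0, where its slope is 1/4."""
    return 0.5 + z / 4


def gap(p):
    """How far the straight line is from the true probability p."""
    z = math.log(p / (1 - p))  # log-odds that corresponds to probability p
    return tangent_line(z) - logistic(z)


for p in (0.5, 0.65, 0.8, 0.95, 0.99):
    z = math.log(p / (1 - p))
    print(p, round(tangent_line(z), 3), round(gap(p), 3))
```

In this sketch the straight line stays within about 0.05 of the curve for probabilities between 0.20 and 0.80, and then drifts away quickly, eventually producing values above 1.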
Super Short Review of Linear Regression
For many people (including me), linear regression is easy to understand. You have a scatter plot of two variables and you are simply finding a line that best describes the association between them. It looks like Figure 1.
This graph shows a household’s income as a percent of the federal poverty level (FPL) on the x-axis and the age of the respondent on the y-axis. Both of these are continuous variables. The line has a slightly positive slope, which means that as age increases so does FPL. There are a bunch of observations right at the 400% mark because this particular survey “top codes” any value over 400 at 400. This example is using real-world data from the National Survey of Children’s Health.
Part of the reason linear regression is easy to understand is that the original units remain intact when interpreting the results. Age and FPL remain exactly as they are observed and we are just looking at how they are associated (i.e., correlated). For example, a one-year increase in age is associated with a 5-point increase in FPL. Pretty easy to understand, right?
With a dichotomous variable, we only observe two states: 0 or 1. Think of a y-axis for this variable: it only has two points of interest, 0 and 1. Thinking of the graph above (i.e., Figure 1), if we plotted a dichotomous variable (less than high school diploma vs. high school diploma or more), it looks like Figure 2. You can see how this differs from the scatter plot above. Although it is technically possible to create a linear regression line, it doesn’t make as much sense.
Intuition Behind Logistic Regression
Logistic regression gets a bit more complicated with the introduction of odds and log-odds, but at its heart, it is still easy to understand once you realize that logistic regression is just trying really hard to be a linear regression.
The main issue preventing a logistic regression from being similar in interpretation to a linear regression is the use of odds. An odds is the probability that the event will occur divided by the probability that the event will not occur. The nice thing about using odds is that they are naturally set up for use with dichotomous variables. The bad thing is that they are harder to interpret and are not symmetrical.
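The definition of an odds is easy to check with a few lines of Python. This is a minimal sketch, and the function name `odds` is just an illustrative choice. The loop also hints at the lack of symmetry: every probability below 0.5 is squeezed into odds between 0 and 1, while probabilities above 0.5 spread out from 1 toward infinity.

```python
def odds(p):
    """Odds of an event: probability it occurs divided by probability it does not."""
    if not 0 <= p < 1:
        raise ValueError("p must be at least 0 and strictly less than 1")
    return p / (1 - p)


# Probabilities equally far from 0.5 give odds that are NOT equally far from 1.
for p in (0.1, 0.2, 0.5, 0.8, 0.9):
    print(p, round(odds(p), 3))
```

For example, a probability of 0.8 gives an odds of about 4, while a probability of 0.2 gives an odds of 0.25. Both are the same distance from an even chance in probability terms, but 4 and 0.25 sit at very different distances from an odds of 1.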
This is where the log-odds comes into play. When you take the log of an odds, you are helping to ensure that it is symmetrical, linear, and normally distributed.
- Symmetrical means that the distance from 0 is the same for negative and positive numbers. For example, −5 is five units away from 0 and so is +5. If we did not take the log of the odds, this would not be the case, which creates problems with interpretation.
- Linear means that when you take the log of an odds, it converts the odds into a quantity that can be modeled as a straight line and that can range from negative to positive infinity, just like in regular linear regression.
- Normally distributed means that estimates on the log-odds scale will be approximately normally distributed in a large enough sample. This is important for calculating p-values and confidence intervals.
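The symmetry and range points can be checked numerically. The sketch below (with `log_odds` as an illustrative function name) prints the odds and the log-odds side by side for a few probabilities. It only demonstrates symmetry and range, not the normality point, which is about estimates from a sample.

```python
import math


def log_odds(p):
    """Natural log of the odds, also known as the logit of p."""
    return math.log(p / (1 - p))


# Columns: probability, odds, log-odds.
for p in (0.01, 0.2, 0.5, 0.8, 0.99):
    print(p, round(p / (1 - p), 3), round(log_odds(p), 3))
```

Probabilities that mirror each other around 0.5 (such as 0.2 and 0.8) have log-odds of equal size and opposite sign, an even chance maps to 0, and the values keep growing in both directions as the probability approaches 0 or 1.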
Once we have values that are symmetrical, linear, and normally distributed, we are in good shape. This is all thanks to the logit function. The point to remember is that the logit function serves as a “key” that allows us to more easily understand our model. It is doing all the hard work.
But you might wonder, “Okay, we have the log of the odds, which is symmetrical, linear, and can be normally distributed, but how the heck do you interpret the log of an odds?” To address this, we have two options:
- Convert the log of the odds to an odds ratio.
  - The pro of using odds ratios is that they tell you how the odds of your dependent variable occurring change for each one-unit increase in an independent variable, adjusting for all other covariates in your model.
    - This is nice because you can interpret your focal association without too much concern about the values of the other covariates in your model.
  - The con is that odds ratios aren’t intuitive.
- Convert the log of the odds to a predicted probability (see Figure 3 below).
  - The pro of using probabilities is that they are easy to interpret.
  - The con is that the probability of your dependent variable changes based on the values of your independent variable(s) of interest and covariates.
    - To overcome this, analysts will typically set covariates at their mean for continuous variables and at the most common category for categorical variables.
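Both conversions are one line of arithmetic each. The sketch below uses made-up coefficients on the log-odds scale (`INTERCEPT`, `B_AGE`, and `B_COVARIATE` are hypothetical values, not estimates from any dataset) to show the trade-off described in the list above: the odds ratio for the focal predictor is the same no matter what the other covariate is, while the change in predicted probability depends on where the covariate is set.

```python
import math

# Hypothetical coefficients on the log-odds scale (NOT estimated from real data).
INTERCEPT = -2.0
B_AGE = 0.05       # focal predictor, per one-year increase
B_COVARIATE = 2.5  # a second, binary covariate


def inverse_logit(log_odds):
    """Convert a log-odds value into a predicted probability."""
    return 1 / (1 + math.exp(-log_odds))


def predicted_probability(age, covariate):
    """Predicted probability from the hypothetical logistic model."""
    return inverse_logit(INTERCEPT + B_AGE * age + B_COVARIATE * covariate)


def compare(covariate, age=30):
    """Return (change in probability, odds ratio) for a one-year increase in age."""
    p_low = predicted_probability(age, covariate)
    p_high = predicted_probability(age + 1, covariate)
    odds_low = p_low / (1 - p_low)
    odds_high = p_high / (1 - p_high)
    return p_high - p_low, odds_high / odds_low


# Option 1: exponentiate the coefficient to get the odds ratio.
print("odds ratio for age:", round(math.exp(B_AGE), 4))

# Option 2: predicted probabilities, which depend on the covariate value.
for covariate in (0, 1):
    change, ratio = compare(covariate)
    print(f"covariate={covariate}: probability change={change:.4f}, odds ratio={ratio:.4f}")
```

In this sketch the odds ratio matches `math.exp(B_AGE)` at both covariate values, but the one-year change in predicted probability is noticeably smaller when the covariate pushes the probability toward 1. That is why the covariate values have to be fixed somewhere before predicted probabilities can be reported.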
Both are good options, and depending on your field, one may be more popular than the other. I actually like presenting both if you have space and it is appropriate for your audience.
In my next post, I will give examples of the use of odds ratios and predicted probabilities. Some of you might have noticed that Figures 2 and 3 look pretty similar even though Figure 2 is from a linear model and Figure 3 is from a logistic model. There are only slight differences in the tails. We will discuss this next too!