If you’ve been on our site, you’ve likely seen our Democratic primary predictions. This post explains our models and how we come up with our Democratic forecasts.
Our first model predicts state winners. It uses a logistic regression and simulates each contest 1,000 times.
The state winner model uses the following independent variables: whether the contest is a caucus; the state’s percent white, black, and Hispanic; state GDP per capita; and the state’s average age. The dependent variable is a Clinton victory, coded “1”. Here are its predictions for the primary season, along with each primary or caucus’s actual winner:
|State||Hillary Clinton Win Probability||Bernie Sanders Win Probability||Actual Winner|
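To make the state winner approach concrete, here is a minimal sketch of a logistic regression followed by 1,000 simulated runs of one contest. It is not our production code: the training rows are made up, only two of the predictors are used (a caucus dummy and a standardized percent black), and the helper names (`fit_logistic`, `win_probability`, `simulate_contest`) are hypothetical. The simulation step here simply draws each run as a coin flip weighted by the fitted probability, which is one plausible reading of "simulating a contest", so the simulated share lands close to that probability.

```python
import math
import random

# Made-up training rows: [is_caucus, standardized percent black].
X = [[1, -1.0], [1, -0.5], [0, 1.2], [0, 0.8],
     [1, -0.8], [0, 0.3], [0, -0.2], [1, 0.1]]
y = [0, 0, 1, 1, 0, 1, 1, 0]  # 1 = Clinton win, 0 = Sanders win


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=2000, l2=0.01):
    """Fit by batch gradient descent; returns weights with the intercept first."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        for j in range(k + 1):
            # A light L2 penalty (not on the intercept) keeps weights finite
            # on tiny, cleanly separated samples like this one.
            penalty = l2 * w[j] if j > 0 else 0.0
            w[j] -= lr * (grad[j] / n + penalty)
    return w


def win_probability(w, x):
    """Modeled probability of a Clinton win for one state's feature row."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def simulate_contest(p, n_sims=1000, seed=0):
    """Share of simulated runs Clinton wins, each run a Bernoulli(p) draw."""
    rng = random.Random(seed)
    wins = sum(1 for _ in range(n_sims) if rng.random() < p)
    return wins / n_sims


weights = fit_logistic(X, y)
p_clinton = win_probability(weights, [0, 1.0])  # hypothetical primary state
sim_share = simulate_contest(p_clinton)
print(f"Clinton win probability: {p_clinton:.3f}")
print(f"Share of 1,000 simulations won: {sim_share:.3f}")
```

The Sanders win probability in a two-candidate table like the one above is just one minus the Clinton figure.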
For Democratic primary vote share predictions, we use a dummy variable for caucus; state percent African American, white, and Hispanic; state GDP per capita; the percent of the state population between 18 and 25; a dummy variable for whether Clinton is predicted to win (from the state winner model above); and a dummy variable for the South. The dependent variable is Clinton’s actual vote share. This approach explains around 93 percent of the variation in vote share. Its predictions and the actual outcomes are shown below:
|State||Predicted Clinton Vote Share||Actual Clinton Vote Share|
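For readers who want to see what "explains around 93 percent of the variation" means in practice, here is a rough sketch of a vote share regression and its R². It assumes ordinary least squares, which is the usual way to get that statistic. The rows are made up, only three of the predictors are included (caucus dummy, percent African American, South dummy), and `solve`, `fit_ols`, `predict`, and `r_squared` are hypothetical helper names, so the numbers it prints say nothing about the real model.

```python
# Made-up rows: [is_caucus, percent African American, is_south].
X = [[0, 26.0, 1], [1, 3.0, 0], [0, 7.0, 0], [1, 5.0, 0],
     [0, 30.0, 1], [0, 13.0, 0], [1, 1.0, 0], [0, 15.0, 1]]
y = [73.0, 38.0, 50.0, 33.0, 71.0, 55.0, 30.0, 64.0]  # Clinton vote share (%)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_ols(X, y):
    """Least-squares coefficients (intercept first) via the normal equations."""
    Xd = [[1.0] + list(row) for row in X]
    k = len(Xd[0])
    XtX = [[sum(r[i] * r[j] for r in Xd) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(Xd, y)) for i in range(k)]
    return solve(XtX, Xty)


def predict(beta, x):
    """Predicted vote share for one state's feature row."""
    return beta[0] + sum(b * xi for b, xi in zip(beta[1:], x))


def r_squared(beta, X, y):
    """Share of the variation in y explained by the fitted model."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - predict(beta, xi)) ** 2 for xi, yi in zip(X, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


beta = fit_ols(X, y)
r2 = r_squared(beta, X, y)
print(f"R^2 on the toy data: {r2:.3f}")
print(f"Predicted share, hypothetical Southern primary: {predict(beta, [0, 20.0, 1]):.1f}")
```

An R² of 0.93 would mean the residual sum of squares is about 7 percent of the total variation in Clinton’s vote share across contests.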
Each contest yields new data points that are then fed back into the model. The two tables above show our initial predictions, not the outputs produced after the model has incorporated new results. When backtested, those updated outputs are more accurate than our first estimates, which is no surprise. Our Democratic primary predictions become more accurate as n (the number of completed primaries and caucuses) increases, so we can forecast the remaining contests with more confidence.
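One simple way to backtest this kind of updating is an expanding window: refit on every contest completed so far, predict the next one, and record the miss. The sketch below does that with a one-variable regression on made-up data listed in a hypothetical chronological order. `fit_line` and `expanding_window_errors` are illustrative names, and the single predictor stands in for the fuller variable list described earlier.

```python
# Made-up contests in chronological order:
# percent African American and Clinton vote share (%).
pct_black = [3.0, 26.0, 7.0, 30.0, 5.0, 13.0, 1.0, 15.0]
clinton_share = [38.0, 73.0, 50.0, 71.0, 33.0, 55.0, 30.0, 64.0]


def fit_line(xs, ys):
    """Simple least-squares line; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope


def expanding_window_errors(xs, ys, min_train=3):
    """Refit on contests 0..i-1, predict contest i, return absolute errors."""
    errors = []
    for i in range(min_train, len(xs)):
        intercept, slope = fit_line(xs[:i], ys[:i])
        errors.append(abs(ys[i] - (intercept + slope * xs[i])))
    return errors


errors = expanding_window_errors(pct_black, clinton_share)
for n_done, err in enumerate(errors, start=3):
    print(f"after {n_done} contests: next-contest error = {err:.1f} points")
```

Comparing the early errors with the later ones is a quick check on whether accuracy actually improves as n grows.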
Feel free to comment with questions or suggestions – we’re always looking to improve our model!