My Digital Vote projects

Roberto Cerina and Ray Duch
Nuffield College, University of Oxford

Can digital trace replace random-digit dialing? As part of the My Digital Vote project, we follow the digital trace of Texas Facebook users to predict the outcome of Texas congressional district elections. Our predicted district-level vote shares, which rely on zero polling data input, are essentially the same as those from aggregated conventional polling data. There are 36 congressional districts in Texas. Thirty-three of them are not very competitive, and in these easy cases our predictions match the prevailing polling wisdom. For the other three? We agree with the FiveThirtyEight forecast about District 32: they take it to be a closer race, but still call it for the Republican. FiveThirtyEight rates District 7 a complete toss-up, while we call it for the Republican. District 23 is a toss-up for us, while they call it for the Republican. The congressional district map of Texas summarizes our results.

Figure 1: Digital Vote predicted GOP Percent Vote Share.

In this Texas experiment a small sample of Facebook profiles produced high-frequency estimates of district-level vote share of comparable quality to state-of-the-art survey-based models.

This Texas experiment was initiated in May 2018 to ask whether behavior on social media can be used to forecast election results accurately. Recent high-profile polling failures motivated the project [1,2]. Could low-cost monitoring of social media increase the frequency with which partisan preferences are observed, widen the geographical scope of these observations, and reduce potential bias by leveraging revealed rather than stated preferences?

Some technical details

To get the forecasts we treat public, explicit measures of candidate support on Facebook, such as likes, loves or explicitly positive comments, as proxies for voting intention. We then match the users who made these public shows of candidate support to a record in the public Texas Voter Registration file. The resulting data are treated as a survey. The district-level vote shares are then calculated using standard Multilevel Regression and Post-stratification (MRP) techniques [3,4]. The main difference is that we do not limit ourselves to parametric models; rather, we use a Random Forest based Probability Machine [5]. We refer to this new approach simply as MLP: Machine Learning and Post-stratification.
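The post-stratification step described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the cell labels, district keys and numbers are hypothetical, and the cell probabilities are placeholders for whatever the fitted model (here, the random forest probability machine) would output.

```python
from collections import defaultdict

def poststratify(cell_probs, cell_counts):
    """Return the count-weighted mean probability for each district.

    cell_probs:  {(district, cell): P(vote Republican)} from any fitted model
    cell_counts: {(district, cell): registered voters in that cell}
    """
    weighted = defaultdict(float)
    totals = defaultdict(float)
    for key, n in cell_counts.items():
        district = key[0]
        weighted[district] += cell_probs[key] * n
        totals[district] += n
    return {d: weighted[d] / totals[d] for d in totals}

# Hypothetical toy inputs: two districts, two demographic cells each.
cell_probs = {
    ("TX-07", "female_18_34"): 0.40,
    ("TX-07", "male_65_plus"): 0.60,
    ("TX-23", "female_18_34"): 0.45,
    ("TX-23", "male_65_plus"): 0.55,
}
cell_counts = {
    ("TX-07", "female_18_34"): 1000,
    ("TX-07", "male_65_plus"): 3000,
    ("TX-23", "female_18_34"): 2000,
    ("TX-23", "male_65_plus"): 2000,
}
shares = poststratify(cell_probs, cell_counts)
print(shares)
```

The point of the weighting is that a cell over-represented among Facebook users contributes to the district estimate only in proportion to its size in the voter file, not its size in the sample.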


Since we don’t have the election results yet (the election is on Tuesday, November 6th), we benchmark against the forecast from FiveThirtyEight. The FiveThirtyEight classic model is based on the latest public opinion polls, long-term trends in aggregate voting behavior and polling errors, and correlations across similar districts over the whole nation. It turns out that observing the behavior of about 6,000 registered voters active on Facebook can provide us with as accurate a measure of support as leveraging thousands of opinion polls, each with a sample size of ~1000 or more.


Figure 2 compares the estimates of Republican two-party vote share from the Digital Vote project against the FiveThirtyEight forecast, for the week before election day. Our Digital Vote forecasts, particularly for the least competitive races, tend to sit closer to 0.5 than FiveThirtyEight's. This reflects a noisier signal from the social media sample. Note that the uncertainty intervals around our predictions are narrower than FiveThirtyEight's, reflecting the very different estimation methods employed here.

Figure 2: FB-driven estimates of support for the Republican party v. the forecast from FiveThirtyEight. Snapshot for the week before election day (October 19th to November 4th). The numbers over the intervals indicate the congressional district.

As mentioned earlier, only three Texas congressional districts are expected to be in contention. Our estimates agree with the FiveThirtyEight forecast in District 32 (close but Republican). We disagree in District 7, which we call for the Republicans, and in District 23, which we call a toss-up.


[1]: Sturgis, P., et al. (2016). Report of the Inquiry into the 2015 British general election opinion polls.

[2]: Kennedy, C., Blumenthal, M., Clement, S., Clinton, J. D., Durand, C., Franklin, C., … & Saad, L. (2018). An evaluation of the 2016 election polls in the United States. Public Opinion Quarterly, 82(1), 1–33.

[3]: Lauderdale, B. E., Bailey, D., Blumenau, Y. J., & Rivers, D. (2017). Model-Based Pre-Election Polling for National and Sub-National Outcomes in the US and UK. Working paper.

[4]: Wang, W., Rothschild, D., Goel, S., & Gelman, A. (2015). Forecasting elections with non-representative polls. International Journal of Forecasting, 31(3), 980–991.

[5]: Malley, J. D., Kruppa, J., Dasgupta, A., Malley, K. G., & Ziegler, A. (2012). Probability machines. Methods of Information in Medicine, 51(01), 74–81.

Roberto Cerina and Ray Duch
Nuffield College, University of Oxford

The 2019 India Lok Sabha elections are expected to be a victory for the National Democratic Alliance (NDA, led by Modi’s BJP). CESS Vote India predicts that Prime Minister Modi’s NDA coalition will retain its majority in the Lok Sabha, with a simulated average of 304 seats. The United Progressive Alliance (UPA, led by Gandhi’s Congress) is predicted to win 119 seats, and all other alliances together 120 seats. These predictions are based on a total of 12,500 surveyed voting intentions. Our forecasts are the average of the estimates obtained from our online surveys (conducted with the CESS India subject pool and MTurk workers in India) and from the traditional published polls in India.

Figure 1: Histogram of predicted seats by party for 500 simulated elections. The faded histograms in the background represent the two components of the forecast, namely our online surveys and publicly available traditional polls. The point estimate is the average of the two sources. The dark green line represents the 272-seat threshold: anything above that gives a majority in the Lok Sabha.
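As a rough illustration of how two sets of simulated elections can be averaged and summarized against the majority threshold, consider the sketch below. It rests on assumptions: the simulated draws are hypothetical normal draws rather than the project's simulations, and pairing the two sources draw by draw with equal weight is only one way to read "the average of the two sources".

```python
import random
from statistics import mean

MAJORITY = 272  # majority threshold in the Lok Sabha, as stated in the text

def combine(sim_online, sim_polls):
    """Average two equally long lists of simulated seat counts, draw by draw."""
    return [(a + b) / 2 for a, b in zip(sim_online, sim_polls)]

def summarize(sims, threshold=MAJORITY):
    """Return (mean seats, share of simulations at or above the threshold)."""
    return mean(sims), sum(s >= threshold for s in sims) / len(sims)

# Hypothetical simulated seat counts for one alliance from each source.
rng = random.Random(0)
sim_online = [rng.gauss(310, 20) for _ in range(500)]
sim_polls = [rng.gauss(298, 15) for _ in range(500)]

combined = combine(sim_online, sim_polls)
point, p_majority = summarize(combined)
print(round(point), p_majority)
```

Reporting the share of simulations that clear the threshold alongside the mean conveys how secure a predicted majority is, which a single seat number cannot.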

India Vote is an election forecasting project that proposes novel strategies for generating vote and seat predictions employing convenience online subject pools, specifically the CESS India online subject pool. CESS India and Optimus Consulting have jointly funded the India Vote project. CESS India is a collaboration between the Nuffield Centre for Experimental Social Sciences and FLAME University, Pune, India. Optimus Consulting is a Washington, D.C.-based consulting firm.

The novelty of the CESS India Vote forecasting project is three-fold. First, we incorporate multiple and quite diverse online convenience samples as part of the estimation strategy: we complement information from the CESS India subjects with regular surveys of MTurk workers in India. Different convenience samples add complementary information to forecasts of this nature; the broader issue is how to identify the samples that complement each other best. We propose estimation strategies that integrate quite disparate subject pools. Figure 2 summarizes the differences between our population data and the data obtained from our CESS and MTurk samples.

Figure 2: Differences between our population frame and the re-sampled, non-probability samples, in percentages. The shares from which each pop − sample difference is calculated sum to one within each category (i.e. for gender, % male and % female sum to 1; similarly for income and education categories, etc.). Above the dark green line we are under-sampling; underneath it we are over-sampling.
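The pop − sample differences plotted in Figure 2 amount to subtracting sample shares from population shares within each category. A minimal sketch, with hypothetical gender shares and a hypothetical four-person sample:

```python
from collections import Counter

def shares(values):
    """Turn a list of category labels into shares that sum to one."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: c / total for k, c in counts.items()}

def pop_minus_sample(pop_shares, sample_values):
    """Population share minus sample share for every population category.

    Positive values mean the sample under-represents the category;
    negative values mean it over-represents it.
    """
    sample_shares = shares(sample_values)
    return {k: pop_shares[k] - sample_shares.get(k, 0.0) for k in pop_shares}

# Hypothetical numbers for illustration only.
pop_gender = {"male": 0.5, "female": 0.5}
sample_gender = ["male", "male", "male", "female"]

diff = pop_minus_sample(pop_gender, sample_gender)
print(diff)
```

In this toy case men are over-sampled (negative difference) and women under-sampled (positive difference), matching the sign convention of the figure.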

Second, we organize the Indian nation into a stratification frame that has 6611 cells; these are essentially defined by the demographic categories in our forecasting model, which include income, religion, caste, gender, age and education. We generate estimated vote probabilities for all of these cells (or, if you wish, demographic categories) by feeding survey data collected from our different online subject pools to a random forest. A novelty here is estimating the vote probabilities separately for the CESS Online and MTurk subjects and then combining the estimates. Figure 3 presents the expected vote share by alliance for the online sample.
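To make the idea of a stratification frame concrete, the sketch below builds every combination of a few demographic categories and then combines per-cell probabilities from two sources. The category levels are hypothetical and far coarser than the 6611-cell frame described above, and the equal-weight average is an assumption: the text does not say how the CESS and MTurk estimates are combined.

```python
from itertools import product

# Hypothetical category levels, for illustration only.
LEVELS = {
    "gender": ["female", "male"],
    "age": ["18-34", "35-54", "55+"],
    "education": ["primary", "secondary", "tertiary"],
}

def build_frame(levels):
    """Return one dict per cell: every combination of category levels."""
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]

def combine_sources(p_cess, p_mturk, w_cess=0.5):
    """Weighted average of two per-cell probability lists (same cell order)."""
    return [w_cess * a + (1 - w_cess) * b for a, b in zip(p_cess, p_mturk)]

frame = build_frame(LEVELS)
print(len(frame))  # 2 * 3 * 3 cells
print(frame[0])
```

Fitting one model per subject pool and combining afterwards keeps each pool's selection pattern separate, so the weight given to each source can be changed without refitting.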

Figure 3: Expected vote share by alliance: the NDA is in red/orange; the UPA in blue/skyblue; the others in black/grey. From left to right: a) 500 simulations of the expected vote share over the 13-week monitoring period; b) the breakdown of expected vote share by source (CESS subjects and MTurk workers); c) 500 simulations of the national swing since the 2014 election.

Third, we forecast the national vote share for the major parties by applying the cell probabilities of voting for each party to population estimates for those cells obtained from the India Human Development Survey. This allows us to estimate a national swing for each of the major parties. The novelty here is that we use historical data on the relationship between national and state swings in vote shares to estimate the seat shares for the state parties. The state-level multipliers of the national swing are presented in Figure 4.
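One simple way to picture a state-level multiplier of the national swing is as the slope of a regression through the origin of past state swings on past national swings. The sketch below is an assumption about the functional form, not the project's estimator, and the historical swings are hypothetical.

```python
def fit_multiplier(national_swings, state_swings):
    """Least-squares slope through the origin: state swing ~ m * national swing."""
    num = sum(n * s for n, s in zip(national_swings, state_swings))
    den = sum(n * n for n in national_swings)
    return num / den

def project_state_share(prev_share, multiplier, national_swing):
    """Previous state vote share plus the scaled national swing, kept in [0, 1]."""
    share = prev_share + multiplier * national_swing
    return min(1.0, max(0.0, share))

# Hypothetical historical swings (as vote-share proportions) for one state.
hist_national = [0.02, -0.04, 0.06]
hist_state = [0.03, -0.06, 0.09]

m = fit_multiplier(hist_national, hist_state)
projected = project_state_share(0.40, m, 0.02)
print(round(m, 3), round(projected, 3))
```

A multiplier above one means the state historically amplifies national movements; below one, it dampens them. How projected state vote shares are then converted into seats is not specified in the text and is left out here.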

Figure 4: Graphical representation of the state level multiplier of the national swing with uncertainty bounds (2 standard deviations). The dotted green line represents the National Swing.

The counterpart to our seat estimation by online surveys is a forecast based on traditional opinion polling. We analyse 42 polls from the 2009, 2014 and 2019 elections, attempt to quantify the polling house error, and remove it from a moving average of the polls. The results of this effort are then averaged with our online surveys. Figure 5 shows the expected number of seats according to our traditional opinion polls model over the 2019 campaign. The symbols on the plot represent specific polling houses.
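The house-error correction can be sketched as follows. This assumes the house error is the average gap between a house's seat predictions and the actual results in past elections, and that the moving average is a simple trailing window; the text does not specify either, and the house names and numbers below are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def house_effects(past_polls):
    """past_polls: list of (house, predicted_seats, actual_seats).

    Returns the mean signed error (predicted - actual) per polling house.
    """
    errors = defaultdict(list)
    for house, predicted, actual in past_polls:
        errors[house].append(predicted - actual)
    return {h: mean(e) for h, e in errors.items()}

def adjusted_moving_average(polls, effects, window=3):
    """polls: chronologically ordered list of (house, predicted_seats).

    Subtracts each house's estimated error, then takes a trailing average
    over the last `window` polls (fewer at the start of the series).
    """
    adjusted = [seats - effects.get(house, 0.0) for house, seats in polls]
    out = []
    for i in range(len(adjusted)):
        lo = max(0, i - window + 1)
        out.append(mean(adjusted[lo:i + 1]))
    return out

# Hypothetical past polls and current campaign polls.
past = [("House A", 250, 240), ("House A", 310, 300), ("House B", 270, 280)]
current = [("House A", 310), ("House B", 290), ("House A", 320)]

effects = house_effects(past)
trend = adjusted_moving_average(current, effects, window=2)
print(effects, trend)
```

Houses with no polling history get no correction in this sketch, which is a design choice rather than something stated in the text.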

Figure 5: Expected number of seats by alliance for the 2019 Lok Sabha election, net of house bias.