The lesson of Trump’s victory is not that data is dead. The lesson is that data is flawed. It has always been flawed—and always will be.
Before Donald Trump won the presidency on Tuesday night, everyone from Nate Silver to The New York Times to CNN predicted a Trump loss—and by sizable margins. “The tools that we would normally use to help us assess what happened failed,” Trump campaign reporter Maggie Haberman said in the Times. As Haberman explained, this happened on both sides of the political divide.
Appearing on MSNBC, Republican strategist Mike Murphy told America that his crystal ball had shattered. “Tonight, data died,” he said.
But this wasn’t so much a failure of the data as it was a failure of the people using the data. It’s a failure born of the willingness to believe too blindly in data, and not to see it for how flawed it really is. “This is a case study in limits of data science and statistics,” says Anthony Goldbloom, a data scientist who once worked for Australia’s Department of Treasury and now runs Kaggle, a company dedicated to grooming data scientists. “Statistics and data science gets more credit than it deserves when it’s correct, and more blame than it deserves when it’s incorrect.”
With presidential elections, these limits are myriad. The biggest problem is that so little data exists. The United States only elects a president once every four years, and that’s enough time for the world to change significantly. In the process, data models can easily lose their way. In the months before the election, pollsters can ask people about their intentions, but this is harder than it ever was as Americans move away from old-fashioned landline phones towards cell phones, where laws limit such calls. “We sometimes fool ourselves into thinking we have a lot of data,” says Dan Zigmond, who helps oversee data science at Facebook and previously handled data science for YouTube and Google Maps. “But the truth is that there’s just not a lot to build on. There are very small sample sizes, and in some ways, each of these elections is unique.”
In the wake of Trump’s victory, Investor’s Business Daily is making the media rounds boasting that it correctly predicted the election’s outcome. Part of the trick, says IBD spokesperson Terry Jones, is that the poll makes more calls to smartphones than landlines, and that the people it calls represent the wide range of people in the country. “We have a representative sample of even the types of phones used,” he says. But this poll was the exception that proved the rule: the polling on the 2016 presidential election was flawed. In the years to come, the electorate—and the technology used by the electorate—will continue to change, ensuring future polls will have to evolve to keep up.
No Data Is Perfect
As the world makes the Internet its primary means of communication, that transition brings with it the promise of even more data, the so-called “Big Data” of Silicon Valley marketing-speak. In the run-up to the election, a company called Networked Insights mined data on Twitter and other social networks in an effort to better predict which way the electoral winds would blow. It had some success: the company predicted a much tighter race than more traditional poll aggregators did, and other companies and researchers are moving in similar directions. But this data is also flawed. With a poll, you’re asking direct questions of real people. On the Internet, a company like Networked Insights must not only find accurate ways of determining opinion and intent from a sea of online chatter, but also build a good way of separating the fake chatter from the real, the bots from the humans. “As a data scientist, I always think more data is better. But we really don’t know how to interpret this data,” Zigmond says. “It’s hard to figure out how all these variables are related.”
Meanwhile, at least among the giants of the Internet, the even bigger promise is that artificial intelligence will produce better predictions than ever before. But this too still depends on data that can never really provide a perfect picture on which to base a prediction. A deep neural network can’t forecast an election unless you give it the data to make the forecast, and the way things work now, this data must be carefully labeled by humans for the machines to understand what they’re ingesting. Yes, AI systems have gotten very good at recognizing faces and objects in photos because people have uploaded so many millions of photos to places like Google and Facebook already, photos whose contents have been labeled such that neural networks can learn to “see” what they depict. The same kind of clean, organized data on presidential elections doesn’t exist to train neural nets.
People will always say they’ve cracked the problem. IBD is looking mighty good this week. Meanwhile, as Donald Trump edged towards victory Tuesday, his top data guru, Matt Oczkowski, told WIRED the campaign had known for weeks that a win was possible. “Our models predicted most of these states correctly,” he said. But let’s look at these two with as much skepticism as we’re now giving to Silver and the Times.
Naturally, Oczkowski shot down the “data is dead” meme. “Data’s alive and kicking,” he said. “It’s just how you use it and how you buck normal political trends to understand your data.” In a way, he’s right. But this is also part of the problem. We don’t know what Oczkowski’s methods were. And in data science, people tend to pick data that supports their point of view. This is a problem whether you’re using basic statistical analysis or neural networks.
“The way that bias creeps into any analysis is the way the data is selected,” Goldbloom says.
In other words, the data used to predict the outcome of one of the most important events in recent history was flawed. And so are we.