Big Data: Big Disillusion?



Hyped, condemned, reprieved. On the tumultuous history of big data and four misconceptions about it.

The headlines blare it loud and clear: “Big Data, Big Problems” was the headline run several months ago by The Wall Street Journal, a renowned business daily not known for being hostile to innovation. Shortly thereafter, the Swiss business magazine Bilanz led with an even blunter cover story: “The Big Data Lie.” The journalists’ skepticism is backed by figures from the market research firm CB Insights, whose widely read newsletter analyzed how often startups used the terms “big data” and “artificial intelligence” in calls with investors. It found that “artificial intelligence” dethroned “big data” as the dominant term in mid-2016 and is now mentioned three times as often.

The euphoria over big data has definitely faded since the days around a decade ago when the magazine Wired proclaimed that this technology would render conventional research superfluous. Wired, the bible of Silicon Valley geeks, wrote that theories and hypotheses would henceforth no longer be necessary; computers would now discover correlations entirely on their own. In 2011, McKinsey & Company predicted that big data would enable the public sector in Europe to save EUR 250 billion annually, an amount greater than Greece’s gross domestic product. Five years later, the consultancy reviewed the state of play and acknowledged that at most 10 to 20% of that cost-saving potential had been realized.

So what’s the truth? Is big data making the world a better, more efficient, and more knowledgeable place? Or is it the new technology and not conventional research, contrary to the prophecy, that has turned out to be superfluous?

Misconception #1: Big Data Means Oodles of Data
The difficulties start with the definition: The description of what big data really is remains awfully vague. Dan Ariely, an acclaimed psychology professor who specializes in the study of irrational behavior, drew parallels between big data and the intimate love lives of adolescents in a widely publicized tweet: “Big data is like teenage sex: Everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.”

The common definition of “big data” encompasses four terms that all start with the letter V: “Volume” means that enormous quantities of data are involved. “Velocity” refers to the speed at which data accrues and is processed. “Variety” expresses the fact that the data can be of very different natures, ranging from simple tweets to complex traffic data. And “veracity” means that the data must be trustworthy.

Misconception #2: Computers Are Intelligent
Computers are perfectly suited to performing a vast array of different tasks, but are simply too underdeveloped to handle other, more sophisticated jobs. Computers are still “remarkably dumb,” says John Giannandrea, the former head of artificial intelligence at Google and the new AI chief at Apple. He compares computers’ current stage of development to that of a “four-year-old child.” Users of smart speakers from Apple, Amazon, or Google know just what he means. A study revealed that smart speakers understand almost every question asked of them, but answer correctly only around 75% of the time.

 

I’m surprised how little computers can do.

Urs Hölzle, Google


Urs Hölzle, arguably the most prominent Swiss citizen in Silicon Valley, who joined Google as the company’s eighth employee, likewise said in an interview that he was “surprised how little computers can do” and cited an example. With enormous effort, a computer can be taught to recognize a zebra in a photo, he explains, but “it’s possible to modify a small number of pixels so that the computer will think it’s a race car.”

Misconception #3: Scrap Iron Can Turn Into Gold
“Big data initially aroused completely false expectations,” says Gregor Kalberer, Head Innovation Design & Technology at SIX. “The prevailing belief was that you could feed a supercomputer with immense amounts of entirely unstructured data and it would extract amazing insights and knowledge from that.” But an old saying among computer scientists holds that if you feed computers with bad input, you get bad output, or stated more pithily: “Garbage in, garbage out.”

Big data set out to disprove this “law of nature” in the field of software programming. Gregor Kalberer, who holds a doctorate in computer science from the Swiss Federal Institute of Technology in Zurich, explains that “the infrastructure for big data is indeed capable of very rapidly processing enormous volumes of data that were previously inconceivable, but the principle that scrap iron can’t turn into gold remains inviolable.”

The current holy grail of data analytics is finding a way to “understand” completely random and unstructured data and to prepare it sensibly for actual computation, Gregor Kalberer continues. This step must take place with the least possible effort, he says. “If I have to elaborately format the bulk of the data manually, I don’t gain any efficiency, regardless of how powerful the computers doing the subsequent calculating are.”

Misconception #4: Eating Cheese Promotes Golfing
An imprecise definition, the slow development of intelligent systems, low-quality input: All these factors are delaying big data’s breakthrough. Moreover, big data specialists are in short supply in many places. Yet another obstacle, however, often gets overlooked: In order for big data to produce truly useful output, computers would have to be capable of distinguishing correlation from causation. They would have to be able to determine whether a relationship between two variables exists purely by chance.

Tyler Vigen entertainingly demonstrates that there’s not always a causal connection behind correlations. The US native’s blog, Spurious Correlations, and his book by the same name were the inspiration for this article’s photo sequence.

But how are computers to know, for example, that cheese consumption and golf course revenue in the USA correlate almost perfectly, but do not have any actual cause-and-effect connection? Or something a bit less trivial: How are they to know that ski-pass sales correlate with slopeside food and beverage consumption, yet the true causal variables are other ones such as the weather or the amount of snow on the ground? One oft-cited example illustrating this shortcoming of big data is the failure of Google Flu Trends. The idea was to use Google search queries to predict flu epidemics faster than before. It turned out, though, that many people who actually weren’t ill googled terms like “coughing” or “fever” because they had just, for instance, watched a TV health program on those symptoms. Google discontinued the service a couple of years later.
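The cheese-and-golf pattern is easy to reproduce: two series that share nothing but an upward trend will correlate almost perfectly, even though neither causes the other. A minimal sketch with synthetic data (all numbers are invented for illustration, not the real cheese or golf figures):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two synthetic yearly series that share nothing but an upward trend
# (stand-ins for "cheese consumption" and "golf course revenue").
years = np.arange(2000, 2020)
cheese = 30 + 0.5 * (years - 2000) + rng.normal(0, 0.3, years.size)
golf = 20 + 0.8 * (years - 2000) + rng.normal(0, 0.5, years.size)

# Pearson correlation is close to 1, although the only thing the two
# series have in common is that both drift upward over time.
r = np.corrcoef(cheese, golf)[0, 1]
print(f"raw correlation: {r:.2f}")

# Correlating the year-over-year changes instead removes the shared
# trend; what remains is essentially uncorrelated noise.
r_detrended = np.corrcoef(np.diff(cheese), np.diff(golf))[0, 1]
print(f"detrended correlation: {r_detrended:.2f}")
```

A human analyst would check for exactly this kind of shared trend or lurking third variable before claiming a causal link; a naive big data pipeline that merely scans for strong correlations would report cheese and golf as tightly connected.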

What Took Hours to Fail Works in Seconds
With Flu Trends, Google impressively demonstrated where big data doesn’t work, but it has also convincingly shown where and how the technology is actually capable of delivering astoundingly good results: The Google search engine can scan billions of websites simultaneously and is additionally able to rank search results. “If we employ big data the right way, it will be a game changer,” Gregor Kalberer says with conviction. “Through multiple tests at SIX, we have demonstrated that computing processes that previously broke down after several hours can now be executed successfully in a matter of seconds.”

 

If we employ big data the right way, it will be a game changer.

Gregor Kalberer, SIX


Success stories of that kind can be found in almost every company, industry, and country, which is why the big data market continues to expand. Research firm MarketsandMarkets estimates its growth rate at 17.6% annually and sees big data becoming a USD 80 billion market by 2023 (up from almost USD 30 billion in 2017).

There indeed is additional evidence that big data is on the cusp of reaching adulthood. Pharmaceutical giant Roche last year bought Flatiron Health, a New York-based tech startup that analyzes patient data on a massive scale. Roche expects the acquired company to give it enormous research-and-development advantages in the field of oncology. The music on the other bank of the Rhine in Basel sounds the same: Novartis never tires of repeating the mantra that data analytics could lead to a “productivity revolution” in the pharma industry. With the help of digital technologies, the cost of clinical trials could be reduced by up to 25%, Novartis CEO Vasant Narasimhan says.

Artificial Intelligence Is Being Revived by Big Data
Nevertheless, “big data” isn’t the buzz term of the moment; “artificial intelligence” is. Blame that in part on big data itself. In the early days of artificial intelligence, AI applications were often able to process only small amounts of data within a useful time frame. The infrastructure for big data helps to remove that limitation. With the advent of the Internet of Things (the network of interconnected devices) and the rollout of 5G wireless communication technology (which has many times more bandwidth capacity and is much faster than the 4G technology now in widespread use), data volumes and the number of possibilities for using that data will increase even further.

However, Gregor Kalberer says that some conditions must first be met to allow the technology to fully live up to its promise. “First, the use case must be clearly defined. Second, the pertinent information must exist in the original data. And third, the data must be computer-processable.” Data formatted for further processing is referred to as “smart data.” Such data isn’t smart in itself, but it does contain pertinent information in a form that enables actual knowledge to be gained. “This way, artificial intelligence isn’t even needed to derive added value from big data.” Traditional reporting and modeling would already suffice.
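Kalberer’s three conditions, and the idea that plain reporting on “smart data” already delivers value without any AI, can be sketched in a few lines. The log format, field names, and figures below are invented for illustration; they are not SIX data:

```python
import re
from collections import Counter

# Hypothetical raw "big data": unstructured log lines as they might
# arrive from some source system (format and values are made up).
raw_lines = [
    "2023-04-01 09:00:01 TRADE CHF 1200 OK",
    "2023-04-01 09:00:02 TRADE EUR 800 OK",
    "2023-04-01 09:00:02 TRADE CHF 450 FAILED",
    "garbled line without structure",
    "2023-04-01 09:00:03 TRADE EUR 300 OK",
]

# Step 1: turn raw data into "smart data" - extract the pertinent
# fields into a computer-processable form and drop lines that carry
# no recoverable information.
pattern = re.compile(r"\S+ \S+ TRADE (\w+) (\d+) (\w+)")
records = []
for line in raw_lines:
    match = pattern.match(line)
    if match:
        currency, amount, status = match.groups()
        records.append({"currency": currency, "amount": int(amount), "status": status})

# Step 2: traditional reporting on the structured records - a simple
# aggregation answers the (clearly defined) use case, no AI required.
volume_per_currency = Counter()
for record in records:
    if record["status"] == "OK":
        volume_per_currency[record["currency"]] += record["amount"]

print(dict(volume_per_currency))  # successful volume per currency
```

The use case is defined up front (volume per currency), the pertinent information exists in the original lines, and the parsing step makes it processable; once those three conditions hold, a plain aggregation extracts the value.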