Michael Jordan: A Statistician's Perspective on Big Data and the Role of Machine Learning in Modeling
Reporter: Aviva Lev-Ari, PhD, RN
Personal Note
I used Michael Jordan’s statistical articles on Brownian motion, including Graphical Models: Foundations of Neural Computation (Computational Neuroscience), on a consulting assignment for Fidelity Investments’ Derivatives Department in 1994.
Michael Jordan: An interview published in IEEE Spectrum
SOURCE
Why Big Data Could Be a Big Fail
Spectrum: If we could turn now to the subject of big data, a theme that runs through your remarks is that there is a certain fool’s gold element to our current obsession with it. For example, you’ve predicted that society is about to experience an epidemic of false positives coming out of big-data projects.
Michael Jordan: When you have large amounts of data, your appetite for hypotheses tends to get even larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be white noise.
Spectrum: How so?
Michael Jordan: In a classical database, you have maybe a few thousand people in it. You can think of those as the rows of the database. And the columns would be the features of those people: their age, height, weight, income, et cetera.
Now, the number of combinations of these columns grows exponentially with the number of columns. So if you have many, many columns—and we do in modern databases—you’ll get up into millions and millions of attributes for each person.
Now, if I start allowing myself to look at all of the combinations of these features—if you live in Beijing, and you ride a bike to work, and you work in a certain job, and are a certain age—what’s the probability you will have a certain disease or you will like my advertisement? Now I’m getting combinations of millions of attributes, and the number of such combinations is exponential; it gets to be the size of the number of atoms in the universe.
Those are the hypotheses that I’m willing to consider. And for any particular database, I will find some combination of columns that will predict perfectly any outcome, just by chance alone. If I just look at all the people who have a heart attack and compare them to all the people that don’t have a heart attack, and I’m looking for combinations of the columns that predict heart attacks, I will find all kinds of spurious combinations of columns, because there are huge numbers of them.
So it’s like having billions of monkeys typing. One of them will write Shakespeare.
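Jordan’s point about spurious predictors is easy to demonstrate in a few lines of code. The sketch below is illustrative only (the sample sizes and variable names are made up, not from the interview): it generates a purely random outcome and a large number of purely random binary features, then counts how many of those noise features happen to predict the outcome perfectly.

```python
# Illustrative simulation: with enough random candidate features, some will
# "predict" a random outcome perfectly, by chance alone. All parameters here
# are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
n_people = 10          # rows of the database (deliberately small)
n_features = 100_000   # candidate binary columns, all pure noise

outcome = rng.integers(0, 2, size=n_people)          # e.g., heart attack yes/no
X = rng.integers(0, 2, size=(n_features, n_people))  # random features

# A feature is a "perfect predictor" if it matches the outcome exactly,
# either directly or with 0/1 flipped.
matches = np.all(X == outcome, axis=1) | np.all(X == 1 - outcome, axis=1)
perfect = int(matches.sum())

# Each noise feature matches by chance with probability 2 / 2**n_people,
# so we expect roughly n_features * 2 / 2**10, i.e. about 195 spurious
# "perfect" predictors in this run.
print(perfect)
```

With only ten people and a hundred thousand candidate columns, on the order of two hundred useless features look like perfect predictors; allowing combinations of columns, as Jordan describes, makes the problem exponentially worse.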
Spectrum: Do you think this aspect of big data is currently underappreciated?
Michael Jordan: Definitely.
Spectrum: What are some of the things that people are promising for big data that you don’t think they will be able to deliver?
Michael Jordan: I think data analysis can deliver inferences at certain levels of quality. But we have to be clear about what levels of quality. We have to have error bars around all our predictions. That is something that’s missing in much of the current machine learning literature.
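The kind of error bars Jordan is asking for can be attached even to a simple estimate with the bootstrap. The following is a minimal sketch using made-up data, not a method prescribed in the interview:

```python
# A minimal sketch of attaching an error bar to an estimate via the
# bootstrap, one generic way to supply the uncertainty quantification
# Jordan says is missing. The data and parameters are illustrative.
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=10.0, scale=2.0, size=100)   # a made-up sample

estimate = data.mean()

# Resample the data with replacement many times and recompute the statistic;
# the spread of those replicates approximates the sampling uncertainty.
reps = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(5000)
])
lo, hi = np.percentile(reps, [2.5, 97.5])   # 95% percentile interval

print(f"mean = {estimate:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```

The point is not this particular statistic but the discipline: every reported number comes with a quantified range, so a consumer of the analysis can tell a firm inference from noise.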
Spectrum: What will happen if people working with data don’t heed your advice?
Michael Jordan: I like to use the analogy of building bridges. If I have no principles, and I build thousands of bridges without any actual science, lots of them will fall down, and great disasters will occur.
Similarly here, if people use data and inferences they can make with the data without any concern about error bars, about heterogeneity, about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re an engineer and a statistician—then you will make lots of predictions, and there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.
And so that’s where we are currently. A lot of people are building things hoping that they work, and sometimes they will. And in some sense, there’s nothing wrong with that; it’s exploratory. But society as a whole can’t tolerate that; we can’t just hope that these things work. Eventually, we have to give real guarantees. Civil engineers eventually learned to build bridges that were guaranteed to stand up. So with big data, it will take decades, I suspect, to get a real engineering approach, so that you can say with some assurance that you are giving out reasonable answers and are quantifying the likelihood of errors.
Spectrum: Do we currently have the tools to provide those error bars?
Michael Jordan: We are just getting this engineering science assembled. We have many ideas that come from hundreds of years of statistics and computer science. And we’re working on putting them together, making them scalable. A lot of the ideas for controlling what are called familywise errors, where I have many hypotheses and want to know my error rate, have emerged over the last 30 years. But many of them haven’t been studied computationally. It’s hard mathematics and engineering to work all this out, and it will take time.
It’s not a year or two. It will take decades to get right. We are still learning how to do big data well.
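As a concrete illustration of the familywise-error issue Jordan mentions, the sketch below runs a thousand two-sample tests in which every null hypothesis is true by construction, and compares a naive 0.05 threshold with a Bonferroni-corrected one (one classical correction among the many ideas he alludes to; the example is mine, not his):

```python
# Every null hypothesis below is true: both samples come from the same
# distribution, so every "discovery" is a false positive. The naive
# per-test threshold produces ~5% false discoveries; the Bonferroni
# threshold controls the chance of *any* false positive in the family.
import numpy as np
from math import erfc, sqrt

rng = np.random.default_rng(1)
n_tests, n, alpha = 1000, 50, 0.05

naive = bonferroni = 0
for _ in range(n_tests):
    a, b = rng.normal(size=n), rng.normal(size=n)
    # Two-sample z statistic; with n=50 per group the normal approximation
    # to the t distribution is close enough for illustration.
    z = (a.mean() - b.mean()) / sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    p = erfc(abs(z) / sqrt(2))          # two-sided p-value
    naive += p < alpha
    bonferroni += p < alpha / n_tests   # Bonferroni-corrected threshold

print(naive, bonferroni)
```

The naive count lands near 50 false discoveries out of 1000 pure-noise tests, while the corrected count is essentially zero; scaling such corrections to the hypothesis spaces of modern databases is part of the hard engineering Jordan describes.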
Spectrum: When you read about big data and health care, every third story seems to be about all the amazing clinical insights we’ll get almost automatically, merely by collecting data from everyone, especially in the cloud.
Michael Jordan: You can’t be completely a skeptic or completely an optimist about this. It is somewhere in the middle. But if you list all the hypotheses that come out of some analysis of data, some fraction of them will be useful. You just won’t know which fraction. So if you just grab a few of them—say, if you eat oat bran you won’t have stomach cancer or something, because the data seem to suggest that—there’s some chance you will get lucky. The data will provide some support.
But unless you’re actually doing the full-scale engineering statistical analysis to provide some error bars and quantify the errors, it’s gambling. It’s better than just gambling without data. That’s pure roulette. This is kind of partial roulette.
Spectrum: What adverse consequences might await the big-data field if we remain on the trajectory you’re describing?
Michael Jordan: The main one will be a “big-data winter.” After a bubble, in which people invested and a lot of companies overpromised without providing serious analysis, it will bust. And soon, in a two- to five-year span, people will say, “The whole big-data thing came and went. It died. It was wrong.” I am predicting that. It’s what happens in these cycles when there is too much hype, i.e., assertions not based on an understanding of what the real problems are, or on an understanding that solving those problems will take decades of steady progress rather than a sudden technical leap. And then there will be a period during which it will be very hard to get resources to do data analysis. The field will continue to go forward, because it’s real, and it’s needed. But the backlash will hurt a large number of important projects.
Big Data, Hype, the Media and Other Provocative Words to Put in a Title
SOURCE
I’ve found myself engaged with the media recently, first in the context of an “Ask Me Anything” (AMA) on reddit.com http://www.reddit.com/r/MachineLearning/comments/2fxi6v/ama_michael_i_jordan/ (a fun and engaging way to spend a morning), and then for an interview that has been published in the IEEE Spectrum.
That latter process was disillusioning. Well, perhaps a better way to say it is that I didn’t harbor that many illusions about science and technology journalism going in, and the process left me with even fewer.
The interview is here: http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestro-michael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts
Read the title and the first paragraph and attempt to infer what’s in the body of the interview. Now go read the interview and see what you think about the choice of title.
Here’s what I think.
The title contains the phrase “The Delusions of Big Data and Other Huge Engineering Efforts”. It took me a moment to realize that this was the title that had been placed (without my knowledge) on the interview I did a couple of weeks ago. Anyone who knows me, or who’s attended any of my recent talks, knows that I don’t feel that Big Data is a delusion at all; rather, it’s a transformative topic, one that is changing academia (e.g., for the first time in my 25-year career, a topic has emerged that almost everyone in academia feels is on the critical path for their sub-discipline), and is changing society (most notably, the micro-economies made possible by learning about individual preferences and then connecting suppliers and consumers directly are transformative). But most of all, from my point of view, it’s a *major engineering and mathematical challenge*, one that will not be solved by just gluing together a few existing ideas from statistics, optimization, databases and computer systems.
I.e., the whole point of my shtick for the past decade is that Big Data is a Huge Engineering Effort and that that’s no Delusion. Imagine my dismay at a title that said exactly the opposite.
The next phrase in the title is “Big Data Boondoggles”. Not my phrase, nor my thought. I don’t talk that way. Moreover, I really don’t see anything wrong with anyone gathering lots of data and trying things out, including trying out business models; quite to the contrary. It’s the only way we’ll learn. (Indeed, my bridge analogy from later in the article didn’t come out quite right: I was trying to say that historically it was crucial for humans to start to build bridges, and trains, etc., before they had serious engineering principles in place; the empirical engineering effort had immediate positive effects on humans, and it eventually led to the engineering principles. My point was just that it’s high time we realize that, with respect to Big Data, we’re now at the “what are the principles?” point in time. We need to recognize that poorly thought-out approaches to large-scale data analysis can be just as costly as bridges falling down. E.g., think of individual medical decision-making, where false positives can lead, and already are leading, to unnecessary surgeries and deaths.)
Next, the first paragraph implies that I think neural-based chips are “likely to prove a fool’s errand”. Not my phrase, nor my thought. I think that it’s perfectly reasonable to explore such chip-building; it’s even exciting. As I mentioned in the interview, I do think that a problem with that line of research is that it puts architecture before algorithms and understanding, and that’s not the way I’d personally do things, but others can beg to differ, and by all means I think that they should follow their instincts.
The interview then proceeds, with the interviewer continually trying to get me to express black-and-white opinions about issues where the only reasonable response is “gray”, and with my overall message, that Big Data is Real but that It’s a Huge Engineering Challenge Requiring Lots of New Ideas and a Few Decades of Hard Work, continually getting lost; I (valiantly, I hope) resist. When we got to the Singularity and quantum computing, though, areas where no one in their right mind would imagine that I’m an expert, I despaired that the real issues I was trying to discuss were not really the point of the interview, and I was glad when the hour was over.
Well, at least the core of the article was actually me in my own words, and I’m sure that anyone who actually read it realized that the title was misleading (at best).
But why should an entity such as the IEEE Spectrum allow an article to be published where the title is a flat-out contradiction to what’s actually in the article?
I can tell you why: It’s because this title and this lead-in attracted an audience.
And it was precisely this issue that I alluded to in my response to the first question—i.e., that the media, even the technology media that should know better, has become a hype-creator and a hype-amplifier. (Not exactly an original thought; I know…). The interviewer bristled, saying that the problem is that academics put out press releases that are full of hype and the poor media types don’t know how to distinguish the hype from the truth. I relented a bit. And, sure, he’s right, there does seem to be a growing tendency among academics and industrial researchers to trumpet their results rather than just report them.
But I didn’t expect to become a case in point. Then I saw the title and I realized that I had indeed become a case in point. I.e., here we have a great example of exactly what I was talking about—the media willfully added some distortion and hype to a story to increase the readership. Having the title be “Michael Jordan Says Some Reasonable, But Somewhat Dry, Academic, Things About Big Data” wouldn’t have attracted any attention.
(Well “Michael Jordan” and “Big Data” would have attracted at least some attention, I’m afraid, but you get my point.)
(As for “Maestro”, usually drummers aren’t referred to as “Maestros”, so as far as that bit of hyperbole goes I’m not going to complain… :-).)
Anyway, folks, let’s do our research, try to make society better, enjoy our lives and forgo the attempts to become media darlings. As for members of the media, perhaps the next time you consider adding that extra dollop of spin or hype… Please. Don’t.
Mike Jordan