On May 10-12, Professor Trevor Hastie gave TI students a glimpse of a new world. During the lectures, as we walked into his land of random forests, we became for a couple of hours cowboys with lassos and fishermen with elastic nets.
Tinbergen Institute Econometrics Lectures 2017
Interview with Trevor Hastie (The John A. Overdeck Professor, Professor of Statistics and Professor of Biomedical Data Science at Stanford University) by Oana Furtuna.
You have been a researcher in statistics in both the academic and the corporate sector. What are the main differences between the two worlds and what do you like best/least about each?
That’s interesting. Regarding the corporate sector, I was at Bell Labs, which was the research lab of AT&T, and it wasn’t quite the same as the corporate sector today; Bell Labs was famous for letting its researchers do whatever they wanted. This was done with the hope that they would produce something useful for the company, but it was not a requirement. Whereas today, if you work at Google or Facebook and you are in a research group, they expect you to come up with some deliverables pretty quickly, otherwise you will be nudged into another group. So that was a little different. At the time when I was at Bell Labs it was great because we got supported, we were able to go to conferences, they encouraged us in that.
I went to Bell Labs as soon as I graduated. If I had gone to university straight away I would have been teaching, I would have had to be writing grant proposals to get research money, and do research as well. In a way, industry was much easier for me. And when I eventually went to university, I had already done a lot of research and so I was able to get tenure immediately. By the time I left Bell Labs I had written two books already and after 8-9 years I was fairly well-established.
But getting back to the original question, I think it’s a little different today. Today, if you graduate with a PhD and you go into industry straight away, there are very few places that would not represent shutting a door. Maybe Microsoft Research is still one of the places where you can be for a few years and then go back to academia if you want to, but there are not too many of them.
Related to the corporate sector, data science has become a buzzword nowadays. Do you think this is a hype, a fad that will disappear in some years?
I think there is a bit of hype around the term “data science”, but I don’t think there is hype around the need for smart, computationally-minded, applied statisticians and modelers in industry. I think they have a real need for that, because things like artificial intelligence (AI) are getting used more and more in all kinds of things (almost everything we use today is smart) and I think for that you need these kinds of skills. Right now it’s called data science, which is a catch-all for lots of things, but I think that the need will stay; what they will call it in the future, who knows?
Do you perceive the growing popularity of data science careers as a threat to the quality of the conducted analyses? Should practitioners be cognizant of all the statistical assumptions and numerical algorithms underlying the packages they use?
A lot of people are worried about the potential threat to the field of statistics, because they feel that others are “stealing our turf” now that computer science departments run data science programs. I am actually of the belief that if you want to be a good data scientist you need a firm statistical grounding, and if you don’t have that you won’t be a good data scientist in the long run. There are plenty of examples where people have implemented ideas, maybe the program was really good and all that, but the statistical underpinnings were weak and in the end it was not successful.
“if somebody has a good grounding in statistics they are going to be able to do a better job”
I would like to believe that if somebody has a good grounding in statistics they are going to be able to do a better job. I know plenty of very smart people, good programmers, but they don’t have the right sense about what methods they should be using and what not. And there is a huge array of statistical methods (people write papers all the time) many of which aren’t any good, so being able to select the right methods comes from good training in statistics and common sense in working with data. So you often see these smart people chasing after the wrong methods and wasting everyone’s time.
It feels that the competitive pay and the evident link between applied statistics/econometrics and data science entice many very talented PhD students to pursue an applied career in a company as opposed to an academic path. Do you think this is a loss for research in the field?
It’s a bit of a loss, yes. I am not sure what can be done about that, say, in the European community. I know in the U.S., for example, that at any given university the salary structure for professors can be different across fields and in some cases statistics is remunerated less than economics. That is going to change. With the rise of data science, statisticians will be respected a bit more and the salary will go up. I think in the European system salaries are much more standardized, though.
“You do pay a price to be an academic, but you get a lifestyle which is very nice”
All in all yes, you make much more money if you go into industry. Still, being an academic comes with a lifestyle that is important too. We do pay a price to be an academic, but you get a lifestyle which is very nice and you don’t get as stressed out as you do in industry, so there are two sides to the story.
So you do not feel that the research frontier is now mainly shifted to the hands of the corporate sector and that people who would have done research in a university will simply switch to conducting it in this corporate environment?
Not so much. Most corporations can’t really afford a genuine research group, unless it’s very big. You get some of these very big investment companies and they might have a research group, but in general corporations can’t afford that. So I think the research is generally still done at the universities. And these days, if you are a statistician or a data scientist at a university you can get really good consulting jobs with industry. Then you can have the best of both worlds. I consult for several companies, and it is not just that you make money, but you learn about new, interesting problems, which is nice.
Finally, what skills do you think are of paramount importance as an applied statistician?
A good load of common sense is one of the skills. Experience is always important, and I think you need to be comfortable programming. You need to be able to try things out yourself so you need an environment where you can do that, whether Matlab or Python or R (I would rather use R). That is really important, because when you see a problem you want to be able to get some data and try an idea out, to see if it makes sense. Then you need to have done some training in the core methods.
“A good load of common sense”
Regarding the methods themselves, it seems that the ‘big data’ approach is appreciated for its ability to fit and forecast/predict. However, this often occurs in the absence of a clear causal link between the predictors and the outcomes. How important is data mining relative to having a structural model that can inform the estimation?
When you say “structure”, you are actually saying you would like to learn this causal pathway from the data. That is more in the realm of traditional statistical inference, where you understand the roles of variables and causality. In data mining it is harder, much harder.
Machine learning was developed in the computer science world and I don’t think causality was close to their heart; it was all about prediction. It is now coming to statistics and people are starting to think more about how these things can be used and incorporated into more traditional tasks. For example, one of the areas where we look for causality is trying to estimate treatment effects in observational data, traditionally using propensity scores or instrumental variables. That is a whole area of studying some aspects of causality and there is a lot of work going on right now that tries to incorporate these more modern methods in those techniques. We are working on that right now, so people are trying to do that.
Say we have obtained predictions. How can we use these for recommendations/policy? For example in health, how can you distinguish cause from effect?
That is harder. The specific example that I have been working on is using observational data from electronic health records to try and choose between two different treatments for heart disease. Both treatments had been approved by the FDA, and in big clinical trials they both seemed to be equally effective, but there is a sense in the community that for some patients one drug is better than the other, so this is an area of personalized medicine.
“Statistics is always in catch-up mode”
So for a subset of people defined by their ethnicity, age and maybe some other factors, we wanted to try to do a causal analysis to find which treatment works better. In this case, it amounted to the estimation of a treatment effect.
Finally, in what directions do you think the field of statistics can develop and grow? What are the main challenges you see for the field?
I think it has always been the case that statistics had to keep up with these other fields like machine learning and computer science. They operate more or less as engineers, coming up with new ideas, often expressed in terms of algorithms, for solving a real problem.
“We have much more data, we should be able to learn more”
They find a real problem and they put together a solution that often works very well. Statistics is always in catch-up mode, trying to find a formulation which expresses the solution in a way that we can understand and incorporate it with other things we have done in the past. I think there is a big role for that and it is important that we do it, because otherwise we will be left behind.
Can statistics still be considered an independent field from computer science nowadays and if not, how do they differ?
The fields have gotten closer in the past, especially as we need to use computing more and more in statistics in dealing with bigger and bigger data. Statisticians, just like economists, have other aspects of modelling in mind that are important to them, so the real question is how one can take from these new exotic methods and incorporate some of the ideas into things that we are used to doing, in a way that can beef up what we are doing and make it more suitable for the modern day.