I had an interesting conversation yesterday with INSTAAR post-doc and statistical modeling whiz Chris Randin. We were discussing the relative importance of model fit and model predictive power in multivariate statistical models. Basically, I was arguing that we should optimize our models for predictive power and not for fit, especially because you’re probably going to overfit anyway. Chris countered convincingly that optimizing for fit and optimizing for predictive power give you different kinds of information.
To back up and give an example, say you have a response variable like nitrogen dioxide pollution levels across a landscape, and you have a bunch of predictor variables like temperature, moisture, vegetation greenness, and human population across the same landscape. You want to figure out interesting stuff about the relationships between these predictor variables and the response variable. This is an increasingly common situation in our increasingly data-saturated world.
One thing you could figure out is the strength of the correlations between individual predictor variables and the response variable (nitrogen dioxide levels) using something like a scatterplot matrix, like this one (just an example):
So, you fit one predictor to one response. The model's parameters connect your predictor to the response, and the accuracy of the match is your fit (r²). The slope and intercept coefficients are readily interpretable (good old y = mx + b).
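To make the single-predictor case concrete, here is a minimal sketch (using NumPy, with fabricated temperature and NO2 numbers, not data from any real study) of fitting y = mx + b by least squares and computing r²:

```python
import numpy as np

# Fabricated example: NO2 levels against a single predictor (temperature).
rng = np.random.default_rng(0)
temperature = rng.uniform(5, 30, size=100)              # made-up predictor
no2 = 2.0 * temperature + 10 + rng.normal(0, 5, 100)    # made-up response

# np.polyfit with deg=1 returns [slope, intercept], i.e. m and b
slope, intercept = np.polyfit(temperature, no2, deg=1)

predicted = slope * temperature + intercept
ss_res = np.sum((no2 - predicted) ** 2)        # residual sum of squares
ss_tot = np.sum((no2 - no2.mean()) ** 2)       # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(f"y = {slope:.2f}x + {intercept:.2f}, r^2 = {r_squared:.3f}")
```

Both coefficients come straight out of the fit and mean exactly what they say: NO2 change per degree of temperature, plus a baseline.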
It gets more complicated, though, when you start using multiple predictors. The fitted parameters are harder to interpret, and the best fit will always improve as you add more terms: the model with all variables and all of their interactions will always give the best fit (highest r² or whatever other measure you are using). Statisticians have thought of this, of course, so you can penalize the model for having additional terms (criteria like adjusted r² and AIC do exactly this), though I am a little skeptical of these penalties because their magnitude seems a little arbitrary (maybe I just don't understand the genius behind the techniques). Stepwise regression is a procedure for working through a lot of potential model formulations with different variable combinations and deciding which one is best, though again, I am skeptical of the criteria for variable inclusion that go into stepwise-type procedures, as are others. Most statistical modeling books have a very sad chapter about how hard this process is and all of the heinous pitfalls involved.
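A quick sketch of the penalty idea, again with fabricated data: plain r² never decreases when you add predictors to a nested OLS model, while adjusted r² charges a price for each extra term (a pure-noise "junk" predictor is included to show this).

```python
import numpy as np

# All data here are fabricated for illustration.
rng = np.random.default_rng(1)
n = 200
temperature = rng.normal(20, 5, n)
moisture = rng.normal(50, 10, n)
junk = rng.normal(0, 1, n)  # a predictor with no real signal
no2 = 1.5 * temperature + 0.3 * moisture + rng.normal(0, 4, n)

def fit_r2(X, y):
    """Ordinary least squares; return plain and adjusted r^2."""
    X1 = np.column_stack([np.ones(len(y)), X])      # intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
    p = X.shape[1]                                  # number of predictors
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - p - 1)
    return r2, adj

models = {
    "temp": np.column_stack([temperature]),
    "temp+moist": np.column_stack([temperature, moisture]),
    "temp+moist+junk": np.column_stack([temperature, moisture, junk]),
}
results = {name: fit_r2(X, no2) for name, X in models.items()}
for name, (r2, adj) in results.items():
    print(f"{name:16s} r^2 = {r2:.4f}   adjusted r^2 = {adj:.4f}")
```

The junk predictor still nudges plain r² upward, which is exactly why the fit criterion alone can't be trusted to pick the model.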
But what if we define the best model as the one that has the best predictive power? Then the model selection has a really clear criterion. There are actually a lot of great techniques such as cross-validation and bootstrapping that allow you to test your model’s predictive power in really neat and innovative ways. We could test a whole lot of models using these tests of predictive power and get a printout of how well each model does using each of the different tests. I don’t think this approach is done very often, but if it’s prediction that you are interested in, then this may be the best way to go. Chris is right though that there may be other aspects of your data that might be better explored by optimizing for fit. I’ll probably have more to say on this as I do more of my own models with the data from my study site.
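Here is a minimal sketch of k-fold cross-validation as that clear selection criterion: prefer the model with the lowest average out-of-fold prediction error. The data and the pure-noise "junk" predictors are fabricated for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
temperature = rng.normal(20, 5, n)
no2 = 1.5 * temperature + rng.normal(0, 4, n)

def cv_mse(X, y, k=5):
    """Mean squared out-of-fold prediction error for an OLS model."""
    idx = rng.permutation(len(y))
    errors = []
    for fold in np.array_split(idx, k):             # k held-out folds
        train = np.setdiff1d(idx, fold)
        X_train = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(X_train, y[train], rcond=None)
        X_test = np.column_stack([np.ones(len(fold)), X[fold]])
        errors.append(np.mean((y[fold] - X_test @ beta) ** 2))
    return float(np.mean(errors))

# A sensible 1-predictor model vs. the same model padded with 60
# pure-noise predictors: the padded model fits the training folds
# better but predicts the held-out folds worse.
junk = rng.normal(size=(n, 60))
simple_err = cv_mse(temperature[:, None], no2)
padded_err = cv_mse(np.column_stack([temperature[:, None], junk]), no2)
print(f"CV error, 1 predictor:   {simple_err:.2f}")
print(f"CV error, 61 predictors: {padded_err:.2f}")
```

Unlike the in-sample fit, the cross-validated error actually punishes the overfit model, which is what makes prediction such an appealing criterion for choosing between candidate models.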