If you’re here, you’ve probably already read this, but GigaOM’s Derrick Harris wrote this great article discussing their WordPress Challenge and the fledgling overkill analytics approach I used to create the winning entry.
Forgive me for the self-indulgence, but here’s a quick excerpt:
And therein lies the beauty of overkill analytics, a term that Carter might have coined, but that appears to be catching on — especially in the world of web companies and big data. Carter says he doesn’t want to spend a lot of time fine-tuning models, writing complex algorithms or pre-analyzing data to make it work for his purposes. Rather, he wants to utilize some simple models, reduce things to numbers and process the heck out of the data set on as much hardware as is possible.
I’m pleased GigaOM found the overkill analytics characterization worth discussing. It’s brought a host of visitors to the site (at least compared to the readership I thought I’d have).
With the additional visits, though, I feel I should put a little more meat on the bones about my development philosophy. I’m sure this will be familiar territory for seasoned data scientists, but below are four principles that guide my approach:
- Spend most of your time engineering mass quantities of features. Synthesizing raw data about the subject into pertinent, lower-dimensional metrics always brings the biggest bang for your buck. More features with more diversity are always better, even if some are simplistic or unsophisticated.
- Spend very little of your time comparing, selecting, and fine-tuning models. A simple ensemble of many crude models is usually better than a perfectly calibrated ensemble of more precise models.
- Spend no time making your algorithm elegant, optimized, or theoretically sound. Use cheap servers and cheap tricks instead.
- Get results first; explain them later. The statistical algorithms available are powerful and unbiased – they will find the key elements before you do. Your job is to feed them mass quantities of features and then explain and interpret what they find, not guide them to a preconceived intuition about the answer.
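To make the second principle concrete, here is a minimal sketch (not the actual competition code) of what an ensemble of crude, untuned models looks like in scikit-learn; the data set and the particular models are made up for illustration:

```python
# Illustrative sketch: average the predictions of several crude,
# deliberately untuned models instead of tuning one precise model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Three simple models, left at (mostly) default settings.
models = [LogisticRegression(max_iter=1000),
          DecisionTreeClassifier(max_depth=3),
          GaussianNB()]

for m in models:
    m.fit(X_train, y_train)

# Crude ensemble: an unweighted average of predicted probabilities.
avg_prob = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
ensemble_pred = (avg_prob >= 0.5).astype(int)
print("ensemble accuracy:", (ensemble_pred == y_test).mean())
```

None of the component models is tuned at all; the point is that the averaging step, not the calibration of any one model, does most of the work.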
To an extent, this is just a restatement of some best practices in predictive modeling. However, overkill analytics takes these principles to new and hopefully productive extremes by leveraging cheap cloud computing power and rapid development practices. It is in this respect that I hope to offer some innovation and new approaches.
Anyway, thanks for the visits. I will publish a couple more posts next week on the WordPress Challenge, one describing and ranking the full set of features I used, and one on the power of simple ensembles to improve results on this type of real-world problem.
As always, thanks for reading.
Great post, Carter. I feel the same way about ML, but it’s good to hear your perspective. I like your emphasis on ec2, and I’d love to get a more detailed walkthrough of your machine learning process.
Specifically:
Do you ever use a local machine, or is it all ec2? What do you develop in on ec2? I like ipython notebook; I’d imagine it would be quite friendly on amazon.
What kind of models do you employ, and how many features can they handle for you? I use random forests a lot and find they usually start getting quite slow with more than a few thousand features. I know naive bayes can handle more, but I never get better performance.
I often find that features are highly ranked by random forest but don’t improve my cross validated performance. When I find this I often take them out. But maybe I shouldn’t worry about it so much. What’s your experience?
Anyways, I realize this is a slew of questions; it’s just really cool to find someone talking about their kaggle style so openly. I feel like kaggle winners are the best to learn from because they don’t dwell too much on theoretical concerns. I’ll be happy to read about any more insights you have.
Thanks!
Benjamin:
Sorry for the late reply, just returning from vacation.
I use ec2 exclusively for side projects, but most of my day job uses local machines. For ec2 development, I’m a bit of a minimalist: I use Notepad++ and a tool called NppToR (which allows you to use hotkeys to copy text from Notepad++ to R or to a PuTTY terminal) on my local machine. I use ipython (just as an interpreter) and vanilla R over PuTTY on the remote server. I also install an NX server and RStudio on the remote machine and remote desktop into it, but that’s typically just for visualizations. I’m going to explore ipython notebook, however, as I have been seeing a lot of good feedback on it.
My ‘first approximation’ on many problems is to use a very simple glm and random forest together. Per my overkill philosophy, I usually just try to expand hardware or use shortcuts to get past performance problems. Honestly, though, I typically don’t like to throw thousands of features in a single model. I prefer to use intermediate dimension-reducing steps whenever possible – ‘intermediate’ models (i.e., essentially ensemble components), clusters, domain-specific feature processing. Anything to keep the number of features in any one learning algorithm low (<100 as a rule of thumb). For a random forest, this may be as simple as lowering the # of variables per tree and increasing the number of trees. Per my new post, I like to make the solution broader rather than deeper.
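For the random forest shortcut above (fewer variables per tree, more trees), a minimal scikit-learn sketch, with entirely made-up data, might look like this; `max_features` is sklearn’s analog of the per-split variable count (`mtry` in R’s randomForest):

```python
# Sketch: trade depth for breadth in a random forest by lowering the
# number of candidate variables considered at each split and raising
# the number of trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a wide feature matrix.
X, y = make_classification(n_samples=1000, n_features=200, random_state=0)

# Few variables per split, many trees.
rf = RandomForestClassifier(n_estimators=500, max_features=5, random_state=0)
rf.fit(X, y)
print("training accuracy:", rf.score(X, y))
```

Each individual tree sees only a sliver of the feature space, but the large forest still covers it; that is the broader-rather-than-deeper trade in miniature.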
I prefer to use ensemble methods to weed out noisy and irrelevant variables. Often, I'll use a glm with some feature selection criteria (stepwise or other), but then add a random forest or similar model with access to all possible features. I don't trust absolute 1/0 decisions on feature inclusion, and would rather wash out a noisy variable through ensemble techniques where possible.
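As a rough sketch of that glm-plus-forest blend (again not the competition code): here univariate `SelectKBest` scoring stands in for the stepwise selection mentioned above, and the data is synthetic:

```python
# Sketch: a glm restricted to selected features, plus a random forest
# with access to every feature, blended so a noisy variable's influence
# is washed out rather than hard-excluded by a 1/0 inclusion decision.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=600, n_features=50, n_informative=8,
                           random_state=0)

# glm side: univariate selection (a stand-in for stepwise) feeds a
# logistic regression only the top-scoring features.
glm = make_pipeline(SelectKBest(f_classif, k=10),
                    LogisticRegression(max_iter=1000))
glm.fit(X, y)

# Forest side: sees all features, noisy ones included.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Blend: a simple average of the two probability estimates.
blend = (glm.predict_proba(X)[:, 1] + rf.predict_proba(X)[:, 1]) / 2
print("blended accuracy (training data):", ((blend >= 0.5) == y).mean())
```

A feature the glm drops can still contribute through the forest, but at half weight, so a genuinely noisy variable is diluted rather than either trusted fully or thrown away.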
Hope this is informative. Sorry if any of this is cryptic or uses incorrect terms – I’m kind of a hack at this and often use machine learning language incorrectly. Happy to elaborate further here or by e-mail.
Thanks for the careful response, @Carter, especially in regards to feature selection (don’t) and dealing with lots of features. As for ipython notebook, you might try using it as a better interpreter than plain ipython; the learning curve is very shallow.