Monthly Archives: October 2012

More Is Always Better: The Power Of Simple Ensembles

This website’s goal is to develop and explain a data science philosophy – overkill analytics – that leverages computing scale and rapid development technologies to produce faster, better, and cheaper solutions to predictive modeling problems.   To achieve this goal, one core question must be answered: when attacking data science problems, how can we use CPU as a substitute for IQ?   This post will discuss the fundamental ‘overkill’ weapon for addressing this question – ensemble learning.

Ensembles are nothing new, of course; they underlie many of the most popular machine learning algorithms (e.g., random forests and generalized boosted models) .   The theory is that consensus opinions from diverse modeling techniques are more reliable than potentially biased or idiosyncratic predictions from a single source.    More broadly, this principle is as basic as “two heads are better than one.”   It’s why cancer patients get second opinions, why the Supreme Court upheld affirmative action, why news organizations like MSNBC and Fox hire journalists with a wide variety of political leanings…

Well, maybe the principle isn’t universally applied.   Still, it is fundamental to many disciplines and holds enormous value for the data scientist.   Below, I will explain why, by addressing the following:

  • Ensembles add value: Using ‘real world’ evidence from Kaggle and GigaOM‘s recent WordPress Challenge, I’ll show how very basic ensembles – both within my own solution and, more interestingly, between my solution and the second-place finisher – added significant improvements to the result.
  • Why ensembles add value:  For the non-expert (like me), I’ll give a brief explanation of why ensembles add value and support overkill analytics’ chief goal: leveraging computing scale productively.
  • How ensembles add value: Also for the non-expert, I’ll run through a ‘toy problem’ depiction of how a very simple ensemble works, with some nice graphs and R code if you want to follow along.

Most of this very long post may be extremely rudimentary for seasoned statisticians, but rudimentary is the name of the game at Overkill Analytics.   If this is old hat for you, I’d advise just reading the first section – which has some interesting real world results – and glancing at the graphs in the final section (which are very pretty and are useful teaching aids on the power of ensemble methods).

Read more …