Top 5 Reasons Not to Use Hadoop for Analytics
As a former diehard fan of Hadoop, I LOVED the fact that you can work on petabytes of data. I loved the ability to scale to thousands of nodes to process a large computation job. I loved the ability to store and load data in a very flexible format. In many ways, I loved Hadoop – until I tried to deploy it for analytics. That’s when I became disillusioned with Hadoop (it just "ain't all that").
At Quantivo, we’ve explored many ways to deploy Hadoop to answer analytical queries (trust me – I made every attempt to include it in my day job). At the end of the day, it became an exercise much like trying to build a house with just a hammer: conceivably, it’s possible, but it’s unnecessarily painful and ridiculously cost-inefficient.
Let me share with you my top reasons why Hadoop should not be used for Analytics.
1 - Hadoop is a framework, not a solution – For many reasons, people have an expectation that Hadoop answers Big Data analytics questions right out of the box. For simple queries, this works. For harder analytics problems, Hadoop quickly falls flat and requires you to develop Map/Reduce code directly. For that reason, Hadoop is more like a J2EE programming environment than a business analytics solution.
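To make concrete what "developing Map/Reduce code directly" looks like, here's a toy sketch of the programming model – in plain Python rather than Hadoop's Java API, with a hypothetical page-view log as input. The point is that even a one-line SQL question ("count views per user") becomes a pair of hand-written functions plus plumbing:

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Run the user's map function over every input record."""
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    """Group values by key -- on a real cluster this step moves data between nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    """Run the user's reduce function once per key."""
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The "analytics query": count page views per user (made-up log lines).
log_lines = ["alice /home", "bob /cart", "alice /cart"]

def map_fn(line):
    user, _page = line.split()
    yield (user, 1)

def reduce_fn(user, counts):
    return sum(counts)

result = reduce_phase(shuffle(map_phase(log_lines, map_fn)), reduce_fn)
# result == {"alice": 2, "bob": 1}
```

Every new question means writing, testing, and deploying a new `map_fn`/`reduce_fn` pair – which is exactly why it feels like a programming environment rather than an analytics tool.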
2 - Hive and Pig are good, but do not overcome architectural limitations – Both Hive and Pig are very well-thought-out tools that enable the lay engineer to quickly become productive with Hadoop. After all, Hive and Pig are two tools that translate analytics queries written in SQL-like or scripting syntax into Java Map/Reduce jobs that can be deployed in a Hadoop environment. However, there are limitations in the Map/Reduce framework of Hadoop that prohibit efficient operation, especially when you require inter-node communication (as is the case with sorts and joins).
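To see why joins in particular hurt, here's a hedged sketch of a reduce-side join – the kind of plan a simple SQL JOIN compiles down to. The table names and rows are invented for illustration; the key idea is real: every row from both tables must be tagged, re-keyed, and shuffled so matching rows land on the same reducer, which on a cluster means shipping both tables across the network:

```python
from collections import defaultdict
from itertools import product

# Two hypothetical "tables" to be joined on user_id.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "lamp"), (2, "mug")]

# Map: tag each row with its source table and emit it under the join key.
tagged = [(uid, ("users", name)) for uid, name in users] + \
         [(uid, ("orders", item)) for uid, item in orders]

# Shuffle: group by join key -- on a cluster, every tagged row is sent
# over the network to the reducer responsible for that key.
groups = defaultdict(lambda: {"users": [], "orders": []})
for key, (table, value) in tagged:
    groups[key][table].append(value)

# Reduce: cross-join the two sides for each key.
joined = [(key, name, item)
          for key, sides in groups.items()
          for name, item in product(sides["users"], sides["orders"])]
# joined contains (1, "alice", "book"), (1, "alice", "lamp"), (2, "bob", "mug")
```

Hive generates this shuffle for you, but it cannot make it go away – the data movement is inherent to the Map/Reduce model, not to the tool.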
3 - Deployment is easy, fast and free, but very costly to maintain and develop – Hadoop is very popular because within an hour, an engineer can download, install, and issue a simple query. It’s also an open source project, so there are no software costs, which makes it a very attractive alternative to Oracle and Teradata. The true costs of Hadoop become obvious when you enter the maintenance and development phase. Since Hadoop is mostly a development framework, Hadoop-proficient engineers are required to develop an application as well as optimize it to execute efficiently in a Hadoop cluster. Again, it’s possible but very hard to do.
4 - Great for data pipelining and summarization, horrible for ad hoc analysis – Hadoop is great at analyzing large amounts of data and summarizing or “data pipelining” to transform the raw data into something more useful for another application (like search or text mining) – that’s what it’s built for. However, if you don’t know the analytics question you want to ask, or if you want to explore the data for patterns, Hadoop becomes unmanageable very quickly. Hadoop is very flexible at answering many types of questions, as long as you spend the cycles to program and execute MapReduce code.
5 - Performance is great, except when it’s not – By all measures, if you want speed and are required to analyze large quantities of data, Hadoop allows you to parallelize your computation across thousands of nodes. The potential is definitely there. But not all analytics jobs can easily be parallelized, especially when user interaction drives the analytics. So, unless the Hadoop application is designed and optimized for the question that you want to ask, performance can quickly become very slow – each map/reduce job has to wait until the previous jobs are completed. A Hadoop pipeline is always as slow as its slowest MapReduce job.
That said, Hadoop is a phenomenal framework for doing some very sophisticated data analysis. Ironically, it’s also a framework that requires a lot of programming effort to get those questions answered.
My colleague David Starke has written a whitepaper which details the differences between Quantivo, Hadoop and SQL approaches to analyzing customer behavior.