"Big data" entered our language before anyone knew what it meant. So then we spent a lot of time discussing it: "Is it really about the ‘bigness’?”, “Isn’t it about non-relational data?”, “No wait, it’s about the the need for speed." This got boiled down to the three Vs (volume, variety, velocity), but then “big data” just meant three things, which didn’t clarify much at all.
So we, the tech community, are developing new vocabulary and distinctions, and in 2013, no one is going to say “big data” anymore. (Actually, given that Dilbert already skewered big data, its heyday may already be over.)
This is the life-cycle of any good buzzword. A buzzword is born when something so new and important is happening that we need to talk about it before we understand it, while it is still amorphous. It refers to a family of related concepts. Then we develop greater understanding and distinctions, and pretty soon you’re embarrassed for your colleague when he trots out last year’s buzzword (remember Web 2.0?).
So what is the crux of “big data”? Why is it so new and important that we have to talk about it with a buzzword? In short, we’re all freaking out because old bottlenecks recently got shattered, the new bottlenecks are us and our existing tools, and mad riches are visible just over the horizon. (And it’s not just about riches: there’s also massive potential for human improvement.)
Processing throughput of large data volumes was part of the old bottleneck. Massively parallel supercomputers required expensive hardware, and were brittle, requiring extreme specialization. So their price was too high for most applications, but the data volume problem was getting worse because data is exploding.
Fortunately, Hadoop shattered the processing bottleneck, bringing near-linear scaling to many applications. Despite Hadoop being notoriously difficult to use, it is an order of magnitude easier than parallelizing computation manually, and it runs on inexpensive commodity hardware. The ever-decreasing cost of memory and storage also contributed to Hadoop’s success breaking through this bottleneck.
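To make "easier than parallelizing computation manually" a bit more concrete, here is a minimal sketch of the map/shuffle/reduce model that Hadoop's MapReduce is built around. It is plain single-process Python, not Hadoop's API: the function names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical and exist only for illustration. The point is that the programmer writes only the map and reduce functions, and because each one sees a single record or a single key, a framework is free to spread them across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, the step a framework performs between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, so this step could run in parallel.
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    # Each key is reduced independently, so this step could run in parallel too.
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data big hype", "data is exploding"])
```

In this toy version everything runs in one process. What a real cluster adds is distribution, scheduling, and fault tolerance around those same two user-written functions.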
Relational data stores used to be the only real option. These stores were very bad at holding sparse or variable data, and poor at storing documents. They required fixed schemas, which effectively meant you needed to know at write-time how you would use your data in the future. This often doesn’t mesh well with a need to combine disparate data sources, data that is generated by humans, or an agile approach to data development.
Hadoop and NoSQL stores address these use cases by supporting read-time schemas and unstructured data (such as images, audio, and video). Since there is a high correlation between data volume and data variety, Hadoop stands out as a particularly important technology.
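A read-time schema is easier to see than to describe. The sketch below assumes raw events are stored as JSON lines with no fixed set of fields. The sample records, the `read_with_schema` helper, and the two "views" are all hypothetical, made up for illustration rather than taken from any particular store. Nothing about future use is decided at write time: each consumer supplies its own schema when it reads.

```python
import json

# Raw records are stored as-is; they do not share a fixed set of fields.
RAW_EVENTS = [
    '{"user": "ann", "action": "click", "x": 10, "y": 20}',
    '{"user": "bob", "action": "purchase", "amount": "19.99"}',
    '{"user": "cat", "action": "upload", "file": "cat.jpg"}',
]


def read_with_schema(raw_lines, schema):
    """Project each raw record onto `schema` (field name -> converter) at read time.

    Fields missing from a record come back as None instead of causing an error.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else None
            for field, convert in schema.items()
        }


# Two consumers read the same stored data with two different schemas.
revenue_view = list(read_with_schema(RAW_EVENTS, {"user": str, "amount": float}))
activity_view = list(read_with_schema(RAW_EVENTS, {"user": str, "action": str}))
```

The trade-off is that type and missing-field problems show up when the data is read instead of when it is written, which is the price of not committing to a schema up front.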
(By the way, I am not rooting for NoSQL vs. SQL any more than I’d root for drills vs. screwdrivers. Different tools for different jobs.)
There has always been a need for speed, but until recently (for most applications) it was acceptable and common to have a lag between data collection and data availability to the end-user — somewhere between 5 minutes and 24 hours. This was usually due to a complex ETL batch process. This lag started to become intolerable as data volume grew, and the engineering community created a combination of new tools — NoSQL stores for fast writing and reading at very large scale, and stream processing frameworks such as Twitter’s Storm.
Despite being a batch system with a lag, Hadoop has been important to the viability of high-velocity processing. Nathan Marz, creator of Storm, recommends using Storm with Hadoop so that Hadoop serves as the system of record which can recover from errors (human and machine) in a reasonable amount of time.
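The division of labor described above can be sketched in a few lines. This is a toy, single-process illustration and uses neither Storm's nor Hadoop's API. The `PageViewCounts` class and its methods are hypothetical, and it makes the simplifying assumption that the real-time view can be discarded at the moment a batch run finishes. The append-only log plays the role of the batch system of record, and the real-time view covers only the events the last batch run has not yet seen.

```python
from collections import Counter


class PageViewCounts:
    """Toy combination of a lagging batch view and a fast real-time view."""

    def __init__(self):
        self.master_log = []            # append-only system of record (batch side)
        self.batch_view = Counter()     # recomputed from the full log, so it lags
        self.realtime_view = Counter()  # updated per event, covers the lag

    def ingest(self, page):
        """Record the raw event, then update the real-time view immediately."""
        self.master_log.append(page)
        self.realtime_view[page] += 1

    def run_batch(self):
        """Recompute the batch view from scratch and discard the real-time view."""
        self.batch_view = Counter(self.master_log)
        self.realtime_view.clear()

    def query(self, page):
        """Answer with batch results plus whatever arrived since the last batch run."""
        return self.batch_view[page] + self.realtime_view[page]


counts = PageViewCounts()
for page in ["/home", "/home", "/pricing"]:
    counts.ingest(page)
before = counts.query("/home")

counts.realtime_view["/home"] += 5   # simulate a bug in the streaming layer
corrupted = counts.query("/home")

counts.run_batch()                   # recomputation from the raw log repairs it
after = counts.query("/home")
```

Because the batch view is always rebuilt from the raw log, a mistake in the fast path lasts only until the next batch run, which is the error-recovery property the recommendation relies on.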
So we can now process a huge volume of variably structured data at fantastic speed. We should be able to find heretofore unfindable patterns, make money hand over fist, and do great things. Right? Wrong. Why? Two reasons: (1) the tools, especially Hadoop, are still too hard to use, and (2) we don’t have enough people who understand data (data scientists) to extract important observations from our data.
Hadoop is Too Damn Hard
Hadoop was born as open-source software. It has been massively successful but as its popularity has grown, so has its complexity. There are now 11 official Apache Hadoop-related projects. There are also tons of higher-level languages for Hadoop (Pig, Hive, Scalding, Cascading, Scoobi, Crunch, Scrunch, Spark, Cascalog, etc.). Some folks, like Hortonworks, Cloudera, and HP, are actively focused on making Hadoop easier, but multi-day training sessions are still required, and time-to-productivity with Hadoop is usually counted in months.
With the amount of innovation that’s going into Hadoop right now, it’s going to be a long time before it’s a settled, mature technology that is easy to use. But Hadoop is the primary big data enabler and the bottleneck, which is why Yahoo/Oracle/Microsoft/Facebook/IBM/eBay/Twitter/Yelp/LinkedIn/Netflix/Foursquare/Amazon and countless others are hopping on board.
Data Scientist Shortage
The other new bottleneck is the lack of people who understand data and can extract meaning from it. McKinsey anticipates that by 2018 the United States alone will face a shortage of up to 190,000 people with deep analytical skills, which is to say data scientists.
I expect this bottleneck is going to be with us for a long time, and I expect data scientist salaries to reflect this shortage.
Aside from ramping up training and paying high salaries in a competitive marketplace, what else can we do? Off-the-shelf, business-user-friendly machine learning (ML) tools will relieve data scientists of a lot of simpler work. In the next few years I expect to see a lot of growth in ML on very large data sets.
The Hadoops and the Hadoop-Nots
Many companies are already using Hadoop to dramatically improve their bottom line (Walmart, LinkedIn, Yahoo, Sears…). The market is absolutely exploding, and the stakes are high and getting higher, faster and faster.
Which brings us to the big big data freak-out. Many companies are getting left behind on the other side of the big data divide, and are watching their competitors pull further and further ahead. Hadoop is too difficult to use, and they have neither the resources to hire Hadoop expertise nor the time to train it up (and anyway, as Big Data Borat tweeted, "Give man Hadoop cluster he gain insight for a day. Teach man build Hadoop cluster he soon leave for better job."). On top of that, these companies also lack the data scientists necessary to extract meaning from the data. So they feel like they’re drowning in big data and watching the rescue boat slowly drive away.
Tools vendors are also freaking out. Some because they are rushing to claim a piece of this magically growing market, and some because their legacy tools are rapidly losing relevance. It’s a mad scramble to put “Big Data” and “Hadoop” on every tool, knowing that desperate Hadoop-Nots will spend their last dollar to get on the boat.
So we needed a way to talk about this unfolding story: the new bottlenecks, the huge new insights from data, the skyrocketing market, the companies pulling ahead, the companies being left behind, the panic and the euphoria. We called it Big Data.
- K Young
K is CEO of Mortar, an on-demand easy-to-use platform for using Hadoop in the cloud. Interested in big data? Join us! www.mortardata.com/#!/jobs