Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
Both Apache Hadoop and Apache Hive saw new releases this week, and there are a number of interesting technical articles covering YARN, NFS access to HDFS, and Apache Flume. With so much happening so quickly in the Hadoop ecosystem, it can be difficult to keep up, so please let me know if I missed anything, and I’ll include it next week.
Technical
Apache HDFS is getting support for the Network File System (NFS) protocol. This is an exciting new feature, and one of the authors working on it details the what, why, how, and when of Hadoop’s NFS support, which is being developed in trunk.
http://hortonworks.com/blog/simplifying-data-management-nfs-access-to-hdfs/
This week’s newsletter is a little lighter than normal in technical news (some fascinating articles, though!), but there are quite a few interesting releases and upcoming events. Hope you enjoy, and please let me know if you find anything that I missed! Also, thanks to everyone who has been spreading the word about this newsletter. The number of new subscribers each week has been really encouraging.
Technical
LinkedIn has open-sourced a number of big data projects that are built on Hadoop or designed to coexist with it. In celebration of LinkedIn’s 10th anniversary, this post covers 10 of those projects (such as Voldemort and DataFu), including a brief overview of each.
http://www.hadoopsphere.com/2013/05/hadoops-10-in-linkedins-10.html
There were two big and exciting releases this week from Hadoop vendors — Cloudera with Impala and MapR with M7. In addition, this week marks the 500th subscriber to Hadoop Weekly! Thanks everyone for subscribing, and please send anything my way that you think might make a good addition to this newsletter.
Technical
In the third and final part of his “Introduction to Hadoop” series, Tom White covers higher-level frameworks, anatomy of a Hadoop cluster, and data application pipelines. In terms of frameworks, he covers Pig, Hive, and Crunch (there’s a nice example of computing top-K with Crunch).
http://www.drdobbs.com/database/introduction-to-hadoop-real-world-hadoop/240153375/
The last full week of April was pretty busy for the Hadoop ecosystem — two core projects (Hadoop and HBase) saw releases, there was also some exciting funding news (congrats to Qubole!), and there were plenty of interesting technical articles.
Technical
The naming of components in Hadoop-related projects has often caused confusion (e.g. HDFS’ secondary namenode). Apache HBase is no exception: the HMaster is often misunderstood because, contrary to what its name suggests, not all writes go through the HMaster. This article elaborates on the role of the HMaster in HBase.
https://blogs.apache.org/hbase/entry/hbase_who_needs_a_master
There were a number of exciting announcements and releases this week (e.g. Hadoop on OpenStack, Impala 0.7) as well as some fantastic technical articles and tutorials. It’s great to see more technical articles about how folks are doing things with Hadoop — this week covering Hadoop internals, data formats, and MapReduce-based mobile UI customization. A big thanks to those that share their insights and experiences for making this newsletter possible!
This week’s newsletter features fewer releases than normal (let me know if I missed something!) but has a lot of interesting technical articles. In addition, I’m pleased to announce the return of an events section. Thanks to the folks at Mortar Data for curating this list! They’ve found a number of great Hadoop-related events taking place all over the world this week.
Technical
Apache Pig provides support for expressive SQL-like join operations. In this post, Matthew Rathbone shows how to implement a left-outer join in Pig and write a unit test to check for correctness. This is his third article that demos a framework — he previously covered MapReduce and Hive. This trifecta is quite an interesting comparison, so be sure to read all three if you missed the previous articles.
http://blog.matthewrathbone.com/2013/04/07/real-world-hadoop---implementing-a-left-outer-join-in-pig.html
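For readers new to the term, a left outer join keeps every row from the left input and fills in nulls wherever the right input has no match. The sketch below shows those semantics in plain Python rather than Pig Latin. It is not the code from Matthew’s post; the function name `left_outer_join` and the sample user and transaction tuples are made up for illustration.

```python
from collections import defaultdict


def left_outer_join(left, right, left_key, right_key):
    """Join two lists of tuples, keeping every left row.

    Left rows with no match on the right are padded with None,
    mirroring the nulls a left outer join produces.
    """
    # Index the right side by join key so each lookup is cheap.
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)

    # With an empty right side the width is unknown, so no padding is added.
    right_width = len(right[0]) if right else 0
    joined = []
    for row in left:
        matches = index.get(row[left_key])
        if matches:
            # One output row per matching right row.
            for match in matches:
                joined.append(row + match)
        else:
            # No match: keep the left row and pad with None.
            joined.append(row + (None,) * right_width)
    return joined


# Hypothetical sample data: (user_id, location) and (transaction_id, user_id)
users = [(1, "NY"), (2, "UK"), (3, "FR")]
transactions = [(100, 1), (101, 1), (102, 3)]

rows = left_outer_join(users, transactions, left_key=0, right_key=1)
```

User 2 has no transactions, so it still appears in `rows`, padded with `None` values. That unmatched row is exactly the case a unit test for a left outer join should check.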
News
April 2nd marked the 7-year anniversary of the first release of Apache Hadoop. In this post, Doug Cutting (the founder of Hadoop) provides 7 thoughts and predictions for Hadoop. He touches on everything from open source, to the name of the project, to where he sees Hadoop heading in the next 7 years.
http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/
Hadoop Weekly is a new (recurring) guest post by Joe Crobak. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.
News
Apache Hadoop’s Distributed File System and MapReduce were originally based upon research papers written by Google. Google owns a number of patents in these spaces, including 10 related to MapReduce. This week, they pledged “not to sue any user, distributor or developer of open-source software on specified patents, unless first attacked.”
http://google-opensource.blogspot.com/2013/03/taking-stand-on-open-source-and-patents.html
For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.
As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, the amounts of data are often so large that working with them on a traditional SQL database is too slow to be practical.
Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)
This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.
You have MongoDB, so you have this tremendously scalable database. You’re collecting a ton of data, but you know you need to do more with it (okay, a lot more). You think you want to use Hadoop, but it doesn’t sound easy.
To keep it simple, we’ve divided the article into three parts:
Our CEO, K Young, spoke at PyData NYC about using real Python with Pig, and why we integrated these two awesome languages. The audience asked some great questions, many of which you can see at the end of the video.
Here is the video (with slides just below):
As many of you know, we’re building Mortar based on a fundamental belief that big data needs to get easier.
Processing big data has made incredible strides over the past decade. It would be hard to overstate the importance of the MapReduce programming model to this progress. Its simple design breaks work down and recombines it in a series of parallelizable operations making it incredibly scalable – today, Yahoo, Facebook and others run MapReduce jobs on tens of thousands of machines. Since MapReduce expects hardware failures, it can run on inexpensive commodity hardware, sharply lowering the cost of a computing cluster.
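To make the model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps, using word counting as the example. It is plain Python, not Hadoop’s API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine every value seen for one key."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, so a cluster could process the
    # lines on different machines; here map() just runs them in sequence.
    mapped = chain.from_iterable(map(map_phase, lines))
    grouped = shuffle(mapped)
    # Each key is reduced independently as well.
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["big data is big", "data is data"])
```

Because no map call depends on another line and no reduce call depends on another key, both phases can be spread across many machines, which is where the scalability comes from.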
However, although MapReduce puts parallel programming within reach of most professional software engineers, developing MapReduce jobs isn’t exactly easy: (1) they require the programmer to think in terms of “map” and “reduce”, an unintuitive paradigm for most, (2) n-stage jobs can be difficult to manage, and (3) common operations (such as filters, projections, and joins) and rich data types require custom code.
This is why our friend Alan Gates and his former team at Yahoo! developed Apache Pig, which has two components:
- PigLatin – a simple yet powerful high-level data flow language similar to SQL that executes MapReduce jobs. PigLatin is often called simply “Pig”.
- Pig Engine – parses, optimizes, and automatically executes PigLatin scripts as a series of MapReduce jobs on a Hadoop cluster.
So why should you consider using Pig instead of raw MapReduce? Here are 8 big reasons:
On Buzzwords
“Big data” entered our language before anyone knew what it meant. So then we spent a lot of time discussing it: “Is it really about the ‘bigness’?”, “Isn’t it about non-relational data?”, “No wait, it’s about the need for speed.” This got boiled down to the three Vs (volume, variety, velocity), but then “big data” just meant three things, which didn’t clarify much at all.
So we, the tech community, are developing new vocabulary and distinctions, and in 2013, no one is going to say “big data” anymore. (Actually, given that Dilbert already skewered big data, its heyday may already be over.)
This is the life-cycle of any good buzzword. A buzzword is born when something so new and important is happening that we need to talk about it before we understand it; while it is still amorphous. It refers to a family of related concepts. Then we develop greater understanding and distinctions, and pretty soon you’re embarrassed for your colleague when he trots out last year’s buzzword (remember Web 2.0?).
So what is the crux of “big data”? Why is it so new and important that we have to talk about it with a buzzword? In short, we’re all freaking out because old bottlenecks recently got shattered, the new bottlenecks are us and our existing tools, and mad riches are visible just over the horizon. (And it’s not just about riches — there’s also massive potential for human improvement. [1] [2] [3])




