One of our primary goals at Mortar is to enable any engineer or data scientist to work with data at scale without having to deal with the complexities of a distributed system like Hadoop. Using Pig to develop data flow jobs goes a long way toward achieving this goal, but there are times when you need to look under the hood to see what’s happening with your job. Tools like Lipstick help you understand what your job is doing, but it can still be difficult to see why your job is failing or stalled at the MapReduce level. Today we’re happy to announce a new feature set in Mortar that lets you drill down and better understand what’s happening within individual MapReduce jobs.



Apache Pig

There’s an old saying that explains why you should never wrestle with a pig. The adage, sometimes attributed to playwright George Bernard Shaw, warns that “you get dirty, and besides, the pig likes it.”

Of course, around the Mortar offices, and in data science circles, “Pig” refers to Apache Pig—a programming language designed for analyzing large data sets—more often than it refers to our porcine friends. But we don’t think you should have to wrestle with Pig, either. In fact, you should be bossing that Pig around. That’s why we created a handy, compact reference to speed your development along.


Now that 2013 is coming to a close, we’ve been doing a lot of reflecting. It has been an awesome year at Mortar, and we’ve truly enjoyed trying to bring you the very best Hadoop, Pig, and data science content.

We know it’s tough to keep up with every blog post we write (not to mention the dozens of other blog posts you’ve still got saved for later), so as thanks for keeping tabs on us, we wanted to share our most popular posts from 2013. If you’ve read all of these already, well, I guess we’ll have to get you something even nicer in the New Year.



For a long time, data scientists and engineers had to choose between leveraging the power of Hadoop and using Python’s amazing data science libraries (like NLTK, NumPy, and SciPy). It was a painful decision, and one we thought should be eliminated.

So about a year ago, we solved this problem by extending Pig to work with CPython, allowing our users to take advantage of Hadoop with real Python (see our presentation here). To say Mortar users have loved that combination would be an understatement.
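To give a flavor of what this looks like, here is a sketch of a CPython UDF. The function below and the registration snippet in its comments are invented for illustration (check the Mortar and Pig docs for the exact syntax); the point is that the UDF is plain Python and can call into any Python library.

```python
# A hypothetical CPython Pig UDF: given a line of text, return its most
# frequent word. In a Pig script it would be wired up with something like
# (syntax is an assumption, not verbatim from the docs):
#   REGISTER 'my_udfs.py' USING streaming_python AS my_udfs;
#   words = FOREACH lines GENERATE my_udfs.top_word(line);
from collections import Counter

def top_word(line):
    """Return the most common word in a line of text, or None if empty."""
    words = line.lower().split()
    if not words:
        return None
    return Counter(words).most_common(1)[0][0]

print(top_word("the quick brown fox jumps over the lazy dog"))  # "the"
```

Because it is real CPython rather than Jython, the body of `top_word` could just as easily call NLTK or NumPy.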

However, only Mortar users could use Pig and real Python together…until now.


Following up on his excellent talk on Pig vs. MapReduce, Donald Miner spoke to the NYC Data Science Meetup about using Hadoop for data science. If you can set aside the time to watch, it’s a terrific and detailed talk. However, if you’re pressed for time, you can use our time-stamped summary to skip to specific sections. (Quick note: The video ran out towards the end of the Q&A, but the audio is still perfect.)

Here’s the summary of Don’s talk, with the video, slides, and the full transcript below:



Netflix kicked off the first session at this summer’s Hadoop Summit, telling the crowd about the Hadoop stack that powers its world-renowned data science practice. The punchline: they run everything on the Amazon Web Services cloud—Amazon S3, Elastic MapReduce (EMR), and their platform-as-a-service, Genie.

Putting S3 at the base of your Hadoop strategy, as Netflix and Mortar have, catapults you past many of the Hadoop headaches others will face. No running out of storage unexpectedly: you get (essentially) infinite, low-cost storage from S3, with frequent price cuts. No need to worry about your data: Amazon estimates they might lose one of your objects every 10 million years or so. And best of all, no waiting in line behind other people’s slow jobs: spin up your own personal cluster whenever you want and point it at the same underlying S3 files.
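That durability figure follows from S3’s advertised 99.999999999% (“eleven nines”) annual durability per object. A quick back-of-the-envelope check (the object count below is our assumption, chosen to match the 10-million-year figure):

```python
# Back-of-the-envelope durability math (illustrative only).
# Eleven nines of annual durability means a 1e-11 chance of losing
# any given object in a year.
annual_loss_prob = 1e-11
num_objects = 10_000  # assumed object count for this example

expected_losses_per_year = annual_loss_prob * num_objects
years_per_lost_object = 1 / expected_losses_per_year
print(years_per_lost_object)  # ~10 million years
```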

A lot of these benefits come directly from S3. It’s a pretty magical technology, and we use it extensively at Mortar. There are some tricks we’ve learned to get the best performance out of it in conjunction with Hadoop. I’m going to share those with you now; some can improve your performance 10X or more.



Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.

Two of the most prominent Hadoop distributions, Cloudera’s CDH and Hortonworks’ HDP, both saw releases this week. There are a few interesting new projects and some details on recent releases (Hive and SyncSort), as well as the normal slew of technical articles about various components in the ecosystem (Zookeeper, Cassandra, HBase). We’re also celebrating the 20th issue of Hadoop Weekly with our 600th subscriber — thanks for spreading the word!

Technical
Zookeeper provides a set of powerful primitives for distributed consensus and locking, but there are a lot of edge cases and gotchas to consider when using it. The Apache incubator project Curator is a framework that addresses most of those edge cases and also implements several common recipes. This blog post talks about some of the edge cases that Curator addresses, which should motivate you to use it rather than the Zookeeper API directly.
http://blog.cloudera.com/blog/2013/05/zookeeper-made-simpler/



There were a lot of exciting announcements this week, including Hortonworks announcing General Availability of the HDP for Windows and Concurrent announcing its new Pattern framework for machine learning on Hadoop. There are also a bunch of interesting technical articles about recent releases — Phoenix, HUE, Kiji, CQL, and more. Hope you enjoy!

Technical
Phoenix is a SQL layer atop Apache HBase from Salesforce. The latest release includes support for skip scans, which improve performance 3x-20x over a batched get. Skip scans use information about the query’s key range to perform server-side skips over uninteresting parts of the key range (the exact details are a bit more complex, and there’s a good overview in this post). In addition to an overview, they include a performance analysis for a few different dataset characteristics.
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html
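Phoenix’s real implementation lives in HBase server-side filters, but the core idea of a skip scan can be sketched in a few lines: instead of reading every row between the start and end key, seek directly to the next key that could possibly match. Here is a toy model over a sorted list of composite keys (all data and names are invented for illustration):

```python
from bisect import bisect_left

def skip_scan(rows, lo, hi):
    """Toy skip scan over rows sorted by composite key (prefix, suffix).

    Returns all rows whose suffix falls in [lo, hi], seeking past
    uninteresting key ranges instead of scanning every row.
    """
    out, i, n = [], 0, len(rows)
    while i < n:
        prefix = rows[i][0]
        # Seek directly to the first candidate key within this prefix.
        i = bisect_left(rows, (prefix, lo), i)
        while i < n and rows[i][0] == prefix and rows[i][1] <= hi:
            out.append(rows[i])
            i += 1
        # Skip the rest of this prefix's key range entirely.
        i = bisect_left(rows, (prefix, float("inf")), i)
    return out

rows = sorted([("a", 1), ("a", 5), ("a", 9), ("b", 4), ("b", 8), ("c", 2)])
print(skip_scan(rows, 4, 8))  # [('a', 5), ('b', 4), ('b', 8)]
```

The seeks stand in for HBase region-server `seek` operations; with wide key ranges and selective filters, avoiding the full scan is where the 3x-20x speedup comes from.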



Both Apache Hadoop and Apache Hive shipped new releases this week, and there are a number of interesting technical articles covering YARN, NFS access to HDFS, and Apache Flume. With so much happening so quickly in the Hadoop ecosystem, it can be difficult to keep up — so please let me know if I missed anything, and I’ll include it next week.

Technical
Apache HDFS is getting support for the Network File System (NFS) protocol. This is an exciting new feature, and one of the authors working on it details the what, why, how, and when of Hadoop’s NFS support, which is being developed in trunk.
http://hortonworks.com/blog/simplifying-data-management-nfs-access-to-hdfs/



This week’s newsletter is a little lighter than normal on technical news (some fascinating articles, though!), but there are quite a few interesting releases and upcoming events. Hope you enjoy, and please let me know if you find anything that I missed! Also, thanks to everyone who has been spreading the word about this newsletter — the number of new subscribers each week has been really encouraging.

Technical
LinkedIn has open-sourced a number of big data projects built on Hadoop or designed to coexist with it. In celebration of LinkedIn’s 10th anniversary, this post covers 10 of those projects (such as Voldemort and DataFu), with a brief overview of each.
http://www.hadoopsphere.com/2013/05/hadoops-10-in-linkedins-10.html



There were two big and exciting releases this week from Hadoop vendors — Cloudera with Impala and MapR with M7. In addition, this week marks the 500th subscriber to Hadoop Weekly! Thanks everyone for subscribing, and please send anything my way that you think might make a good addition to this newsletter.

Technical
In the third and final part of his “Introduction to Hadoop” series, Tom White covers higher-level frameworks, anatomy of a Hadoop cluster, and data application pipelines. In terms of frameworks, he covers Pig, Hive, and Crunch (there’s a nice example of computing top-K with Crunch).
http://www.drdobbs.com/database/introduction-to-hadoop-real-world-hadoop/240153375/
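Tom’s Crunch example is Java, but the shape of a top-K computation is easy to see in miniature: each partition computes a local top-K, and a final step merges those lists. The sketch below (with invented data) shows the idea; it assumes each item’s count already lives in a single partition, as it would after a count-aggregation step.

```python
import heapq

def top_k(counts, k):
    """Return the k highest-count (item, count) pairs, highest first."""
    return heapq.nlargest(k, counts, key=lambda pair: pair[1])

def merge_top_k(partials, k):
    """Merge per-partition top-k lists into a global top-k, as the
    final reduce of a distributed top-K job would."""
    return top_k([pair for partial in partials for pair in partial], k)

part1 = top_k([("pig", 7), ("hive", 3), ("hbase", 5)], 2)
part2 = top_k([("hadoop", 9), ("flume", 1)], 2)
print(merge_top_k([part1, part2], 2))  # [('hadoop', 9), ('pig', 7)]
```

The key property making this correct in a distributed setting is that any item in the global top-K must appear in some partition’s local top-K, so the merge never misses a winner.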



The last full week of April was pretty busy for the Hadoop ecosystem: two core projects (Hadoop and HBase) saw releases, there was some exciting funding news (congrats to Qubole!), and there were plenty of interesting technical articles.

Technical

The naming of components in Hadoop-related projects has often caused confusion (e.g. HDFS’ secondary namenode). Apache HBase is no exception — the HMaster is often misunderstood because, contrary to what its name suggests, not all writes go through it. This article elaborates on the role of the HMaster in HBase.
https://blogs.apache.org/hbase/entry/hbase_who_needs_a_master



There were a number of exciting announcements and releases this week (e.g. Hadoop on OpenStack, Impala 0.7) as well as some fantastic technical articles and tutorials. It’s great to see more technical articles about how folks are doing things with Hadoop — this week covering Hadoop internals, data formats, and MapReduce-based mobile UI customization. A big thanks to those that share their insights and experiences for making this newsletter possible!



This week’s newsletter features fewer releases than normal (let me know if I missed something!) but has a lot of interesting technical articles. In addition, I’m pleased to announce the return of an events section. Thanks to the folks at Mortar Data for curating this list! They’ve found a number of great Hadoop-related events taking place all over the world this week.

Technical
Apache Pig provides support for expressive SQL-like join operations. In this post, Matthew Rathbone shows how to implement a left-outer join in Pig and write a unit test to check for correctness. This is his third article that demos a framework — he previously covered MapReduce and Hive. This trifecta is quite an interesting comparison, so be sure to read all three if you missed the previous articles.
http://blog.matthewrathbone.com/2013/04/07/real-world-hadoop—-implementing-a-left-outer-join-in-pig.html
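Pig’s `JOIN users BY id LEFT OUTER, visits BY id` has the same semantics as SQL’s LEFT OUTER JOIN: every left-side row survives, matched or not. In plain Python the logic looks like this (the relation and field names are made up for the example):

```python
def left_outer_join(left, right, key):
    """Left-outer join two lists of dicts on `key`, mirroring
    Pig's `JOIN left BY key LEFT OUTER, right BY key`."""
    # Index the right side by join key, keeping all matches per key.
    by_key = {}
    for row in right:
        by_key.setdefault(row[key], []).append(row)

    joined = []
    for row in left:
        matches = by_key.get(row[key])
        if matches:
            for match in matches:
                joined.append({**row, **match})
        else:
            joined.append({**row})  # unmatched left rows survive
    return joined

users = [{"id": 1, "name": "ada"}, {"id": 2, "name": "bob"}]
visits = [{"id": 1, "page": "/home"}]
print(left_outer_join(users, visits, "id"))
# [{'id': 1, 'name': 'ada', 'page': '/home'}, {'id': 2, 'name': 'bob'}]
```

Rathbone’s post shows how to express exactly this in Pig and then unit-test it; a hand-rolled version like the above is what his MapReduce article builds in Java.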


Happy 7th birthday to Apache Hadoop! The first release of Hadoop was made in April 2006. Fittingly, this week’s newsletter spans many parts of the Hadoop ecosystem. It’s quite impressive how far the project and its ecosystem have come in those 7 short years!

News
April 2nd marked the 7-year anniversary of the first release of Apache Hadoop. In this post, Doug Cutting (the founder of Hadoop) offers 7 thoughts and predictions about Hadoop. He touches on everything from open source, to the name of the project, to where he sees Hadoop heading in the next 7 years.
http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/
