One of our primary goals at Mortar is to enable any engineer or data scientist to work with data at scale without having to deal with the complexities of a distributed system like Hadoop. Using Pig to develop data flow jobs goes a long way toward achieving this goal, but there are times when you need to look under the hood to see what's happening with your job. Tools like Lipstick help you understand what your job is doing, but it can still be difficult to see why your job is failing or not making progress at the MapReduce level. Today we're happy to announce a new feature set in Mortar that lets you drill down and better understand what's happening within individual MapReduce jobs.
As many of you know, we’re building Mortar based on a fundamental belief that big data needs to get easier.
Processing big data has made incredible strides over the past decade. It would be hard to overstate the importance of the MapReduce programming model to this progress. Its simple design breaks work down and recombines it in a series of parallelizable operations making it incredibly scalable – today, Yahoo, Facebook and others run MapReduce jobs on tens of thousands of machines. Since MapReduce expects hardware failures, it can run on inexpensive commodity hardware, sharply lowering the cost of a computing cluster.
However, although MapReduce puts parallel programming within reach of most professional software engineers, developing MapReduce jobs isn’t exactly easy: (1) they require the programmer to think in terms of “map” and “reduce”, an unintuitive paradigm for most, (2) n-stage jobs can be difficult to manage, and (3) common operations (such as filters, projections, and joins) and rich data types require custom code.
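To make the "map" and "reduce" paradigm concrete, here is a minimal in-process sketch of a MapReduce-style word count. The helper names and the single-machine shuffle are illustrative only; a real Hadoop job would implement `Mapper` and `Reducer` classes against Hadoop's Java API and run across a cluster.

```python
# Hypothetical single-process sketch of the MapReduce model.
# map -> shuffle/sort by key -> reduce, all simulated in memory.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    # "map": emit a (key, value) pair for every word in the input
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # shuffle/sort: group emitted pairs by key,
    # then "reduce": sum the counts for each key
    for key, group in groupby(sorted(pairs, key=itemgetter(0)),
                              key=itemgetter(0)):
        yield (key, sum(value for _, value in group))

counts = dict(reduce_phase(map_phase(["to be or", "not to be"])))
# counts == {"be": 2, "not": 1, "or": 1, "to": 2}
```

Even this toy version shows the friction: a trivial aggregation requires the programmer to restructure the problem into key/value emissions and a grouped reduction, which is exactly the boilerplate Pig abstracts away.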
This is why our friend Alan Gates and his former team at Yahoo! developed Apache Pig, which has two components:
- PigLatin – a simple yet powerful high-level data flow language similar to SQL that executes MapReduce jobs. PigLatin is often called simply “Pig”.
- Pig Engine – parses, optimizes, and automatically executes PigLatin scripts as a series of MapReduce jobs on a Hadoop cluster.
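To give a feel for the language, here is a hypothetical PigLatin script (the paths and schemas are illustrative, not from a real dataset) that filters, joins, and aggregates two inputs. Each of these operations would require custom code in raw MapReduce; in Pig they are one-liners, and the Pig Engine compiles the whole script into the appropriate chain of MapReduce jobs.

```pig
-- Illustrative example: count page visits by adult users.
users  = LOAD 's3://example/users'  AS (user_id:int, age:int);
visits = LOAD 's3://example/visits' AS (user_id:int, url:chararray);

adults = FILTER users BY age >= 18;                 -- filter
joined = JOIN adults BY user_id, visits BY user_id; -- join
by_url = GROUP joined BY url;                       -- group
counts = FOREACH by_url GENERATE                    -- project + aggregate
             group AS url, COUNT(joined) AS n;

STORE counts INTO 's3://example/output';
```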
So why should you consider using Pig instead of raw MapReduce? Here are 8 big reasons: