As many of you know, we’re building Mortar based on a fundamental belief that big data needs to get easier.
Processing big data has made incredible strides over the past decade. It would be hard to overstate the importance of the MapReduce programming model to this progress. Its simple design breaks work down and recombines it in a series of parallelizable operations, making it incredibly scalable – today, Yahoo, Facebook, and others run MapReduce jobs on tens of thousands of machines. And since MapReduce expects hardware failures, it can run on inexpensive commodity hardware, sharply lowering the cost of a computing cluster.
However, although MapReduce puts parallel programming within reach of most professional software engineers, developing MapReduce jobs isn’t exactly easy: (1) they require the programmer to think in terms of “map” and “reduce”, an unintuitive paradigm for most, (2) n-stage jobs can be difficult to manage, and (3) common operations (such as filters, projections, and joins) and rich data types require custom code.
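To see why the paradigm feels unintuitive, here is a minimal in-memory sketch of a MapReduce-style word count. The function names and the single-machine shuffle are purely illustrative (a real framework like Hadoop distributes each phase across machines), but even this toy version shows how a trivial aggregation must be contorted into "map" and "reduce" steps:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs -- here, (word, 1) for each word.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine each key's values -- here, summing the counts.
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

Now imagine chaining several of these jobs together, with custom code for every filter and join along the way, and the pain points above become concrete.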
This is why our friend Alan Gates and his former team at Yahoo! developed Apache Pig, which has two components:
- PigLatin – a simple yet powerful high-level data flow language, similar in spirit to SQL, whose scripts are run as MapReduce jobs. PigLatin is often called simply “Pig”.
- Pig Engine – parses, optimizes, and automatically executes PigLatin scripts as a series of MapReduce jobs on a Hadoop cluster.
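For a flavor of what this looks like in practice, here is a short PigLatin script (the file names and schema are illustrative). Each line is a high-level relational operation – a filter, a group, an aggregate – that would require hand-written map and reduce code in raw MapReduce, and the Pig Engine compiles the whole script into MapReduce jobs for you:

```pig
-- Load tab-delimited user records (file name and schema are illustrative).
users  = LOAD 'users.tsv' AS (name:chararray, age:int);
-- Filter and aggregate in one line apiece -- no custom map/reduce code.
adults = FILTER users BY age >= 18;
byName = GROUP adults BY name;
counts = FOREACH byName GENERATE group, COUNT(adults);
STORE counts INTO 'adult_counts';
```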
So why should you consider using Pig instead of raw MapReduce? Here are 8 big reasons: