Writing Java functions for Pig can be daunting. While it ought to be simple (it’s just Java, right?), there isn’t a clear and easy path to writing Loaders and UDFs from scratch; the documentation and examples are scattered or assume prior knowledge, and the project setup is challenging.
To make it drop-dead easy to write UDFs and Loaders, we’ve created a maven project template with example classes for writing Pig functions. There are example UDF and Loader classes, as well as templates for various flavors of Pig Loader. The project builds and generates a usable jar file without any changes needed.
Our second NYC Data Science Meetup featured Tumblr data scientist Adam Laiacano, who discussed the analytics stack at Tumblr and the tools he and his team use to organize and analyze data.
Here are the video and slides from Adam’s talk, which cover Tumblr’s use of Scribe, Hive & Pig, Hue, and Vowpal Wabbit:
For anyone who came of programming age before cloud computing burst onto the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language, writing SQL can resemble solving a logic puzzle more than coding, and yet SQL and the relational databases it builds on have been the pervasive standard for how we deal with data.
As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, often the amounts of data are so large that working with them on a traditional SQL database is so non-performant as to be impractical.
Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)
This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.
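As a first taste of that familiarity, here is a common SQL aggregation alongside a Pig Latin equivalent (a sketch: the file path and schema are made up for illustration):

```pig
-- SQL: SELECT user_id, COUNT(*) FROM events GROUP BY user_id;
-- A hypothetical Pig equivalent, loading from a tab-delimited file:
events  = LOAD 's3://my-bucket/events.tsv'
          AS (user_id:chararray, event_type:chararray);
by_user = GROUP events BY user_id;
counts  = FOREACH by_user
          GENERATE group AS user_id, COUNT(events) AS num_events;
DUMP counts;
```

The shape is familiar (load, group, aggregate), but note the divergence: each step is a named relation in a dataflow, rather than one nested declarative statement.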
You have MongoDB, so you have this tremendously scalable database. You’re collecting a ton of data, but you know you need to do more with it (okay, a lot more). You think you want to use Hadoop, but it doesn’t sound easy.
To keep it simple, we’ve divided the article into three parts:
"WHY" explains the reasons for using Hadoop to process data stored in MongoDB
"HOW" helps you get set up
"DEMO" shows you MongoDB and Hadoop working together. If you’re a tl;dr type, you’ll want to start with this section.
Working with data is HARD. Let’s face it, you’re brave to even attempt it, let alone make it your everyday job.
Fortunately, some incredibly talented people have taken the time to compile and share their deep knowledge for you.
Here are 7 books we recommend for picking up some new skills in 2013:
Mortar co-founder Jeremy Karn gave this talk on using MongoDB data with Hadoop (and specifically with Apache Pig) at MongoSV.
Jeremy’s presentation covers the steps needed to read JSON from Mongo into Pig, parallel process it on Hadoop with sophisticated functions, and write back to Mongo.
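In outline, that flow looks something like the following sketch, which assumes the connector's `MongoLoader` and `MongoInsertStorage` classes; the jar paths, connection strings, and field names are all illustrative:

```pig
-- Register the mongo-hadoop connector jars (paths are illustrative)
REGISTER 'mongo-java-driver.jar';
REGISTER 'mongo-hadoop-core.jar';
REGISTER 'mongo-hadoop-pig.jar';

-- Read JSON documents from a Mongo collection into Pig
raw = LOAD 'mongodb://localhost:27017/mydb.events'
      USING com.mongodb.hadoop.pig.MongoLoader('id:chararray, score:int');

-- Parallel-process on Hadoop: keep only the high scores
high = FILTER raw BY score > 90;

-- Write the results back to Mongo
STORE high INTO 'mongodb://localhost:27017/mydb.high_scores'
      USING com.mongodb.hadoop.pig.MongoInsertStorage('');
```

Everything between the LOAD and the STORE is ordinary Pig, so the sophisticated-processing step can be arbitrarily complex while the Mongo plumbing stays at the edges.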
Jeremy was a big part of our contributions to the Mongo Hadoop connector, which we extended to work with Pig. MongoDB creator (and 10gen founder) Dwight Merriman also gave Mortar a nice shout out.
Our CEO, K Young, spoke at PyData NYC about using real Python with Pig, and why we integrated these two awesome languages. The audience asked some great questions, many of which you can see at the end of the video.
Here is the video (with slides just below):
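To give a flavor of what the integration looks like in practice, a Pig script can register a file of Python functions and call them as UDFs. This is a sketch: the file name, alias, and function are hypothetical, and the `streaming_python` engine shown here is Mortar's CPython mechanism (stock Pig registers Python UDFs via Jython instead):

```pig
-- Register a file of Python functions as Pig UDFs
-- (Mortar-style CPython; stock Pig would use 'jython' here)
REGISTER 'my_udfs.py' USING streaming_python AS my_udfs;

songs  = LOAD 'songs.tsv' AS (title:chararray, plays:int);
scored = FOREACH songs GENERATE title, my_udfs.popularity_score(plays);
```

The payoff is that `popularity_score` is real Python, with access to real Python libraries, applied in parallel across the cluster.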
As many of you know, we’re building Mortar based on a fundamental belief that big data needs to get easier.
Processing big data has made incredible strides over the past decade. It would be hard to overstate the importance of the MapReduce programming model to this progress. Its simple design breaks work down and recombines it in a series of parallelizable operations making it incredibly scalable – today, Yahoo, Facebook and others run MapReduce jobs on tens of thousands of machines. Since MapReduce expects hardware failures, it can run on inexpensive commodity hardware, sharply lowering the cost of a computing cluster.
However, although MapReduce puts parallel programming within reach of most professional software engineers, developing MapReduce jobs isn’t exactly easy: (1) they require the programmer to think in terms of “map” and “reduce”, an unintuitive paradigm for most, (2) n-stage jobs can be difficult to manage, and (3) common operations (such as filters, projections, and joins) and rich data types require custom code.
This is why our friend Alan Gates and his former team at Yahoo! developed Apache Pig, which has two components:
- PigLatin – a simple yet powerful high-level data flow language, similar to SQL, that compiles down to MapReduce jobs. PigLatin is often called simply “Pig”.
- Pig Engine – parses, optimizes, and automatically executes PigLatin scripts as a series of MapReduce jobs on a Hadoop cluster.
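For example, a few lines of PigLatin replace what would otherwise be a hand-written, multi-stage MapReduce program; the Pig Engine compiles the script into the necessary jobs automatically (the file and field names here are illustrative):

```pig
-- Total bytes served per URL: a full MapReduce job in four lines
logs    = LOAD 'access_logs' AS (url:chararray, bytes:long);
grouped = GROUP logs BY url;
traffic = FOREACH grouped
          GENERATE group AS url, SUM(logs.bytes) AS total_bytes;
STORE traffic INTO 'traffic_by_url';
```

The grouping, the shuffle, and the aggregation all map onto MapReduce stages, but the programmer never has to think in terms of “map” and “reduce”.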
So why should you consider using Pig instead of raw MapReduce? Here are 8 big reasons: