You have MongoDB, so you have this tremendously scalable database. You’re collecting a ton of data, but you know you need to do more with it (okay, a lot more). You think you want to use Hadoop, but it doesn’t sound easy.
To keep it simple, we’ve divided the article into three parts:
- "WHY" explains the reasons for using Hadoop to process data stored in MongoDB
- "HOW" helps you get set up
- "DEMO" shows you MongoDB and Hadoop working together. If you’re a tl;dr type, you’ll want to start with this section.
Mongo was built for data storage and retrieval, and Hadoop was written for data processing. So naturally, data processing is often better offloaded to Hadoop. Here’s why:
Easier, more expressive languages
MongoDB supports native MapReduce, but MapReduce is a pain in the ass. The Hadoop community has created Pig, Hive, and the Cascading family of languages—all of which compile to MapReduce, but are expressive and high-level.
Libraries to build on
Big performance improvements
Hadoop is purpose-built for fully distributed, multi-threaded execution of data processing, so it performs much, much better for this sort of work.
Separate workloads mean less load
If you’re doing significant data processing on MongoDB, it can add substantial load. You may need an order of magnitude more power to process data than to store it, so it works really well to separate those concerns and separate those workloads.
So if you want to easily write distributed jobs that perform well and don’t add load to your primary storage system, Hadoop is probably the way to go.
Bonus: We’ve included ready-to-use Hadoop code to extract the “schema” of your MongoDB, and characterize how that schema is used in the demo section.
We used Mortar for this demo because it’s free for this purpose, and you won’t need to set up any infrastructure. Mortar is an open source framework for easily writing/developing your Hadoop jobs coupled with Hadoop-as-a-Service for running Mortar jobs on Hadoop clusters. If you’re going to use Mortar, skip to the DEMO section.
Otherwise, you need:
There are many ways to run Hadoop; here are a couple:
Set up your own cluster: You’ll need some machines and you’ll need to follow these instructions. Warning: this is not for the faint of heart, and should probably be reserved for companies with substantial resources and serious sysadmin chops.
Use Amazon Elastic MapReduce (EMR): EMR is an offering by Amazon Web Services (AWS) that allows you to run a Hadoop-based job on a Hadoop cluster in the cloud. Aside from all of the typical cloud benefits that you get from doing this, you also get to skip the setup and configuration of a Hadoop cluster. There’s a step-by-step guide to setting up EMR here.
Data Processing Language
In this demo you’ll process your data on Hadoop with Apache Pig, a high-level data flow language that compiles down into Hadoop MapReduce jobs. It was designed to be easy to learn and simple to write. If you’ve written SQL, Pig will feel familiar—it is like procedural SQL. For more details on Pig, check out “8 reasons you should be using Apache Pig”.
If you don’t want to write a Pig script and would prefer to stick with raw Hadoop MapReduce jobs, the Mongo-Hadoop project will support that, but we won’t cover raw MapReduce in this article since it requires 10x more code without much benefit.
OK, now it’s time to choose an output destination for your processed data. You have a lot of options here:
If you want to write your final results back to MongoDB, the Mongo-Hadoop project also contains support for this with MongoStorage.
For documentation on how to do this read the “Storing Data to MongoDB” section under “Using MongoDB Data”.
Amazon S3 is one of the most useful output locations. If you’re running your Hadoop job on a cluster that’s running in AWS, your data transfer will be extremely fast and you get all of the benefits of storing your data in the cloud.
If you’re running your own long-lived cluster, you can write the results to the cluster’s distributed file system. This can be useful if you’re going to be doing more processing on the data later.
The fun part! Here’s a quick step-by-step example that should take just a few minutes.
To get started, we’ve already set up a small MongoDB instance on MongoLab, populated it with a random sampling of Twitter data from a single day (around 120,000 tweets), and created a read-only user for you.
We’ve also set up a public GitHub repo with a Mortar project that has three Pig scripts ready to run. Here’s what you need to do:
1. If you don’t already have a free GitHub account, create one. You’ll need a GitHub username in step 4.
2. Sign into (or create) your free Mortar account. After you receive the confirmation email, log into Mortar at https://app.mortardata.com.
3. Install the Mortar Development Framework:
$ gem install mortar
(full installation details here)
4. Clone the example git project and register it as a Mortar project:
$ git clone email@example.com:mortardata/mongo-pig-examples.git
$ cd mongo-pig-examples
$ mortar register mongo-pig-examples
Script 1 - Characterize Collection
If you’re like most MongoDB users, you may not have a great sense of the different fields, data types, or values in your collection. We built characterize_collection.pig to deeply inspect your collection to extract that information.
From the base directory of the mongo-pig-examples project you just cloned, take a look at pigscripts/characterize_collection.pig. It loads all the data in the collection as a map, sends the map to Python (udfs/python/mongo_util.py) to gather metadata, calculates some basic information about the collection, and then writes the results out to an S3 bucket.
To see this script in action, let’s run it on a 4-node Hadoop cluster. In your terminal (from the base directory of your mongo-pig-examples project) run:
$ mortar run characterize_collection --clustersize 4
This job will take about 10 minutes to finish. You can monitor the job’s status on the command line or by going to https://app.mortardata.com/jobs.
Once the job has finished, you’ll receive an email with a link to your job results. Clicking on this link will bring you into the Mortar web app, where you can download the results from S3. The output format is described at the top of the characterize_collection script. As an example, if you scroll down the output you’ll find:
…
user.is_translator  2    false  unicode  118806
user.is_translator  2    true   unicode  31
user.lang           26   en     unicode  114108
user.lang           26   es     unicode  3462
user.lang           26   fr     unicode  532
user.lang           26   pt     unicode  281
user.lang           26   ja     unicode  79
user.listed_count   398  0      int      73757
user.listed_count   398  1      int      18518
…
Looking at the values for user.lang, we see that there are 26 unique values for the field in our dataset. The most common was “en” with 114108 occurrences, the next most common was “es” with 3462 occurrences, and so on. To see the full results without running the job you can view the output file here.
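If you want a feel for the kind of summary that output represents, here’s a minimal Python sketch that produces similar rows from a handful of in-memory documents. It is not the code in udfs/python/mongo_util.py; the function names `flatten` and `characterize` and the sample documents are made up for illustration, and it runs locally rather than on Hadoop.

```python
from collections import Counter, defaultdict


def flatten(doc, prefix=""):
    """Yield (dotted_field_name, value) pairs for every leaf of a nested dict."""
    for key, value in doc.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, name)
        else:
            yield name, value


def characterize(docs, top_n=5):
    """Return rows of (field, unique_values, value, type_name, occurrences)."""
    counts = defaultdict(Counter)
    for doc in docs:
        for name, value in flatten(doc):
            # Key on (type name, string form) so lists stay hashable
            # and 1 vs. True are counted separately.
            counts[name][(type(value).__name__, str(value))] += 1

    rows = []
    for name in sorted(counts):
        unique = len(counts[name])
        # Most frequent values first, at most top_n per field.
        for (type_name, value), n in counts[name].most_common(top_n):
            rows.append((name, unique, value, type_name, n))
    return rows


if __name__ == "__main__":
    # Hypothetical sample documents, shaped loosely like tweets.
    sample = [
        {"user": {"lang": "en", "listed_count": 0}},
        {"user": {"lang": "en", "listed_count": 1}},
        {"user": {"lang": "es", "listed_count": 0}},
    ]
    for row in characterize(sample):
        print(*row, sep="\t")
```

Each row has the same five columns as the sample output: field name, number of unique values, an example value, its type, and how many times that value occurred.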
Script 2 - Mongo Schema Generator
It can be tricky to properly declare Mongo’s highly nested schemas in Pig. Now, Pig is forgiving: it can work without a schema, or with inconsistent or incorrect schemas. But it’s easier to read and write your Pig code if you have a schema, because it allows you (and the Pig optimizer) to focus on just the relevant data.
So this next script automatically generates a Pig schema by examining your MongoDB collection. If you don’t need the whole schema, you can easily edit it to keep just the fields you want.
Running this script is similar to running the previous one. If you ran the Characterize Collection script in the past hour, the same cluster you used for that job should still be running. In that case, you can just run:
$ mortar run mongo_schema_generator
If you don’t have a cluster that’s still running, just run the job on a new 4-node cluster like this:
$ mortar run mongo_schema_generator --clustersize 4
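To make the idea of deriving a Pig schema from a document concrete, here’s a rough Python sketch that walks one sample document and emits a Pig-style schema string. It is not what mongo_schema_generator does internally. The helper names `pig_type` and `pig_fields` are hypothetical, the type mapping (integers to long, unknowns to bytearray, lists typed from their first element) is an assumption, and a real generator would need to look at many documents rather than one.

```python
def pig_type(value):
    """Map a Python value to a Pig type declaration (without the field name)."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"  # assumption: long is safer than int for large ids
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "chararray"
    if isinstance(value, dict):
        return "tuple(" + pig_fields(value) + ")"
    if isinstance(value, list) and value:
        first = value[0]
        if isinstance(first, dict):
            return "bag{t:tuple(" + pig_fields(first) + ")}"
        # Scalars get wrapped in a one-field tuple named "value" (made-up name).
        return "bag{t:tuple(value:" + pig_type(first) + ")}"
    # None, empty lists, and anything else: fall back to Pig's untyped default.
    return "bytearray"


def pig_fields(doc):
    """Render a dict as comma-separated Pig field declarations."""
    return ", ".join(f"{key}:{pig_type(value)}" for key, value in doc.items())


if __name__ == "__main__":
    # Hypothetical sample document, shaped loosely like a tweet.
    tweet = {
        "text": "need coffee",
        "retweet_count": 3,
        "user": {"lang": "en", "verified": False},
        "entities": {"hashtags": [{"text": "coffee"}]},
    }
    print(pig_fields(tweet))
```

Trimming the result to the fields you care about is then just a matter of deleting declarations from the string.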
Script 3 - Twitter Hourly Coffee Tweets
Using the pigscripts/hourly_coffee_tweets.pig script, we’re going to demonstrate how we can use a small subset of the fields in our MongoDB collection. For our example, we’ll look at how often the word “coffee” is tweeted throughout the day. As with the Mongo Schema Generator script, you can run this job on an existing cluster or start up a new one.
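The logic behind that kind of hourly count is simple enough to sketch in a few lines of Python. This is an illustration of the idea, not a translation of hourly_coffee_tweets.pig. It assumes each tweet has a `text` field and a `created_at` field in Twitter’s classic timestamp format, and the function name `hourly_mentions` is made up.

```python
from collections import Counter
from datetime import datetime

# Assumed timestamp format, e.g. "Mon Sep 24 03:35:21 +0000 2012".
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def hourly_mentions(tweets, word="coffee"):
    """Count tweets whose text contains `word`, grouped by hour of day (0-23)."""
    counts = Counter()
    for tweet in tweets:
        text = tweet.get("text") or ""
        # Case-insensitive substring match, so "coffees" also counts.
        if word in text.lower():
            created = datetime.strptime(tweet["created_at"], TWITTER_TIME_FORMAT)
            counts[created.hour] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Hypothetical sample tweets.
    sample = [
        {"created_at": "Mon Sep 24 03:35:21 +0000 2012", "text": "Need COFFEE now"},
        {"created_at": "Mon Sep 24 03:50:00 +0000 2012", "text": "coffee time"},
        {"created_at": "Mon Sep 24 14:00:00 +0000 2012", "text": "tea please"},
        {"created_at": "Mon Sep 24 15:10:00 +0000 2012", "text": "more coffee"},
    ]
    print(hourly_mentions(sample))
```

Only two fields are read from each tweet, which is the point of the example: you rarely need the whole document to answer a question like this.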
If you already have a MongoDB instance or cluster based in US-East EC2, the first two example scripts should run on one of your collections with only minor modifications. You’ll just need to:
Update the MongoLoader connection strings in the Pig scripts to connect to your MongoDB collections with one of your own users. If your MongoDB instance is on a non-standard port (any port other than 27017), just email us at firstname.lastname@example.org to allow your Mortar account to access that port.
If you’d like your jobs to write to one of your own S3 buckets, you can update the AWS keys associated with your Mortar account by following these instructions to enable S3 access.
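For the connection-string step above, the one thing that tends to trip people up is special characters in the username or password. Here’s a small Python sketch that builds a MongoDB URI with the credentials percent-encoded. The helper name `mongo_uri`, the host, and the credentials are all made up, and the trailing `database.collection` form is an assumption about what the connector expects, so check it against the connection strings already in the example scripts.

```python
from urllib.parse import quote_plus

DEFAULT_MONGO_PORT = 27017  # MongoDB's standard port


def mongo_uri(user, password, host, database, collection, port=DEFAULT_MONGO_PORT):
    """Build a mongodb:// URI, percent-encoding the username and password."""
    return "mongodb://{}:{}@{}:{}/{}.{}".format(
        quote_plus(user),      # characters like '@' or ':' would break the URI
        quote_plus(password),
        host,
        port,
        database,
        collection,
    )


if __name__ == "__main__":
    # Hypothetical values; substitute your own read-only user and host.
    print(mongo_uri("reader", "p@ss:word", "db.example.com", "twitter", "tweets"))
```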
If you run out of free cluster hours with Mortar, you can upgrade your account to get additional free hours each month.
You can find more resources for learning Pig here.