August 19, 2014

K Young


If you’ve used Hadoop, you know that the overhead of provisioning a cluster and running small jobs can be painful. Most likely you kill the wait on every test run with another coffee, and pretty soon your hands are shaking from all that testing.

It doesn’t have to be like this. As of today you can run small jobs from Mortar in seconds. How? Choose to execute your job without a cluster, and we’ll skip provisioning and distributed computation—so you can get answers fast.

Read More

John Matson


Open source is key to everything we do at Mortar. Our award-winning platform would not be possible without Apache Hadoop or Apache Pig, and it would not be as powerful without Lipstick (open-sourced by Netflix) or Luigi (open-sourced by Spotify).

So we’re always pleased when we can make a meaningful contribution back to the community by open-sourcing something of our own, such as when we extended Pig to work with Python. Now we’re adding more by open-sourcing our code for writing to DynamoDB.
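If you’re curious what that looks like in practice, here’s a minimal sketch of storing a Pig relation to DynamoDB. Treat the jar name, class name, and parameters below as illustrative assumptions rather than the exact API in our repo:

    -- Illustrative sketch: jar, class name, and parameters are assumptions.
    REGISTER 'mortar-dynamodb.jar';

    users = LOAD 's3://my-bucket/users' USING PigStorage('\t')
            AS (user_id:chararray, score:int);

    -- Each tuple becomes a DynamoDB item; Pig field names map to attributes.
    STORE users INTO 'my_dynamodb_table'
        USING com.mortardata.pig.storage.DynamoDBStorage('$AWS_ACCESS_KEY_ID', '$AWS_SECRET_ACCESS_KEY');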

Read More

Cat Miller and Jeremy Karn


You may not know it, but Pig lives all around you. LinkedIn, Twitter, Netflix, Salesforce… These internet giants (and many others) all use Apache Pig to help make sense of the massive amounts of data they generate on a daily basis.

It’s relatively well known that Pig is great for working with unstructured data (Pigs Eat Anything, per the official Apache Pig Philosophy), that it’s flexible and extensible (Pigs Are Domestic Animals), and that it sails through massive data sets with ease (Pigs Fly). That’s all true, but we’ve also stumbled onto several cool features of Pig that aren’t as well known. We compiled the list below to share some of the Piggy goodness.
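To make “Pigs Eat Anything” concrete, here’s the canonical Pig word count: raw text goes in with no schema ceremony, and a handful of lines later you have aggregated results (paths are placeholders):

    -- Load raw text with a minimal schema, then tokenize, group, and count.
    lines   = LOAD 's3://my-bucket/raw-text' AS (line:chararray);
    words   = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    by_word = GROUP words BY word;
    counts  = FOREACH by_word GENERATE group AS word, COUNT(words) AS freq;
    STORE counts INTO 's3://my-bucket/word-counts';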

Read More

May 23, 2014

K Young


People Love Redshift
People love Redshift because it nailed the tech-trifecta: it’s cheap, it delivers, and it’s available instantly with zero commitment.

If you’re not familiar with Redshift, it’s AWS’s on-demand data warehouse. Data warehouses power large-scale reporting and data analysis, and they’re crucial to most sizable businesses. Redshift’s competitors make excellent products, but they cost 10-100x more, their sales and procurement cycles take months, and their contracts lock you in for years.

There’s Just One Problem: Loading
But as anyone who’s tried to use Redshift knows, there’s one glaring problem: it’s a huge pain to load your data into Redshift in the first place. Or it was before today.

Read More

John Matson


A few weeks ago we made an announcement: our recommendation engine, which our engineers and data science advisors built in consultation with select partner companies, is now open source and available to all. The response from the community has been tremendous—tens of thousands of people read our blog post announcing the news. And many of them dove right into our GitHub repo and tutorials and got to work.

If your company is among the many that use MongoDB, we’ve just made the path to generating personalized recommendations even smoother.
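Concretely, the smoother path looks something like this: point Pig straight at your MongoDB collection with the mongo-hadoop connector and declare the fields you need. The connection string and schema below are placeholders:

    -- Read documents out of MongoDB as Pig tuples via mongo-hadoop's loader.
    REGISTER 'mongo-hadoop-pig.jar';

    ratings = LOAD 'mongodb://localhost:27017/mydb.ratings'
              USING com.mongodb.hadoop.pig.MongoLoader('user:chararray, item:chararray, weight:double');
    -- From here, the relation feeds the recommendation engine like any other input.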

Read More

Mark Roddy


One of our primary goals at Mortar is to enable any engineer or data scientist to work with data at scale without having to deal with the complexities of a distributed system like Hadoop. Using Pig to develop data flow jobs gets us much of the way there, but sometimes you need to look under the hood to see what’s happening with your job. Tools like Lipstick help you understand what your job is doing, but it can still be difficult to see why your job is failing or stalled at the MapReduce level. Today we’re happy to announce a new feature set in Mortar that lets you drill down and better understand what’s happening within individual MapReduce jobs.

Read More


Apache Pig

There’s an old saying that explains why you should never wrestle with a pig. The adage, sometimes attributed to playwright George Bernard Shaw, warns that “you get dirty, and besides, the pig likes it.”

Of course, around the Mortar offices, and in data science circles, “Pig” refers to Apache Pig—a programming language designed for analyzing large data sets—more often than it refers to our porcine friends. But we don’t think you should have to wrestle with Pig, either. In fact, you should be bossing that Pig around. That’s why we created a handy, compact reference to speed your development along.

Today we’re proud to share our Pig cheat sheet (pdf) with the community.
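To give you a taste, here are a few of the kinds of one-liners a Pig reference should keep at your fingertips (illustrative examples, not excerpts from the PDF):

    lines   = LOAD 'input.txt' AS (line:chararray);  -- declare a schema at load time
    sampled = SAMPLE lines 0.01;                     -- keep ~1% of rows at random
    ordered = ORDER lines BY line;                   -- total sort across the data set
    first10 = LIMIT ordered 10;                      -- top-N after a sort
    DESCRIBE first10;                                -- print a relation's schema
    ILLUSTRATE first10;                              -- trace sample rows through the script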

Read More

Now that 2013 is coming to a close, we’ve been doing a lot of reflecting. It has been an awesome year at Mortar, and we’ve truly enjoyed trying to bring you the very best Hadoop, Pig, and data science content.

We know it’s tough to keep up with every blog post we write (not to mention the dozens of other blog posts you’ve still got saved for later), so as thanks for keeping tabs on us, we wanted to share our most popular posts from 2013. If you’ve read all of these already, well, I guess we’ll have to get you something even nicer in the New Year.

Read More


For a long time, data scientists and engineers had to choose between leveraging the power of Hadoop and using Python’s amazing data science libraries (like NLTK, NumPy, and SciPy). It was a painful decision, and one we thought should be eliminated.

So about a year ago, we solved this problem by extending Pig to work with CPython, allowing our users to take advantage of Hadoop with real Python (see our presentation here). To say Mortar users have loved that combination would be an understatement.
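The Pig side of that combination looks roughly like the following. File and function names here are placeholders; the UDF itself is an ordinary Python function that’s free to import NLTK, NumPy, or SciPy:

    -- my_udfs.py contains an ordinary CPython function, e.g.:
    --   from pig_util import outputSchema
    --   @outputSchema('sentiment:double')
    --   def score(text): ...
    REGISTER 'my_udfs.py' USING streaming_python AS my_udfs;

    reviews = LOAD 's3://my-bucket/reviews' AS (review_text:chararray);
    scored  = FOREACH reviews GENERATE review_text, my_udfs.score(review_text);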

However, only Mortar users could use Pig and real Python together…until now.

Read More

Following up on his excellent talk on Pig vs. MapReduce, Donald Miner spoke to the NYC Data Science Meetup about using Hadoop for data science. If you can set aside the time to watch, it’s a terrific and detailed talk. However, if you’re pressed for time, you can use our time-stamped summary to skip to specific sections. (Quick note: The video ran out towards the end of the Q&A, but the audio is still perfect.)

Here’s the summary of Don’s talk, with the video, slides, and the full transcript below:

Read More


Netflix kicked off the first session at this summer’s Hadoop Summit, telling the crowd about the Hadoop stack that powers its world-renowned data science practice. The punchline: they run everything on the Amazon Web Services cloud—Amazon S3, Elastic MapReduce (EMR), and their platform-as-a-service, Genie.

Putting S3 at the base of your Hadoop strategy, as Netflix and Mortar have, catapults you past many of the Hadoop headaches others will face. No running out of storage unexpectedly: you get (essentially) infinite, low-cost storage from S3, with frequent price cuts. No need to worry about losing your data: Amazon estimates they might lose one of your objects every 10 million years or so. And best of all, no waiting in line behind other people’s slow jobs: spin up your own personal cluster whenever you want and point it at the same underlying S3 files.
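In practice, “pointing at the same underlying S3 files” is just a matter of using S3 paths in your scripts; any number of clusters can read the same objects concurrently. Bucket and paths below are placeholders:

    -- Two different clusters can run jobs like this against the same objects.
    logs    = LOAD 's3://my-bucket/logs/2013/*' USING PigStorage('\t')
              AS (ts:chararray, user_id:chararray, action:chararray);
    actions = FILTER logs BY action == 'play';
    STORE actions INTO 's3://my-bucket/output/plays';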

A lot of these benefits come directly from S3. It’s a pretty magical technology, and we use it extensively at Mortar. There are some tricks we’ve learned to get the best performance out of it in conjunction with Hadoop. I’m going to share those with you now; some can improve your performance 10x or more.

Read More


Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.

Two of the most prominent Hadoop distributions, Cloudera’s CDH and Hortonworks’ HDP, saw releases this week. There are a few interesting new projects and some details on recent releases (Hive and SyncSort), as well as the normal slew of technical articles about various components in the ecosystem (ZooKeeper, Cassandra, HBase). We’re also celebrating the 20th issue of Hadoop Weekly with our 600th subscriber — thanks for spreading the word!

Technical
ZooKeeper provides a set of powerful primitives for distributed consensus and locking, but there are a lot of edge cases and gotchas to consider when using it. The Apache Incubator project Curator is a framework that addresses most of those edge cases and also implements several common recipes. This blog post talks about some of the edge cases that Curator addresses, which should motivate you to use it rather than the ZooKeeper API directly.
http://blog.cloudera.com/blog/2013/05/zookeeper-made-simpler/

Read More

Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.

There were a lot of exciting announcements this week, including Hortonworks announcing General Availability of the HDP for Windows and Concurrent announcing its new Pattern framework for machine learning on Hadoop. There are also a bunch of interesting technical articles about recent releases — Phoenix, HUE, Kiji, CQL, and more. Hope you enjoy!

Technical
Phoenix is a SQL layer from Salesforce that runs atop Apache HBase. The latest release includes support for skip scans, which improve performance 3x-20x over a batched get. Skip scans use information about the query’s key range to perform server-side skips over uninteresting parts of the range (the exact details are a bit more complex, and there’s a good overview in this post). In addition to the overview, the post includes a performance analysis across a few different dataset characteristics.
http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html

Read More

Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.

Both Apache Hadoop and Apache Hive saw new releases this week, and there are a number of interesting technical articles covering YARN, NFS access to HDFS, and Apache Flume. With so much happening so quickly in the Hadoop ecosystem, it can be difficult to keep up — so please let me know if I missed anything, and I’ll include it next week.

Technical
Apache HDFS is getting support for the Network File System (NFS) protocol. This is an exciting new feature, and one of the authors working on it details the what, why, how, and when of Hadoop’s NFS support, which is being developed in trunk.
http://hortonworks.com/blog/simplifying-data-management-nfs-access-to-hdfs/

Read More

Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.

This week’s newsletter is a little lighter than normal on technical news (some fascinating articles, though!), but there are quite a few interesting releases and upcoming events. Hope you enjoy, and please let me know if you find anything that I missed! Also, thanks to everyone who has been spreading the word about this newsletter — the number of new subscribers each week has been really encouraging.

Technical
LinkedIn has open-sourced a number of big data projects built on Hadoop or designed to coexist with it. In celebration of LinkedIn’s 10th anniversary, this post covers 10 of those projects (such as Voldemort and DataFu), including a brief overview of each.
http://www.hadoopsphere.com/2013/05/hadoops-10-in-linkedins-10.html

Read More