One of our primary goals at Mortar is to enable any engineer or data scientist to work with data at scale without having to deal with the complexities of a distributed system like Hadoop. Using Pig to develop data flow jobs goes a long way toward achieving this goal, but there are times when you need to look under the hood to know what’s happening with your job. Tools like Lipstick help you understand what your job is doing, but it can still be difficult to understand why your job is failing or not making progress at the MapReduce level. Today we’re happy to announce a new feature set in Mortar that lets you drill down and better understand what’s happening within individual MapReduce jobs.
Apache Hadoop 2.3.0 was released this week. It’s the first release since Hadoop 2 was declared GA last October. This week has a number of technical articles from folks sharing details on their big data pipelines, which I always find interesting.
A post entitled “Analytics at GitHub” describes the evolution of the GitHub analytics stack from a Rails- and Cassandra-based system to one that uses Kestrel, S3, and Hadoop to process data, which is then stored in Cassandra and served via Rails. The post follows the repository traffic graphs feature, but the system it describes is general purpose.
There’s an old saying that explains why you should never wrestle with a pig. The adage, sometimes attributed to playwright George Bernard Shaw, warns that “you get dirty, and besides, the pig likes it.”
Of course, around the Mortar offices, and in data science circles, “Pig” refers to Apache Pig—a programming language designed for analyzing large data sets—more often than it refers to our porcine friends. But we don’t think you should have to wrestle with Pig, either. In fact, you should be bossing that Pig around. That’s why we created a handy, compact reference to speed your development along.
Many data scientists are working with data gathered from human beings using web applications. If they’re lucky, that data was gathered intentionally and is relatively clean.
Fortunately for most of these data scientists, human behavior is fairly consistent. And most of these data scientists have fairly similar goals for that data—which isn’t a surprise, given the origin of the data. For example, they might be trying to figure out how to encourage more people to subscribe to a service, or to finish the checkout process, or to buy more items on an e-commerce site. For the most part, we use data gathered from behavior in applications to make those applications better.
Personalized recommendations drive business, guiding people to the products they want, the news they need, and the videos and music they didn’t know they would love. Recommendations boost revenue and engagement tremendously at the businesses we admire: Netflix (75% of videos watched are from recommendations), Amazon (35% of sales driven by “Customers Who Bought This” and “Frequently Bought Together”), and LinkedIn (50% of new connections from “People You May Know”), to name just a handful.
Despite the obvious advantages, many companies either don’t have recommendations or don’t leverage their data to make good ones. When they look at off-the-shelf recommendation engines, they see black boxes that won’t customize to their business, lock them in to proprietary technology, and charge outsized upfront and ongoing costs. But when they look at building their own engine from scratch, the time, effort, and data science talent required for development is daunting.
Today, we’re changing that.
I only found a few technical articles this week (please send anything my way if I missed it!), but there were several releases from the Hadoop ecosystem to read up on and try out. Of note, HBase 0.98 was released with an impressive set of new features and performance improvements. Elsewhere, the Apache Spark project passed a vote to graduate from the incubator, MapR announced that they’re expanding into Korea, and Intel announced a partnership with Big Data Partnership in Europe.
Hue is a web application for interacting with Hadoop. It’s included with distributions from many of the leading Hadoop vendors. Hue’s authentication layer is pluggable, supporting OpenID, OAuth, LDAP, and more. A post on the Cloudera blog walks through the steps necessary to configure Hue to use LDAP as the authentication backend.
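As a rough illustration of what that configuration involves (the hostnames and DN values here are placeholders, and your `hue.ini` layout may differ by version), switching Hue’s backend to LDAP looks something like:

```ini
# hue.ini — point Hue's pluggable auth layer at an LDAP server (example values)
[desktop]
  [[auth]]
    # Replace the default Django backend with the LDAP backend
    backend=desktop.auth.backend.LdapBackend
  [[ldap]]
    ldap_url=ldap://ldap.example.com
    base_dn="dc=example,dc=com"
    [[[users]]]
      # Restrict which directory entries count as Hue users
      user_filter="objectclass=posixAccount"
      user_name_attr=uid
```

The Cloudera post covers the full set of options, including search-bind vs. direct-bind authentication and group synchronization.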
There were a lot of product releases and announcements in the ecosystem this week as folks met in Santa Clara for StrataConf. Among the highlights were announcements from MapR related to their distribution and a partnership with HP, a new beta of Cloudera’s CDH5, and the public preview of Hadoop 2 on Windows Azure. In addition, there are a number of interesting technical articles about HBase, MapReduce v2, Pig, Hadoop security, and more. Congrats to folks on all the releases, partnerships, and great articles. Also, a big congrats to Splice Machine for raising a new round of financing.
The Cloudera blog has an interesting technical post about performance in MRv2. The post describes some of the major revamps that took place in MRv2, along with some performance regressions found by running the same jobs on both MRv1 and MRv2. It walks through the low-level debugging that was done to identify the root cause of two of the issues, and it explains the fixes that were made. It’s a pretty technical overview, including discussion of the `perf` tool, CPU cache latency, fadvise, and more.
StrataConf is this week in Santa Clara, and there were a lot of announcements this week in anticipation. Among the most notable, Cloudera announced new packaging for their enterprise software and support for Apache Spark 0.9.0. Spark was a hot topic this week—version 0.9.0 was released and it was covered in an interview with Doug Cutting and on the Monash Research blog. We also get a peek inside several data pipelines this week with a post covering big data at Stripe, Tapad, Etsy, and Square as well as details on the Hootsuite log pipeline. Also don’t miss Ramya Sunil’s post on women in the Hadoop open-source ecosystem.
The Hortonworks blog has a walkthrough of building an HDP cluster on Amazon Web Services’ EC2. The tutorial starts with creating a custom AMI, then covers installing password-less private keys (something you likely don’t want to do for a production system), starting the Apache Ambari server, and using Ambari to provision the cluster. The tutorial is loaded with screenshots to help you on your way.
There were a lot of big announcements this week, including the release of MapR 3.1 and Altiscale’s public launch. There were also a number of articles around the promise of Hadoop 2.0/YARN—hopefully folks will start sharing their production success stories soon, too. And Next Big Sound has shared a detailed look into their big data architecture, which is one of several interesting technical reads. Enjoy!
Eric Czech, Chief Architect of Next Big Sound, has written about the evolution of their big data infrastructure. The post is full of interesting information from their network topology to colo provider to how they store and version data in HDFS to analysis with Pig to serving data stored in HBase using Finagle.
Sometimes it’s nice to raise the code quality bar a smidge. Just because something seems to work OK doesn’t mean that it’s the best way.
The Single-Row Cross Anti-Pattern
Suppose you sell items, and you want to know how much each item contributes to total sales. You could do this:
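The original code sample appears to be missing here. A minimal Pig sketch of the pattern the heading describes—CROSSing every row with a single-row total—might look like this (the relation and field names are assumed for illustration):

```pig
-- Compute each item's share of total sales via an explicit CROSS
sales   = LOAD 'sales.tsv' AS (item:chararray, amount:double);
grouped = GROUP sales ALL;                 -- one group containing every row
totals  = FOREACH grouped GENERATE SUM(sales.amount) AS total;  -- single row
crossed = CROSS sales, totals;             -- attaches the total to every row
shares  = FOREACH crossed GENERATE item, amount / total AS share;
```

This works, but the explicit CROSS is the anti-pattern in question: later Pig releases let you project a single-row relation as a scalar (e.g. `sales.amount / (double) totals.total`), which avoids materializing the cross product.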
Hortonworks announced this week that HDP 2.0 for Windows is GA, which brings YARN to Windows. This week’s issue also contains two articles about Hadoop security—a topic that’s been discussed a lot in recent weeks. Software maintainers were quite busy the past week or so, too—I’ve highlighted nine releases in this issue. Overall, there’s quite a bit of interesting content this week, so enjoy!
When optimizing or debugging Hadoop, it can be really useful to understand the underlying architecture. This post covers HDFS’s block replication and placement policy. After walking through the default placement algorithm, it goes through an example of writing files and inspecting blocks on a cluster spread across three racks. The post covers using `hadoop fsck` to find block locations, which is a really useful tool for administrating HDFS.
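For reference, the `fsck` invocation in question takes an HDFS path plus flags that control how much block detail is reported (the path below is just an example):

```
# Show each block of a file, the datanodes holding its replicas, and their racks
hadoop fsck /user/example/data.txt -files -blocks -locations -racks
```

Running it against a directory recursively checks everything beneath it, which makes it handy for spotting under-replicated or mis-placed blocks across a dataset.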
I really appreciate the positive response from everyone as I highlighted the one-year anniversary of Hadoop Weekly last week—thanks! I expect the next 12 months to be just as busy and exciting! Speaking of, this week’s issue features some great content: an interesting story about Hike’s Hadoop infrastructure, a detailed benchmark of Cloudera’s Impala, and details on this year’s HBaseCon. There were also a number of interesting product releases, including a Hadoop FileSystem implementation of the Google Cloud Storage service.
Doug Cutting, creator of Apache Hadoop, recently spoke at The Hive (not to be confused with Apache Hive) about “The Future of Data.” At the talk, Doug made a number of predictions about the future of big data rooted in Hadoop. The predictions—e.g., that Hadoop will take on OLTP workloads and threaten interactive-analytics data warehouse systems—aren’t necessarily new (Cloudera has been touting the Enterprise Data Hub since Strata NYC). It’s worthwhile to read through some of Doug’s ideas, though—he’s a visionary in big data.
Our CEO, K Young, had a fun interview over on Joe Stein's All Things Hadoop podcast, and so we thought we’d transcribe it and share it for those of you who might have missed it. If you want more background on Mortar and where we came from, this is a great place to start.
Here’s a link to the audio, or you can just read the transcript below:
> Hello, and welcome to the All Things Hadoop Podcast. I’m your host, Joe Stein, founder and principal consultant of Big Data Open Source Security LLC. This is episode 11, a talk with K Young, co-founder and CEO of Mortar Data. And now, onto the show.
JS: I’d like to welcome to the podcast K Young. K is the CEO and Founder of Mortar. Welcome, K.
KY: Hi, thanks.
JS: So, tell us; how did you get into Hadoop?
This issue marks a complete year (well, 52 weeks) of Hadoop Weekly. Thanks to all the authors and writers for generating so much content—without it this newsletter couldn’t exist. And I’d also like to thank all the subscribers for spreading the word! It’s pretty awesome that over 1500 people receive this weekly email.
With that said, there’s a great issue below. SQL-on-Hadoop continues to heat up with updates on the Stinger Initiative, Presto-as-a-Service from Qubole, and IBM’s Big SQL. There are also some updates on writing YARN applications, and there are a handful of releases and new projects to check out.
Cloudera employee and former Oracle DBA, Gwen Shapira, has written an FAQ for Oracle DBAs thinking about learning Hadoop. The post covers four questions, including some tips for important skills if you don’t have any Hadoop experience but are looking for a Hadoop job.
Welcome to the first issue of Hadoop Weekly in 2014. This issue features a few year-in-review/2014 preview articles and a couple of posts tying that theme to Hadoop 2/YARN. There are also a number of interesting technical articles covering running Hadoop in Linux containers, HBase performance tuning, R & Hadoop, and more. From a release standpoint, there are two interesting new projects to try out—SIMR for running Spark on a MapReduce-v1 cluster and PigPen for writing MapReduce jobs (which are translated to Pig) in Clojure.
Hue, the front-end for several components in the Hadoop ecosystem, has added a Spark application. The app takes advantage of the Spark Server’s REST API to submit jobs and monitor status. The Hue blog has an introduction describing the functionality and required setup/configuration. There’s also a short video demoing the new application.