What’s better than a recommendation engine that’s free? A recommendation engine that is both awesome and free.

Today, we’re announcing General Availability for the Mortar Recommendation Engine. Designed by Mortar’s engineers and top data science advisors, it produces personalized recommendations at scale for companies like MTV, Comedy Central, StubHub, and the Associated Press. Today, we’re giving it away for free, and it is awesome.

Read More


Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.

It was a busy week full of news and releases thanks to Hadoop Summit EU, which took place in Amsterdam last week. Hortonworks, IBM, and Cloudera announced new versions of their distributions, and several ecosystem projects, including Oozie and Tez, had new releases. Tajo, the SQL-on-Hadoop system, graduated from the Apache incubator, and there are a number of technical posts covering many different Hadoop-related topics.

Technical
As a YARN application, Apache Tez is easy to deploy (just put a few JARs and config files in HDFS). This post explores the deployment details further, including how the setup makes rolling upgrades easy. It also covers other key aspects of Tez's design: failure handling and global optimization.
https://github.com/t3rmin4t0r/notes/wiki/I-Like-Tez,-DevOps-Edition-(WIP)

Read More

The following is cross-posted from the blog of Mortar Data user Dave Fauth. Dave is a senior architect and systems engineer at Intelliware Systems. You can follow Dave on Twitter at @davefauth.

I’m starting a new series on analyzing publicly available large data sets using Mortar. In the series, I will walk through the steps of obtaining the data sets, writing some Pig scripts to join and massage the data sets, adding in some UDFs that perform statistical functions on the data and then plotting those results to see what the data shows us.

Read More


Hortonworks announced a new round of funding this week, and Intel and Cloudera announced a major new partnership. A lot of money is being put into the Hadoop ecosystem, which is rapidly changing. Many of this week's articles cover the evolving set of frameworks, like Storm and Spark, that make up Hadoop data pipelines.

Technical
The Cloudera blog has a guest post about Spark Streaming from engineers at Sharethrough. The post walks through their migration from a batch-processing system using Scalding to a micro-batch system using Spark Streaming. The new architecture means that data is reflected in the system within seconds rather than an hour. The post goes into some of the technical details and lessons learned during their migration.
http://blog.cloudera.com/blog/2014/03/letting-it-flow-with-spark-streaming/

Read More

Recently Drew Conway gave a great talk to the NYC Data Science Meetup group, examining the field of data science through the lens of the social scientist. Drew knows the intersection of those fields well—he trained as a social scientist, receiving a PhD in political science from NYU in 2013, and is now Head of Data at Project Florida. His talk started with a look at how data scientists draw on skills from different disciplines to approach complex problems (featuring an appearance by his famed Data Science Venn Diagram). Drew then walked through an example from his own research, in which he and his collaborators used Mechanical Turk workers to perform text analysis tasks usually reserved for trained experts.

Read More


Cloudera and Platfora both reported new rounds of funding this week, and MapR and Jaspersoft as well as Cloudera and Trifacta announced new partnerships. In addition, Pivotal introduced a new version of their distribution, Pivotal HD, and Microsoft announced the general availability of the newest version of HDInsight, which includes Apache Hadoop 2.2. With several interesting technical articles, this week’s newsletter should have something for everyone.

Technical
In a tutorial that combines Apache Pig, Cloudera Impala, and Microsoft Power BI, you'll load a dataset describing the on-time performance of US flights over the last 30 years. In a Pig job, each flight record is joined with carrier, plane, and airport data. Next, Pig performs simple aggregate analysis. Finally, the tutorial walks through hooking up Power BI to data retrieved through Cloudera Impala for more advanced analysis.
http://baboonit.be/blog/self-service-bi-with-pig-impala-and-powerbi
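The Pig half of that pipeline follows a familiar join-then-aggregate pattern. A rough sketch of the idea (the file paths and field names here are assumptions, not the tutorial's exact schema):

```pig
-- Illustrative only: join flight records to a carrier lookup table,
-- then compute a simple aggregate per carrier.
flights  = LOAD 'ontime.csv' USING PigStorage(',')
           AS (flight_date:chararray, carrier_code:chararray, arr_delay:int);
carriers = LOAD 'carriers.csv' USING PigStorage(',')
           AS (carrier_code:chararray, carrier_name:chararray);

joined = JOIN flights BY carrier_code, carriers BY carrier_code;

-- Simple aggregate analysis: average arrival delay per carrier.
by_carrier = GROUP joined BY carriers::carrier_name;
avg_delay  = FOREACH by_carrier GENERATE
                 group AS carrier_name,
                 AVG(joined.flights::arr_delay) AS mean_arr_delay;
DUMP avg_delay;
```

From there, the tutorial's Impala and Power BI steps pick up where the Pig output leaves off.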

Read More

This post is the second in a series about translating SQL concepts into Pig. You can read the first post here.

SQL is a state of mind. Even for the Pig enthusiast who has gained some comfort in the porcine language, problem solving using SQL patterns may still be second nature. What could be a frustrating tendency of thinking in the wrong language is actually a boon, as long as you have some quick translations handy. Conveniently, we’ve got a SQL->Pig Cheat Sheet for just this purpose.

SELECT * FROM

Though SQL examples of SELECT * are everywhere, for some reason it's not very often that a Pig example uses the equivalent syntax. Never fear: the convention does exist.
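For example, assuming a hypothetical users dataset (the path and schema below are made up), the rough Pig analog of SELECT * FROM users is to load the relation and project every field:

```pig
-- SQL: SELECT * FROM users;
users     = LOAD 's3n://my-bucket/users' USING PigStorage('\t')
            AS (user_id:chararray, name:chararray, signup_date:chararray);
-- An identity FOREACH projects every field, star and all:
all_users = FOREACH users GENERATE *;
DUMP all_users;
```

In practice the bare alias already holds every field, so the FOREACH … GENERATE * step mainly matters when you're mixing projection with other operations.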

Read More


This issue of Hadoop Weekly is overflowing with top-notch technical articles. There's coverage of several parts of the ecosystem, from ZooKeeper to Oozie to YARN. In addition, Kafka, ZooKeeper, and Tez saw releases this week, and the new features in the Kafka and Tez releases were detailed in depth.

Technical
Episode 19 of the All Things Hadoop podcast has an interview with Adam Fuchs, Apache Accumulo PMC member and committer. The podcast covers the Accumulo data model, implementation, client-server architecture and more.
http://allthingshadoop.com/2014/03/13/big-data-with-apache-accumulo-preserving-security-with-open-source//

Read More

Whether you are a seasoned expert or a first-time user, we’re making it easier to work with Pig. Today we are proud to announce the load statement generator, a web tool that will craft your load statement, which imports your data into Pig, so that you can quickly move forward to a more exciting task.
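For context, the load statement is the bit of boilerplate the tool writes for you. Hand-crafted, one might look something like this (the path, delimiter, and fields here are purely illustrative):

```pig
-- Load a hypothetical tab-delimited file, declaring a schema so that
-- downstream statements can reference fields by name and type.
songs = LOAD 's3n://my-bucket/songs.tsv'
        USING PigStorage('\t')
        AS (song_id:int, title:chararray, artist:chararray, plays:long);
```

Getting the delimiter, field names, and types right by hand is tedious, which is exactly the drudgery the generator is meant to remove.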

Read More



Credit: Xavier Snelgrove via Wikimedia Commons

REPLs are a valuable tool for learning and exploration in a language. The “read-eval-print loop” provides immediate feedback and is easy to reset to a clean slate.

Mortar now provides a local Pig REPL, available through the mortar command line without needing to go through the considerable time and effort of installing Pig yourself.
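If you haven't used a Pig REPL before, a session works much like Pig's standard Grunt shell: define a relation, inspect it, and iterate. An illustrative exchange (the file and schema are made up):

```pig
-- Load a relation, check its schema, and peek at a few rows.
grunt> data    = LOAD 'excite-small.log'
                 AS (user:chararray, time:long, query:chararray);
grunt> DESCRIBE data;
grunt> sample  = LIMIT data 5;
grunt> DUMP sample;
```

Because nothing executes until a DUMP or STORE, you can build up a script interactively and keep refining it without re-running earlier steps.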

Read More


Hadoop Weekly surpassed 2,000 subscribers this week, and I’d like to mark the occasion by again thanking everyone who writes fantastic articles which make this newsletter possible. This issue is full of really interesting technical articles including a number that discuss the ever-expanding Hadoop ecosystem. Enjoy!

Technical
The Hortonworks blog has a post describing an integration between ElasticSearch, Flume, and Hadoop. The post includes the technical details for deploying the system, including Kibana, an open-source UI for timestamped data stored in ElasticSearch. In addition to inserting data into ElasticSearch via Flume, the post covers the MapReduce-based libraries for batch-indexing data for ElasticSearch.
http://hortonworks.com/blog/configure-elastic-search-hadoop-hdp-2-0/

Read More

One of our primary goals at Mortar is to enable any engineer or data scientist to work with data at scale without having to deal with the complexities of a distributed system like Hadoop. Using Pig to develop data-flow jobs goes a long way toward achieving this goal, but there are times when you need to look under the hood to know what's happening with your job. Tools like Lipstick help you understand what your job is doing, but it can still be difficult to see why your job is failing or stalled at the MapReduce level. Today we're happy to announce a new feature set in Mortar that lets you drill down and better understand what's happening within individual MapReduce jobs.

Read More


Apache Hadoop 2.3.0 was released this week. It’s the first release since Hadoop 2 was declared GA last October. This week has a number of technical articles from folks sharing details on their big data pipelines, which I always find interesting.

Technical
A post entitled "Analytics at GitHub" describes the evolution of the GitHub analytics stack from a Rails- and Cassandra-based system to one that uses Kestrel, S3, and Hadoop to process data, which is then stored in Cassandra and served via Rails. The post follows the repository traffic graphs feature as a running example, but the system it describes is general purpose.
http://johnnunemaker.com/analytics-at-github/

Read More


Apache Pig

There’s an old saying that explains why you should never wrestle with a pig. The adage, sometimes attributed to playwright George Bernard Shaw, warns that “you get dirty, and besides, the pig likes it.”

Of course, around the Mortar offices, and in data science circles, “Pig” refers to Apache Pig—a programming language designed for analyzing large data sets—more often than it refers to our porcine friends. But we don’t think you should have to wrestle with Pig, either. In fact, you should be bossing that Pig around. That’s why we created a handy, compact reference to speed your development along.

Today we’re proud to share our Pig cheat sheet (pdf) with the community.
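For a taste of the kind of thing the sheet covers, a handful of relational operators carry most Pig scripts. Here's the canonical word count (the input path is illustrative):

```pig
-- Split each line into words, group identical words, and count them.
lines  = LOAD 'input.txt' AS (line:chararray);
words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grpd   = GROUP words BY word;
counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS n;
STORE counts INTO 'wordcount_out';
```

Five statements, no boilerplate: that's the kind of bossing-around the cheat sheet is meant to make routine.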

Read More


Don't reinvent the wheel

Many data scientists are working with data gathered from human beings using web applications. If they’re lucky, that data was gathered intentionally and is relatively clean.

Fortunately for most of these data scientists, human behavior is fairly consistent. And most of these data scientists have fairly similar goals for that data—which isn’t a surprise, given the origin of the data. For example, they might be trying to figure out how to encourage more people to subscribe to a service, or to finish the checkout process, or to buy more items on an e-commerce site. For the most part, we use data gathered from behavior in applications to make those applications better.

Read More