Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.

Both Apache Hadoop and Apache Hive shipped new releases this week, and there are a number of interesting technical articles covering YARN, NFS access to HDFS, and Apache Flume. With so much happening so quickly in the Hadoop ecosystem, it can be difficult to keep up, so please let me know if I missed anything, and I’ll include it next week.

Technical
Apache HDFS is getting support for the Network File System (NFS) protocol. This is an exciting new feature, and one of its authors details the what, why, how, and when of Hadoop’s NFS support, which is being developed in trunk.
http://hortonworks.com/blog/simplifying-data-management-nfs-access-to-hdfs/

Read More

This week’s newsletter is a little lighter than normal in technical news (some fascinating articles, though!), but there are quite a few interesting releases and upcoming events. Hope you enjoy, and please let me know if you find anything that I missed! Also, thanks to everyone who has been spreading the word about this newsletter; the number of new subscribers each week has been really encouraging.

Technical
LinkedIn has open-sourced a number of big data projects built on or to coexist with Hadoop. In celebration of LinkedIn’s 10th anniversary, this post covers 10 of those projects (such as Voldemort and DataFu), including a brief overview of each.
http://www.hadoopsphere.com/2013/05/hadoops-10-in-linkedins-10.html

Read More

If you want Hilary Mason, Drew Conway, or Max Shron to build your recommender for free, enter your email address here.

Recommender system

Running a platform for working with data, we’ve seen users tackle lots of interesting use-cases: log analysis, natural language processing, pattern detection, and many more.

However, perhaps no use-case is in greater demand than recommender systems.  If you have more “inventory” than your users can easily find (whether it’s news, jobs, videos, restaurants, vacations, recipes, apps, etc.), a great recommender is crucial to driving engagement.

The problem is that recommender systems are really hard to implement, so most companies either don’t have one or aren’t happy with what they have.

What makes recommenders so tough?

Read More

There were two big and exciting releases this week from Hadoop vendors — Cloudera with Impala and MapR with M7. In addition, this week marks the 500th subscriber to Hadoop Weekly! Thanks everyone for subscribing, and please send anything my way that you think might make a good addition to this newsletter.

Technical
In the third and final part of his “Introduction to Hadoop” series, Tom White covers higher-level frameworks, anatomy of a Hadoop cluster, and data application pipelines. In terms of frameworks, he covers Pig, Hive, and Crunch (there’s a nice example of computing top-K with Crunch).
http://www.drdobbs.com/database/introduction-to-hadoop-real-world-hadoop/240153375/

Read More

The last full week of April was pretty busy for the Hadoop ecosystem — two core projects (Hadoop and HBase) saw releases, there was also some exciting funding news (congrats to Qubole!), and there were plenty of interesting technical articles.

Technical

The naming of components in Hadoop-related projects has often caused confusion (e.g. HDFS’s secondary namenode). Apache HBase is no exception: the HMaster is often misunderstood because, contrary to what its name suggests, not all writes go through it. This article elaborates on the role of the HMaster in HBase.
https://blogs.apache.org/hbase/entry/hbase_who_needs_a_master

Read More

There were a number of exciting announcements and releases this week (e.g. Hadoop on OpenStack, Impala 0.7) as well as some fantastic technical articles and tutorials. It’s great to see more technical articles about how folks are doing things with Hadoop. This week they cover Hadoop internals, data formats, and MapReduce-based mobile UI customization. A big thanks to those who share their insights and experiences; you make this newsletter possible!

Read More

This week’s newsletter features fewer releases than normal (let me know if I missed something!) but has a lot of interesting technical articles. In addition, I’m pleased to announce the return of an events section. Thanks to the folks at Mortar Data for curating this list! They’ve found a number of great Hadoop-related events taking place all over the world this week.

Technical
Apache Pig provides support for expressive SQL-like join operations. In this post, Matthew Rathbone shows how to implement a left-outer join in Pig and write a unit test to check for correctness. This is his third article that demos a framework — he previously covered MapReduce and Hive. This trifecta is quite an interesting comparison, so be sure to read all three if you missed the previous articles.
http://blog.matthewrathbone.com/2013/04/07/real-world-hadoop---implementing-a-left-outer-join-in-pig.html
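
For readers unfamiliar with the operation, here is a minimal Python sketch of what a left-outer join computes. It is not the Pig code from the article; the function name `left_outer_join` and the sample `transactions` and `users` records are made up for illustration. Every row from the left input is kept, and rows with no match on the right are paired with `None`.

```python
from collections import defaultdict


def left_outer_join(left, right, key):
    """Pair every row in `left` with its matches in `right` on `key`.

    Left rows without a match are paired with None, which plays the
    role of the NULL columns a SQL or Pig left-outer join would emit.
    """
    # Index the right-hand rows by join key for constant-time lookup.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    result = []
    for lrow in left:
        # Fall back to a single None so unmatched left rows still appear.
        for rrow in index.get(lrow[key], [None]):
            result.append((lrow, rrow))
    return result


# Hypothetical sample data: purchases joined to user records.
transactions = [
    {"user_id": 1, "product": "book"},
    {"user_id": 3, "product": "pen"},
]
users = [{"user_id": 1, "location": "NYC"}]

rows = left_outer_join(transactions, users, "user_id")
for lrow, rrow in rows:
    print(lrow, rrow)
```

The transaction for `user_id` 3 has no matching user, so it still appears in the output, paired with `None`. That retained row is the behavior a unit test for a left-outer join would typically check.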

Read More

April 9, 2013

Our second NYC Data Science Meetup featured Tumblr data scientist Adam Laiacano, who discussed the analytics stack at Tumblr and the tools he and his team use to organize and analyze data. 

Here are the video and slides from Adam’s talk, which cover Tumblr’s use of Scribe, Hive & Pig, Hue, and Vowpal Wabbit:

Read More

Happy 7th birthday to Apache Hadoop! The first release of Hadoop was made in April 2006. This week’s newsletter caps that anniversary by representing many parts of the Hadoop ecosystem. It’s quite impressive how far the project and the ecosystem have come in those 7 short years!

News
April 2nd marked the 7-year anniversary of the first release of Apache Hadoop. In this post, Doug Cutting (the founder of Hadoop) provides 7 thoughts and predictions for Hadoop. He touches on everything from open source to the name of the project to where he sees Hadoop heading in the next 7 years.
http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/

Read More

Hadoop Weekly is a new (recurring) guest post by Joe Crobak.  Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics.  You can follow Joe on Twitter at @joecrobak.

News
Apache Hadoop’s Distributed File System and MapReduce were originally based upon research papers written by Google. Google owns a number of patents in these spaces, including 10 related to MapReduce. This week, they pledged “not to sue any user, distributor or developer of open-source software on specified patents, unless first attacked.”
http://google-opensource.blogspot.com/2013/03/taking-stand-on-open-source-and-patents.html

Read More

Thanks to everyone who came out to our inaugural NYC Data Science Meetup.  For those who couldn’t attend, Hilary Mason fought off jetlag and a tough cold to give a great presentation.

Below is a 12-minute clip from Hilary’s talk, which she called “Dirty Secrets of Data Science.”

Read More

March 1, 2013

For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.

As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, the data is often so large that working with it in a traditional SQL database is too slow to be practical.

Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)

This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.

Read More

February 14, 2013

You have MongoDB, so you have this tremendously scalable database. You’re collecting a ton of data, but you know you need to do more with it (okay, a lot more). You think you want to use Hadoop, but it doesn’t sound easy.

To keep it simple, we’ve divided the article into three parts:

  1. “WHY” explains the reasons for using Hadoop to process data stored in MongoDB
  2. “HOW” helps you get set up
  3. “DEMO” shows you MongoDB and Hadoop working together. If you’re a tl;dr type, you’ll want to start with this section.

Read More

New York’s data science community has been building since long before “data science” was used to describe it. In addition to a long history of advertising and adtech companies, the recent startup explosion here in NYC has been largely led by companies built to leverage data science (including Foursquare, Tumblr, AppNexus, and Knewton, to name just a few).

Read More

Working with data is HARD.  Let’s face it, you’re brave to even attempt it, let alone make it your everyday job.

Fortunately, some incredibly talented people have taken the time to compile and share their deep knowledge for you.

Here are 7 books we recommend for picking up some new skills in 2013:

Read More