Hadoop Weekly is a recurring guest post by Joe Crobak. Joe is a software engineer focused on Hadoop and analytics. You can follow Joe on Twitter at @joecrobak.

Two large pieces of news this week: HP announced a $50 million investment in Hortonworks as part of an expanded partnership between the two companies, and Apache Tez graduated from the Apache Incubator. Additionally, there were a number of interesting technical posts this week on Pig, MapR FS, SQL on Hadoop, HDFS, and more.

Technical
The Hortonworks blog has a post highlighting some of the new features of the recently released Apache Pig 0.13. The 0.13 release adds preliminary support for multiple backends (e.g., Tez or Spark rather than MapReduce). The post covers several new features, including new optimizations for small jobs, the ability to whitelist/blacklist certain operators, a user-level jar cache, and support for Apache Accumulo.
http://hortonworks.com/blog/announcing-apache-pig-0-13-0-usability-improvements-progress-toward-pig-tez/


This week is full of releases and new products, ranging from Oracle's new Hadoop-SQL product to a new CDH 5.1 release from Cloudera to new tools from Continuuity for transactions on HBase and from SequenceIQ for deploying Hadoop-as-a-Service. There are also a number of quality technical articles covering Spark, Kafka, Luigi, and Hive.

Technical
This post covers using the Transformer class to manipulate data as it flows into Sqrrl Enterprise. It details loading the Enron email dataset and using a Transformer to build a graph of users sending email. It includes the code for this Transformer as well as some examples of querying the dataset using tools found in Sqrrl Enterprise.
http://blog.sqrrl.com/bulk-loading-in-sqrrl-pt.2-custom-transformers-for-graph-construction
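The graph-construction idea at the heart of that post can be sketched independently of Sqrrl's Transformer API. Here's a minimal Python illustration (the record fields are hypothetical, not Sqrrl's actual schema) that turns email records into weighted sender-to-recipient edges:

```python
from collections import defaultdict

def build_email_graph(messages):
    """Build a weighted sender -> recipient edge map from email records.

    Each message is a dict with hypothetical 'from' and 'to' fields;
    the weight on an edge counts how many messages were sent.
    """
    edges = defaultdict(int)
    for msg in messages:
        sender = msg["from"]
        for recipient in msg["to"]:
            edges[(sender, recipient)] += 1
    return dict(edges)

# A toy slice of Enron-like data
messages = [
    {"from": "alice@enron.com", "to": ["bob@enron.com", "carol@enron.com"]},
    {"from": "alice@enron.com", "to": ["bob@enron.com"]},
]
graph = build_email_graph(messages)
print(graph[("alice@enron.com", "bob@enron.com")])  # 2
```

In the post's pipeline, the Transformer performs the analogous mapping as records are bulk-loaded, rather than returning an in-memory dict.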


This week was fairly low-volume (at least in recent memory), but there are some good technical articles covering Hive, the Kite SDK, Oozie, and more. Also, the videos from HBaseCon were posted, and there were a number of ecosystem project releases.

Technical
The Pivotal blog has a post on setting up Pivotal HD, HAWQ (a data warehouse), and GemFire XD (an in-memory data grid) inside VMs using Vagrant. The four-node virtual cluster is set up with a single command, and the post has more info on the configuration and the tools installed as part of the setup.
http://blog.gopivotal.com/pivotal/products/1-command-15-minute-install-hadoop-in-memory-data-grid-sql-analytic-data-warehouse

John Matson


Recently data strategist Max Shron of Polynumeral spoke to the NYC Data Science Meetup about techniques for improving focus, communication, and results in data science campaigns, including the CoNVO process detailed in his book Thinking with Data. It was a terrific and extremely practical talk, a video of which is embedded below. (If you’re pressed for time, you can use our time-stamped summary to skip to specific sections of the talk.)

Here’s the summary of Max’s talk, with video, slides, and the full transcript below:


I was expecting a dearth of content to match the short week in the US for July 4th. But with Spark Summit this week in San Francisco, there were a number of partnerships, new tools, and other announcements. Both Databricks and MapR announced influxes of cash this week, and there was a lot of discussion about the future of Hive given a joint announcement by Cloudera, Databricks, IBM, Intel, and MapR to build a new Spark backend for Hive. In addition to that, Apache Hadoop 2.4.1 and Apache Pig 0.13.0 were released, and Flambo, a new Clojure DSL for Spark, was unveiled.

Technical
Pivotal HD and HAWQ support the Parquet file format natively in HDFS. This tutorial shows how to build a Parquet-backed table with HAWQ and then access the data stored in HDFS using Apache Pig.
http://www.pivotalguru.com/?p=727


Google made news this week by proclaiming that MapReduce is dead at Google—there are two reactions in this week’s issue. And with that in mind, there are several good posts covering non-MapReduce projects in the Hadoop ecosystem—Accumulo, HDFS, Storm, Spark, and more. Apache Storm also released a new version this week, and there were announcements from Hortonworks, IBM, and RainStor about their Hadoop-related products.

Technical
Apache Accumulo, the distributed key-value store, supports bulk loading of data in its native format, RFile. Loading data as RFiles, which can be generated via MapReduce jobs, is much more efficient than loading the same data one record at a time. The Sqrrl blog talks about some tools they’ve built to load data using RFiles from data stored in JSON and CSV.
http://sqrrl.com/bulk-loading-sqrrl-pt-1-basics/
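RFiles themselves are written with Accumulo's Java APIs, but the key requirement behind the efficiency win is that entries arrive pre-sorted, which is exactly what a MapReduce shuffle provides. A rough Python sketch of the CSV-to-sorted-key-value step (a simplification: real Accumulo keys include column family, qualifier, visibility, and timestamp):

```python
import csv
import io

def csv_to_sorted_kv(csv_text, row_field, value_fields):
    """Convert CSV records into (key, value) pairs sorted by key.

    Bulk-load file formats like Accumulo's RFile require their entries
    in sorted order; in a real job, a MapReduce shuffle performs this
    sort. Keys here are (row, column) tuples, a simplification of
    Accumulo's full key structure.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs = []
    for record in reader:
        row = record[row_field]
        for col in value_fields:
            pairs.append(((row, col), record[col]))
    return sorted(pairs)

data = "id,name,city\n2,bob,nyc\n1,alice,sf\n"
pairs = csv_to_sorted_kv(data, "id", ["name", "city"])
for key, value in pairs:
    print(key, value)
```

Sorting up front is what lets the server import the files directly instead of inserting one record at a time through the write path.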


Security in Hadoop is a big topic in this week’s issue—there’s coverage from Accumulo Summit, and posts from both Cloudera and Hortonworks on the topic. This week’s issue also covers technical posts on the Kiji Framework, Hadoop + Docker, and Etsy’s predictive modeling software, Conjecture.

Technical
Slides from the recent Accumulo Summit have been posted online. There are 17 presentations from folks at Cloudera, Hortonworks, Sqrrl, and more. Topics include Accumulo and YARN, the Accumulo community, and security for Accumulo.
http://www.slideshare.net/AccumuloSummit


With Hadoop Summit in recent memory, there are several posts from or summarizing the summit in this week's newsletter. Technical articles cover a wide range of topics, from Hive and Pig tips to logging infrastructure at Loggly. SQL-on-Hadoop was also a big topic this week, with discussions about whether it's needed to drive Hadoop adoption.

Technical
The Mortar blog has a post with some tips for using Apache Pig. It covers some lesser-known features of Pig such as writing UDFs in JavaScript, data sampling, and casting a relation to a scalar. If you use Pig and are looking to level up your game, this is a great place to start.
http://blog.mortardata.com/post/88485590701/13-things-you-didnt-know-you-could-do-with-pig

Cat Miller and Jeremy Karn


You may not know it, but Pig lives all around you. LinkedIn, Twitter, Netflix, Salesforce… These internet giants (and many others) all use Apache Pig to help make sense of the massive amounts of data they generate on a daily basis.

It’s relatively well known that Pig is great for working with unstructured data (Pigs Eat Anything, per the official Apache Pig Philosophy), that it’s flexible and extensible (Pigs Are Domestic Animals), and that it sails through massive data sets with ease (Pigs Fly). That’s all true, but we’ve also stumbled onto several cool features of Pig that aren’t as well known. We compiled the list below to share some of the Piggy goodness.


Hadoop Summit was this week in San Jose, so this week’s newsletter is full of lots of interesting technical content and news. I tried to capture as much as I could, but there is just an overwhelming amount. Enjoy!

Technical
Hivemall is a machine learning library for Apache Hive. Implemented as Hive UDFs, it's easy to try out (just add the jar to a Hive session), contains a number of machine learning algorithm implementations (including several not found in other Hadoop libraries), and can iterate without launching multiple MapReduce jobs. These slides from a talk at Hadoop Summit provide many more details, including on the implementation.
http://www.slideshare.net/myui/hivemall-hadoop-summit-2014-san-jose
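To make the UDF framing concrete: scoring with a trained logistic-regression model is a pure per-row function, which is why it maps so naturally onto a Hive UDF. A hedged Python sketch (illustrative only, not Hivemall's implementation; the feature names and weights are made up):

```python
import math

def predict_lr(features, weights):
    """Conceptual logistic-regression scoring, UDF-style: one row in,
    one probability out. Both arguments map feature name -> value."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

weights = {"age": 0.02, "clicks": 0.5}  # hypothetical trained model
row = {"age": 30, "clicks": 3}          # one input row
print(round(predict_lr(row, weights), 3))  # ≈ 0.891
```

Because each call is independent, Hive can apply such a function to every row of a table in a single scan, with no cross-row state to manage.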

Doug Daniels


Fulfilling a user request for a new feature always feels good. Implementing your users’ two most-requested features at once? Well, that just feels fantastic.

Today we’re announcing two huge new features: Global Availability for Mortar and Mortar on Your AWS Account. Bringing these together, you can run our award-winning platform against your EMR clusters in your AWS account anywhere on the planet.


Apache Spark 1.0 was released this week, and there are a number of posts about Spark, including one describing how eBay is starting to use it. This week is Hadoop Summit in San Jose, and there's some anticipation building, including two posts on the Hortonworks blog about Discardable Memory and Materialized Queries that will be presented at the summit. I'm sure there will be a lot of great presentations; please forward them my way as I won't be attending.

Technical
The SequenceIQ blog has a post about building a command-line tool for Apache Ambari using Spring Shell and the Ambari REST API. The code for ambari-shell is available on GitHub, there's a binary available as a Docker image, and ambari-shell is slated for inclusion in Ambari 1.6.1. The post has a tour of the tool's features.
http://blog.sequenceiq.com/blog/2014/05/26/ambari-shell/


Articles in this week's newsletter cover a couple of themes that have been emerging recently in the Hadoop ecosystem. First, Apache Storm continues to see adoption for production workloads (whereas I've yet to see many serious deployments of newer tools like Spark Streaming). Second, Hadoop in the cloud is starting to gain traction (and will likely accelerate as lightweight virtualization and the cloud price wars take off). There are a lot of good articles covering these topics and more in this week's issue.

Technical
kafka-storm-starter is a repository containing an example integration between Kafka and Storm for stream processing. It uses Avro for serialization, and the code base contains both Kafka and Storm standalone code examples, example unit tests, and example integration tests. The README has a lot of details on the implementation, on setting up a development environment, and much more.
https://github.com/miguno/kafka-storm-starter

May 23, 2014

K Young


People Love Redshift
People love Redshift because it nailed the tech-trifecta: it’s cheap, it delivers, and it’s available instantly with zero commitment.

If you’re not familiar with Redshift, it is AWS’s on-demand data warehouse. Data warehouses are for large-scale reporting and data analysis, and are crucial to most sizable businesses. Redshift’s competitors have excellent products, but they cost 10-100x more money, and sales and procurement take months and lock you in for years.

There’s Just One Problem: Loading
But as anyone who’s tried to use Redshift knows, there’s one glaring problem: it’s a huge pain to load your data into Redshift in the first place. Or it was before today.


Yahoo announced their support for Hive and Tez this week in the hotly contested SQL-on-Hadoop market. Meanwhile, there is an interesting overview of a real-world use-case at Allstate with Cloudera's SQL-on-Hadoop system Impala. There are also plenty of interesting technical articles and exciting announcements, including the public availability of Splice Machine's RDBMS-on-HBase product and a native implementation of MapReduce that's been open-sourced by Intel.

Technical
The Cloudera blog has an article about a change in the way that Oozie manages its shared library directory in HDFS. The changes add support for multiple versions of the directory, which fixes a race condition. The post explains the changes and the tooling around them.
http://blog.cloudera.com/blog/2014/05/how-to-use-the-sharelib-in-apache-oozie-cdh-5/
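The versioning mechanism amounts to timestamped copies of the sharelib directory, with the newest one selected when a job runs. A simplified Python sketch of that selection logic (the lib_&lt;timestamp&gt; naming follows the post; this is not Oozie's actual code):

```python
def latest_sharelib(dirnames):
    """Pick the newest sharelib directory from names like 'lib_20140508150358'.

    Because the timestamps are fixed-width and zero-padded, lexicographic
    order matches chronological order, so max() suffices.
    """
    candidates = [d for d in dirnames if d.startswith("lib_")]
    if not candidates:
        return None
    return max(candidates)

dirs = ["lib_20140101000000", "lib_20140508150358", "tmp"]
print(latest_sharelib(dirs))  # lib_20140508150358
```

Keeping older timestamped copies around is what avoids the race: jobs already running against a previous version keep their directory, while new jobs pick up the latest one.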
