This week’s issue features a lot of good technical content covering Apache Storm and Apache Spark. There are also a number of releases—Apache Flink, Apache Phoenix, Cloudera Enterprise, and Luigi. In addition, Hortonworks announced a technical preview of Apache Kafka support for HDP, and SequenceIQ unveiled Periscope, an open-source tool for YARN cluster auto-scaling.
This week’s edition has a lot of great technical content from prominent Hadoop vendors Hortonworks and Cloudera as well as newcomer SequenceIQ. There are also a couple of interesting articles based on real-world experience covering an A/B testing platform and Apache Zookeeper. Those types of articles tend to be quite good but more difficult to find—as always, if you have suggestions for the newsletter please send them my way!
Today we have a huge announcement: Mortar is now free for accounts with up to three users.
Our mission at Mortar is to help data scientists and data engineers spend 100% of their time on problems that are specific to their business—and not on time-wasters like babysitting infrastructure, managing complex deploys, and rebuilding common algorithms from scratch. But for us to succeed at our mission, we need to make Mortar not just an amazing product, but also affordable for everyone.
If you’ve used Hadoop, you know that the overhead of provisioning and running small jobs can be painful. Most likely you kill time every time you test something by grabbing coffee, and pretty soon your hands are shaking from all that caffeine.
It doesn’t have to be like this. As of today you can run small jobs from Mortar in seconds. How? Choose to execute your job without a cluster, and we’ll skip provisioning and distributed computation—so you can get answers fast.
The big news this week was the Apache Hadoop 2.5.0 release. There are also a number of interesting technical articles covering HDFS, Apache Drill, and several other ecosystem projects. Also, there’s an interesting post on profiling MapReduce jobs (which is typically quite challenging) with Riemann.
Open-source is key to everything we do at Mortar. Our award-winning platform would not be possible without Apache Hadoop or Pig, and it would not be as powerful without Lipstick (open-sourced by Netflix) or Luigi (Spotify).
So we’re always pleased when we can make a meaningful contribution back to the community by open-sourcing something of our own, such as when we extended Pig to work with Python. Now we’re adding more by open-sourcing our code for writing to DynamoDB.
We’re in the midst of a summer lull, so this week’s issue is shorter than usual. The lack of quantity is made up for in great quality, though. Technical posts cover YARN, HBase, Accumulo, and building an EMR-like local dev environment. There is also news on Actian, Adatao, Splice Machine, and the HP-Hortonworks strategic partnership. Hopefully there’s something for everyone!
The Hortonworks blog has a post on the ongoing work to improve the fault-tolerance of YARN’s ResourceManager (RM). This post describes phase two of the RM restart resiliency work, which aims to keep existing YARN applications running during and after an RM restart. The post covers the architecture of the solution, including which cluster state information is stored where.
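For context, RM recovery is driven by a handful of yarn-site.xml properties; here’s a minimal sketch assuming the ZooKeeper-backed state store (the ZooKeeper addresses are illustrative):

```xml
<!-- yarn-site.xml: enable RM restart with work-preserving recovery -->
<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Phase two: keep running applications alive across an RM restart -->
  <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.zk-address</name>
  <value>zk1:2181,zk2:2181,zk3:2181</value>
</property>
```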
Apache Ambari, which is cluster management software for Apache Hadoop, is a big topic this week. In addition to news that Pivotal and Hortonworks are teaming up to collaborate on Ambari, there is a string of technical articles on it. Another hot topic this week is Apache Spark—specifically its machine learning library, MLlib. Finally, it’s also worth noting that this week’s issue includes links to a few papers from the Proceedings of the VLDB Endowment.
The Databricks blog has a post showing how to write concise Spark code (using the Python library, PySpark) to solve a complicated problem. Specifically, it shows how to use the Alternating Least Squares (ALS) implementation in Spark’s machine learning library, MLlib, to build recommendations. The post also has some details on scaling the process across a 16-node cluster and how it compares to Apache Mahout’s implementation.
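To make the ALS idea concrete, here’s a minimal plain-NumPy sketch of the alternating least squares loop on a tiny hypothetical ratings matrix—this shows the core of the algorithm, not MLlib’s API or the post’s code:

```python
import numpy as np

# Hypothetical user-item rating matrix (0 = unobserved).
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 0, 5, 4],
], dtype=float)
mask = R > 0  # which entries are observed

k, lam, n_iters = 2, 0.1, 50  # latent rank, regularization, iterations
rng = np.random.default_rng(0)
U = rng.standard_normal((R.shape[0], k))  # user factors
V = rng.standard_normal((R.shape[1], k))  # item factors

for _ in range(n_iters):
    # Fix V; solve a regularized least-squares problem per user.
    for u in range(R.shape[0]):
        Vu = V[mask[u]]
        U[u] = np.linalg.solve(Vu.T @ Vu + lam * np.eye(k), Vu.T @ R[u, mask[u]])
    # Fix U; solve per item.
    for i in range(R.shape[1]):
        Ui = U[mask[:, i]]
        V[i] = np.linalg.solve(Ui.T @ Ui + lam * np.eye(k), Ui.T @ R[mask[:, i], i])

pred = U @ V.T  # predicted ratings, including the unobserved cells
```

MLlib parallelizes exactly these per-user and per-item solves across the cluster, which is why ALS scales well.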
Two large pieces of news this week: HP and Hortonworks announced a $50 million investment in Hortonworks as part of an expanded partnership, and Apache Tez graduated from the Apache Incubator. Additionally, there were a number of interesting technical posts this week on Pig, MapR FS, SQL on Hadoop, HDFS, and more.
The Hortonworks blog has a post highlighting some of the new features of the recently released Apache Pig 0.13. The 0.13 release adds preliminary support for multiple backends (i.e., backends other than MapReduce, such as Tez or Spark). The post talks about several new features, including new optimizations for small jobs, the ability to whitelist/blacklist certain operators, a user-level jar cache, and support for Apache Accumulo.
This week is full of releases and new products—ranging from Oracle’s new Hadoop-SQL product to a new CDH 5.1 release from Cloudera to new tools for transactions on HBase from Continuuity and deploying Hadoop-as-a-Service from SequenceIQ. There are also a number of quality technical articles covering Spark, Kafka, Luigi, and Hive.
This post covers using the Transformer class to manipulate data as it flows into Sqrrl Enterprise. It details loading the Enron email dataset and using a Transformer to build a graph of users sending email. It includes the code for this Transformer and also some examples of querying the dataset using tools found in Sqrrl Enterprise.
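The essence of that graph-building step can be sketched in a few lines of plain Python—hypothetical sender/recipient pairs, not Sqrrl’s actual Transformer API:

```python
from collections import Counter

# Hypothetical (sender, recipient) pairs parsed from Enron emails.
emails = [
    ("alice@enron.com", "bob@enron.com"),
    ("alice@enron.com", "carol@enron.com"),
    ("bob@enron.com", "alice@enron.com"),
    ("alice@enron.com", "bob@enron.com"),
]

# Weighted edge list: each directed edge carries the number of emails sent.
graph = Counter(emails)
```

Each key is a directed sender→recipient edge and the count is its weight—roughly the structure the Transformer emits into Sqrrl for querying.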
This week was fairly low-volume (at least in recent memory), but there are some good technical articles covering Hive, the Kite SDK, Oozie, and more. Also, the videos from HBaseCon were posted, and there were a number of ecosystem project releases.
The Pivotal blog has a post on setting up Pivotal HD, HAWQ (for data warehousing), and GemFire XD (an in-memory data grid) inside of VMs using Vagrant. The four-node virtual cluster is set up with a single command, and the blog has more info on the configuration and the tools installed as part of the setup.
Recently data strategist Max Shron of Polynumeral spoke to the NYC Data Science Meetup about techniques for improving focus, communication, and results in data science campaigns, including the CoNVO process detailed in his book Thinking with Data. It was a terrific and extremely practical talk, a video of which is embedded below. (If you’re pressed for time, you can use our time-stamped summary to skip to specific sections of the talk.)
Here’s the summary of Max’s talk, with video, slides, and the full transcript below:
I was expecting a dearth of content to match the short week in the US for July 4th. But with Spark Summit this week in San Francisco, there were a number of partnerships, new tools, and other announcements. Both Databricks and MapR announced influxes of cash this week, and there was a lot of discussion about the future of Hive given a joint announcement by Cloudera, Databricks, IBM, Intel, and MapR to build a new Spark backend for Hive. In addition to that, Apache Hadoop 2.4.1 was released, Apache Pig 0.13.0 was released, and Flambo, a new Clojure DSL for Spark, was unveiled.
Pivotal HD and HAWQ support the Parquet file format natively in HDFS. This tutorial shows how to build a Parquet-backed table with HAWQ and then access the data stored in HDFS using Apache Pig.
Google made news this week by proclaiming that MapReduce is dead at Google—there are two reactions in this week’s issue. And with that in mind, there are several good posts covering non-MapReduce projects in the Hadoop ecosystem—Accumulo, HDFS, Storm, Spark, and more. Apache Storm also released a new version this week, and there were announcements from Hortonworks, IBM, and RainStor about their Hadoop-related products.
Apache Accumulo, the distributed key-value store, supports bulk loading of data in its native format, RFile. Loading data as RFiles, which can be generated via MapReduce jobs, is much more efficient than loading the same data one record at a time. The Sqrrl blog talks about some tools they’ve built to load JSON- and CSV-formatted data using RFiles.
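The key requirement when generating RFiles is that key/value pairs be written in sorted key order. A toy Python sketch of that step for CSV input—hypothetical data and key layout, not Sqrrl’s tools:

```python
import csv
import io

# Hypothetical CSV input; in the real pipeline a MapReduce job does this at scale.
data = "user,action,count\nbob,click,3\nalice,view,7\nalice,click,2\n"

# Map each record to an Accumulo-style key: (row, column family, column qualifier).
records = csv.DictReader(io.StringIO(data))
kvs = [((r["user"], "event", r["action"]), r["count"]) for r in records]

# RFiles must be written in key order, so sort before writing.
kvs.sort()
```

In a MapReduce job this sort happens in the shuffle, so each reducer can write its partition of keys straight into an RFile ready for bulk import.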
Security in Hadoop is a big topic in this week’s issue—there’s coverage from Accumulo Summit, and posts from both Cloudera and Hortonworks on the topic. This week’s issue also covers technical posts on the Kiji Framework, Hadoop + Docker, and Etsy’s predictive modeling software, Conjecture.
Slides from the recent Accumulo Summit have been posted online. There are 17 presentations from folks at Cloudera, Hortonworks, Sqrrl, and more. Topics include Accumulo and YARN, the Accumulo community, and security for Accumulo.