This week’s issue includes a number of posts covering the recently released Apache Spark 1.1, Apache Drill 0.5.0-incubating, and Apache Tez 0.5.0. In addition, there’s a look at Hadoop in the healthcare industry, a look at ORCFile for non-Hive workloads, instructions for building a Hadoop setup on a Mac, and more. The amount of content this week shows that we’re past the summer lull, and I expect to see lots more great content this fall.
There were several releases in the Hadoop ecosystem this week, including Apache Hadoop 2.5.1 and Apache Spark 1.1.0. There’s a lot of interesting technical content, including testing HBase’s consistency with Jepsen and an in-depth look at an end-to-end big data infrastructure with Hadoop. On that note, there’s an interesting look into the growing demand for Data Engineers to build out Hadoop infrastructure.
At Mortar, we provide all our users with excellent security. And that includes customers working with sensitive data—lately, we’ve been hearing from more and more of them.
So we are pleased to announce that we now offer an Advanced Security package for customers with strict security compliance requirements or with unique business demands that require additional data protection.
While last week’s issue had posts covering a few common themes, this week’s issue has content on a wide range of topics, including Spork (Pig on Spark), Hive (specifically the new Stinger.next initiative), and Presto. There is also some interesting news from established enterprise companies: Teradata has acquired Think Big Analytics, and Cisco has released management and monitoring software for Hadoop.
This week’s issue features a lot of good technical content covering Apache Storm and Apache Spark. There are also a number of releases—Apache Flink, Apache Phoenix, Cloudera Enterprise, and Luigi. In addition, Hortonworks announced a technical preview of Apache Kafka support for HDP, and SequenceIQ unveiled Periscope, an open-source tool for YARN cluster auto-scaling.
This week’s edition has a lot of great technical content from prominent Hadoop vendors Hortonworks and Cloudera as well as newcomer SequenceIQ. There are also a couple of interesting articles based on real-world experience covering an A/B testing platform and Apache ZooKeeper. Those types of articles tend to be quite good but are more difficult to find. As always, if you have suggestions for the newsletter, please send them my way!
Today we have a huge announcement: Mortar is now free for accounts with up to three users.
Our mission at Mortar is to help data scientists and data engineers spend 100% of their time on problems that are specific to their business—and not on time-wasters like babysitting infrastructure, managing complex deploys, and rebuilding common algorithms from scratch. But for us to succeed at our mission, we need to make Mortar not just an amazing product, but also affordable for everyone.
If you’ve used Hadoop, you know that the overhead time necessary to provision and run small jobs can be painful. Most likely you kill time every time you test something by grabbing coffee, and pretty soon your hands are shaking from all that testing.
It doesn’t have to be like this. As of today you can run small jobs from Mortar in seconds. How? Choose to execute your job without a cluster, and we’ll skip provisioning and distributed computation—so you can get answers fast.
The big news this week was the Apache Hadoop 2.5.0 release. There are also a number of interesting technical articles covering Apache Hadoop HDFS, Apache Drill, and several other ecosystem projects. Also, there’s an interesting post on profiling MapReduce jobs (which is typically quite challenging) with Riemann.
Open-source is key to everything we do at Mortar. Our award-winning platform would not be possible without Apache Hadoop or Pig, and it would not be as powerful without Lipstick (open-sourced by Netflix) or Luigi (Spotify).
So we’re always pleased when we can make a meaningful contribution back to the community by open-sourcing something of our own, such as when we extended Pig to work with Python. Now we’re adding more by open-sourcing our code for writing to DynamoDB.
We’re in the midst of a summer lull, so this week’s issue is shorter than usual. The lack of quantity is made up for in great quality, though. Technical posts cover YARN, HBase, Accumulo, and building an EMR-like local dev environment. There is also news on Actian, Adatao, Splice Machine, and the HP-Hortonworks strategic partnership. Hopefully there’s something for everyone!
The Hortonworks blog has a post on the ongoing work to improve the fault-tolerance of YARN’s ResourceManager (RM). This post describes phase two of the RM restart resiliency work, which aims to keep existing YARN applications running during and after an RM reboot. The post covers the architecture of the solution, including which cluster state information is stored where.
Apache Ambari, which is cluster management software for Apache Hadoop, is a big topic this week. In addition to news that Pivotal and Hortonworks are teaming up to collaborate on Ambari, there is a string of technical articles on it. Another hot topic this week is Apache Spark, specifically its machine learning library, MLlib. Finally, it’s also worth noting that this week’s issue includes links to a few papers from the Proceedings of the VLDB Endowment.
The Databricks blog has a post showing how to write concise Spark code (using the Python library, PySpark) to solve a complicated problem. Specifically, it shows how to use the Alternating Least Squares implementation in Spark’s machine learning library, MLlib, to build recommendations. The post also has some details on scaling the process across a 16-node cluster and how it compares to Apache Mahout’s implementation.
Two large pieces of news this week: HP and Hortonworks announced a $50 million investment in Hortonworks as part of an expanded partnership, and Apache Tez graduated from the Apache Incubator. Additionally, there were a number of interesting technical posts this week on Pig, MapR FS, SQL on Hadoop, HDFS, and more.
The Hortonworks blog has a post highlighting some of the new features of the recently released Apache Pig 0.13. The 0.13 release adds preliminary support for multiple backends (i.e., something other than MapReduce, such as Tez or Spark). The post talks about several new features, including new optimizations for small jobs, the ability to whitelist/blacklist certain operators, a user-level jar cache, and support for Apache Accumulo.
This week is full of releases and new products—ranging from Oracle’s new Hadoop-SQL product to a new CDH 5.1 release from Cloudera to new tools for transactions on HBase from Continuuity and deploying Hadoop-as-a-Service from SequenceIQ. There are also a number of quality technical articles covering Spark, Kafka, Luigi, and Hive.
This post covers using the Transformer class to manipulate data as it flows into Sqrrl Enterprise. It details loading the Enron email dataset and using a Transformer to build a graph of users sending email. It includes the code for this Transformer and also some examples of querying the dataset using tools found in Sqrrl Enterprise.
This week was fairly low-volume (at least in recent memory), but there are some good technical articles covering Hive, the Kite SDK, Oozie, and more. Also, the videos from HBaseCon were posted, and there were a number of ecosystem project releases.
The Pivotal blog has a post on setting up Pivotal HD, HAWQ (for data warehousing), and GemFire XD (for in-memory data grid) inside VMs using Vagrant. The four-node virtual cluster is set up with a single command, and the blog has more info on the configuration and the tools installed as part of the setup.