<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" version="2.0"><channel><atom:link rel="hub" href="http://tumblr.superfeedr.com/" xmlns:atom="http://www.w3.org/2005/Atom"/><description>Hadoop, pig, python, big data, data development, startups</description><title>Mortar Data Blog</title><generator>Tumblr (3.0; @mortardata)</generator><link>http://blog.mortardata.com/</link><item><title>Hadoop Weekly - May 20, 2013</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;span&gt;&lt;/span&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;&lt;em&gt;Hadoop Weekly is a recurring guest post by &lt;a href="http://www.crobak.org/" target="_blank"&gt;Joe Crobak&lt;/a&gt;&lt;/em&gt;&lt;em&gt;. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at &lt;a href="https://twitter.com/joecrobak" target="_blank"&gt;@joecrobak&lt;/a&gt;.&lt;/em&gt;&lt;br/&gt;&lt;br/&gt; Both Apache Hadoop and Apache Hive had new releases this week, and there are a number of interesting technical articles covering YARN, NFS access to HDFS, and Apache Flume. With so much happening so quickly in the Hadoop ecosystem, it can be difficult to keep up &amp;#8212; so please let me know if I missed anything, and I&amp;#8217;ll include it next week.&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Technical&lt;/strong&gt;&lt;br/&gt; Apache HDFS is getting support for the Network File System (NFS) protocol. This is an exciting new feature, and one of the authors working on it details the what, why, how, and when of Hadoop&amp;#8217;s NFS support, which is being developed in trunk.&lt;br/&gt;&lt;a href="http://hortonworks.com/blog/simplifying-data-management-nfs-access-to-hdfs/" target="_blank"&gt;&lt;a href="http://hortonworks.com/blog/simplifying-data-management-nfs-access-to-hdfs/" target="_blank"&gt;http://hortonworks.com/blog/simplifying-data-management-nfs-access-to-hdfs/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt; &lt;!-- more --&gt;&lt;br/&gt; Cloudera&amp;#8217;s blog has the second in their &amp;#8220;meet the founders&amp;#8221; series. This post features Roman Shaposhnik who founded and works on Apache BigTop.  
Aside from having one of the best names of projects in the Hadoop ecosystem, BigTop is beginning to have a lot of influence in making sure that components in the stack are compatible with one another when released.&lt;br/&gt;&lt;a href="http://blog.cloudera.com/blog/2013/05/meet-the-project-founder-roman-shaposhnik/" target="_blank"&gt;&lt;a href="http://blog.cloudera.com/blog/2013/05/meet-the-project-founder-roman-shaposhnik/" target="_blank"&gt;http://blog.cloudera.com/blog/2013/05/meet-the-project-founder-roman-shaposhnik/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; If you&amp;#8217;ve ever tried to put together a patch for Hadoop, it can be very intimidating (and a slow process) just to configure your development environment. This post provides an overview of setting up Eclipse for developing Hadoop &amp;#8212; covering all the major versions and flavors under development.&lt;br/&gt;&lt;a href="http://blog.cloudera.com/blog/2013/05/how-to-configure-eclipse-for-hadoop-contributions/" target="_blank"&gt;&lt;a href="http://blog.cloudera.com/blog/2013/05/how-to-configure-eclipse-for-hadoop-contributions/" target="_blank"&gt;http://blog.cloudera.com/blog/2013/05/how-to-configure-eclipse-for-hadoop-contributions/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; You might remember the &amp;#8220;stinger initiative&amp;#8221; which was introduced a while back by Hortonworks with the goal of making Hive 100x faster. 
With the release of Hive 0.11 (more below), they summarize some of the work that&amp;#8217;s already been done towards this goal, as well as some of the new features in Hive 0.11 (such as RANK and other analytical functions).&lt;br/&gt;&lt;a href="http://hortonworks.com/blog/apache-hive-0-11-stinger-phase-1-delivered/" target="_blank"&gt;&lt;a href="http://hortonworks.com/blog/apache-hive-0-11-stinger-phase-1-delivered/" target="_blank"&gt;http://hortonworks.com/blog/apache-hive-0-11-stinger-phase-1-delivered/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; The Manning Early Access Program (MEAP) is now available for the new book, &amp;#8220;Pig in Action&amp;#8221;, by M. Tim Jones. With MEAP, you pre-order the book but get access to the content as the author is writing and uploading it.&lt;br/&gt;&lt;a href="http://www.manning.com/tjones/" target="_blank"&gt;&lt;a href="http://www.manning.com/tjones/" target="_blank"&gt;http://www.manning.com/tjones/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Apache Flume is a system for transferring data from application servers or other event-generators to HDFS or HBase. In this post, the author gives an overview of the Flume architecture &amp;#8212; both at the component and system scale.&lt;br/&gt;&lt;a href="http://www.drdobbs.com/database/acquiring-big-data-using-apache-flume/240155029" target="_blank"&gt;&lt;a href="http://www.drdobbs.com/database/acquiring-big-data-using-apache-flume/240155029" target="_blank"&gt;http://www.drdobbs.com/database/acquiring-big-data-using-apache-flume/240155029&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; The Natural Language Toolkit (NLTK) is a set of Python libraries for natural language processing. 
This post describes how to tie them to Hadoop MapReduce for parallel processing using Hadoop Streaming.&lt;br/&gt;&lt;a href="http://datacommunitydc.org/blog/2013/05/nltk-hadoop/" target="_blank"&gt;&lt;a href="http://datacommunitydc.org/blog/2013/05/nltk-hadoop/" target="_blank"&gt;http://datacommunitydc.org/blog/2013/05/nltk-hadoop/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Arun C. Murthy, one of the leads on the Apache Hadoop YARN project, gives an update on the progress of the project plus background on what YARN can enable. In particular, YARN turns Hadoop into a multi-application system, allowing more than just MapReduce to run on Hadoop. Arun highlights that we&amp;#8217;ll be able to run SQL in Hadoop rather than SQL on Hadoop (via MapReduce).&lt;br/&gt;&lt;a href="http://hortonworks.com/blog/moving-hadoop-beyond-batch-with-apache-yarn/" target="_blank"&gt;&lt;a href="http://hortonworks.com/blog/moving-hadoop-beyond-batch-with-apache-yarn/" target="_blank"&gt;http://hortonworks.com/blog/moving-hadoop-beyond-batch-with-apache-yarn/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Hortonworks has compiled a list of links for the Hadoop on Windows developer. These include the .NET SDK, the Microsoft Hive ODBC driver, and HDInsight&amp;#8217;s Preview (Hadoop on Azure).&lt;br/&gt;&lt;a href="http://hortonworks.com/blog/hadoop-sdk-and-tutorials-for-microsoft-net-developers/" target="_blank"&gt;&lt;a href="http://hortonworks.com/blog/hadoop-sdk-and-tutorials-for-microsoft-net-developers/" target="_blank"&gt;http://hortonworks.com/blog/hadoop-sdk-and-tutorials-for-microsoft-net-developers/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Platfora&amp;#8217;s product combines a UI and a low-latency data store to do interactive analysis on data stored in S3 or HDFS. If the data isn&amp;#8217;t already in Platfora&amp;#8217;s store, the system can generate a MapReduce job to load the data.  
This article gives a good overview of how all of the technology components in the Platfora system work together.&lt;br/&gt;&lt;a href="http://cloudcomputing.sys-con.com/node/2663726" target="_blank"&gt;&lt;a href="http://cloudcomputing.sys-con.com/node/2663726" target="_blank"&gt;http://cloudcomputing.sys-con.com/node/2663726&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; HUE provides a UI for interacting with Hadoop, Hive, Pig, and more. This post describes how to leverage HUE&amp;#8217;s python API to execute queries against Hive (via HiveServer2) or Impala (which must implement the same Thrift API).&lt;br/&gt;&lt;a href="http://gethue.tumblr.com/post/49882746559/executing-hive-or-impala-queries-with-python" target="_blank"&gt;&lt;a href="http://gethue.tumblr.com/post/49882746559/executing-hive-or-impala-queries-with-python" target="_blank"&gt;http://gethue.tumblr.com/post/49882746559/executing-hive-or-impala-queries-with-python&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Storm is sometimes called the real-time version of MapReduce. With a lot of interest in getting Storm running on YARN, now&amp;#8217;s a good time to get familiar with the system. The inaugural London Storm Meetup featured an overview of Storm as well as a discussion of the presenter&amp;#8217;s use-case. 
This post has a summary of the event, including links out to the presentation and code examples.&lt;br/&gt;&lt;a href="http://partners.peerindex.com/london-storm-meetup-big-data/" target="_blank"&gt;&lt;a href="http://partners.peerindex.com/london-storm-meetup-big-data/" target="_blank"&gt;http://partners.peerindex.com/london-storm-meetup-big-data/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;News&lt;/strong&gt;&lt;br/&gt; Contexti and MapR have joined forces to provide training, consulting, and professional services for MapR&amp;#8217;s distribution in Asia-Pacific.&lt;br/&gt;&lt;a href="http://www.businesswire.com/news/home/20130512005046/en/Contexti-Expands-Hadoop-NoSQL-Portfolio-MapR" target="_blank"&gt;&lt;a href="http://www.businesswire.com/news/home/20130512005046/en/Contexti-Expands-Hadoop-NoSQL-Portfolio-MapR" target="_blank"&gt;http://www.businesswire.com/news/home/20130512005046/en/Contexti-Expands-Hadoop-NoSQL-Portfolio-MapR&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Drawn to Scale, a SQL-on-Hadoop vendor, has announced that they&amp;#8217;re closing their doors. They had an interesting system, which is built to be performant on many types of SQL operations, and they even had a compatibility layer for MongoDB. 
It should be interesting to see what happens to that team and their technology.&lt;br/&gt;&lt;a href="http://www.roadtofailure.com/?p=11" target="_blank"&gt;&lt;a href="http://www.roadtofailure.com/?p=11" target="_blank"&gt;http://www.roadtofailure.com/?p=11&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Concurrent and MapR announced that Concurrent&amp;#8217;s Cascading framework is now certified to run on MapR&amp;#8217;s distribution.&lt;br/&gt;&lt;a href="http://www.concurrentinc.com/posts/2013/05/15/concurrent-inc-partners-with-mapr-technologies-to-drive-mass-enterprise-hadoop-adoption/" target="_blank"&gt;&lt;a href="http://www.concurrentinc.com/posts/2013/05/15/concurrent-inc-partners-with-mapr-technologies-to-drive-mass-enterprise-hadoop-adoption/" target="_blank"&gt;http://www.concurrentinc.com/posts/2013/05/15/concurrent-inc-partners-with-mapr-technologies-to-drive-mass-enterprise-hadoop-adoption/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt; Releases&lt;/strong&gt;&lt;br/&gt; Hadoop 1.2.0 featuring DistCP v2 backport, web services for the JobTracker, the offline image viewer, and a bunch of other enhancements.&lt;br/&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/hadoop-general/201305.mbox/%3CCA%2Bz3%2B9Er-fx6XwZ%3DrefL1aa70qSKKREhBc3Rz0XP3aSOhaVh6w%40mail.gmail.com%3E" target="_blank"&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/hadoop-general/201305.mbox/%3CCA%2Bz3%2B9Er-fx6XwZ%3DrefL1aa70qSKKREhBc3Rz0XP3aSOhaVh6w%40mail.gmail.com%3E" target="_blank"&gt;http://mail-archives.apache.org/mod_mbox/hadoop-general/201305.mbox/%3CCA%2Bz3%2B9Er-fx6XwZ%3DrefL1aa70qSKKREhBc3Rz0XP3aSOhaVh6w%40mail.gmail.com%3E&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; WibiData announced Albacore/BentoBox v1.0.4. 
This version has some new features, including a whole new component &amp;#8212; KijiREST, which provides a REST interface to KijiSchema.&lt;br/&gt;&lt;a href="http://www.kiji.org/announcing-the-albacore-bentobox-v1.0.4/" target="_blank"&gt;&lt;a href="http://www.kiji.org/announcing-the-albacore-bentobox-v1.0.4/" target="_blank"&gt;http://www.kiji.org/announcing-the-albacore-bentobox-v1.0.4/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Hive 0.11 was released with over 350 Jira issues closed. This is the first release since HCatalog was integrated as a subproject of Hive, and it has a bunch of new features such as HiveServer2, ORCFile, and analytics functions.&lt;br/&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/hive-user/201305.mbox/%3CCAHfHakGC9pkVV5V_oZBB4kzB_nQ2RRibZ9cqSGPMTp2Qe3%2BABw%40mail.gmail.com%3E" target="_blank"&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/hive-user/201305.mbox/%3CCAHfHakGC9pkVV5V_oZBB4kzB_nQ2RRibZ9cqSGPMTp2Qe3%2BABw%40mail.gmail.com%3E" target="_blank"&gt;http://mail-archives.apache.org/mod_mbox/hive-user/201305.mbox/%3CCAHfHakGC9pkVV5V_oZBB4kzB_nQ2RRibZ9cqSGPMTp2Qe3%2BABw%40mail.gmail.com%3E&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Talend&amp;#8217;s Open Studio was updated to version 5.3 a few weeks ago. 
This post has a quick overview of the new features, which include a new integration with Apache Pig, as well as support for Amazon&amp;#8217;s Elastic MapReduce and RedShift.&lt;br/&gt;&lt;a href="http://www.h-online.com/open/news/item/Talend-5-3-focused-on-Hadoop-usability-1864844.html" target="_blank"&gt;&lt;a href="http://www.h-online.com/open/news/item/Talend-5-3-focused-on-Hadoop-usability-1864844.html" target="_blank"&gt;http://www.h-online.com/open/news/item/Talend-5-3-focused-on-Hadoop-usability-1864844.html&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt; Events (curated by Mortar Data)&lt;/strong&gt;&lt;br/&gt; Monday, May 20&lt;br/&gt; MySQL to Cassandra: Big Data, High Scale, Data Migration&amp;#8230; Oh My! (New York, NY)&lt;br/&gt;&lt;a href="http://www.meetup.com/mysqlnyc/events/114879742/" target="_blank"&gt;&lt;a href="http://www.meetup.com/mysqlnyc/events/114879742/" target="_blank"&gt;http://www.meetup.com/mysqlnyc/events/114879742/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Monday, May 20&lt;br/&gt; Automating the Hadoop Stack (Los Angeles, CA)&lt;br/&gt;&lt;a href="http://www.meetup.com/LA-HUG/events/117428702/" target="_blank"&gt;&lt;a href="http://www.meetup.com/LA-HUG/events/117428702/" target="_blank"&gt;http://www.meetup.com/LA-HUG/events/117428702/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Tuesday, May 21&lt;br/&gt; Data &amp;amp; Drinks - Member Networking Meetup (New York, NY)&lt;br/&gt;&lt;a href="http://www.meetup.com/Analytics-and-Data-in-Financial-Services/events/112520472/" target="_blank"&gt;&lt;a href="http://www.meetup.com/Analytics-and-Data-in-Financial-Services/events/112520472/" target="_blank"&gt;http://www.meetup.com/Analytics-and-Data-in-Financial-Services/events/112520472/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Tuesday, May 21&lt;br/&gt; Recommendation Engines &amp;amp; Accumulo (Denver, CO)&lt;br/&gt;&lt;a href="http://www.meetup.com/Data-Science-Business-Analytics/events/116790372/" target="_blank"&gt;&lt;a 
href="http://www.meetup.com/Data-Science-Business-Analytics/events/116790372/" target="_blank"&gt;http://www.meetup.com/Data-Science-Business-Analytics/events/116790372/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Tuesday, May 21&lt;br/&gt; Thoughts on machine learning (New York, NY)&lt;br/&gt;&lt;a href="http://www.meetup.com/NYC-Machine-Learning/events/119204802/" target="_blank"&gt;&lt;a href="http://www.meetup.com/NYC-Machine-Learning/events/119204802/" target="_blank"&gt;http://www.meetup.com/NYC-Machine-Learning/events/119204802/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Tuesday, May 21&lt;br/&gt; How we use Scala on Hadoop @ eBay (New York, NY)&lt;br/&gt;&lt;a href="http://www.meetup.com/ny-scala/events/113168872/" target="_blank"&gt;&lt;a href="http://www.meetup.com/ny-scala/events/113168872/" target="_blank"&gt;http://www.meetup.com/ny-scala/events/113168872/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Wednesday, May 22&lt;br/&gt; Cloudera Impala: An Open Source Real-Time Query Engine for Apache Hadoop (Boulder, CO)&lt;br/&gt;&lt;a href="http://www.meetup.com/Boulder-Denver-Big-Data/events/114501572/" target="_blank"&gt;&lt;a href="http://www.meetup.com/Boulder-Denver-Big-Data/events/114501572/" target="_blank"&gt;http://www.meetup.com/Boulder-Denver-Big-Data/events/114501572/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Wednesday, May 22&lt;br/&gt; Big Data, NoSQL, Now What? 
(New York, NY)&lt;br/&gt;&lt;a href="http://www.meetup.com/mysqlnyc/events/114883642/" target="_blank"&gt;&lt;a href="http://www.meetup.com/mysqlnyc/events/114883642/" target="_blank"&gt;http://www.meetup.com/mysqlnyc/events/114883642/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Saturday, May 25&lt;br/&gt; Big Data Science Meetup Event (Fremont, CA)&lt;br/&gt;&lt;a href="http://www.meetup.com/Big-Data-Science/events/71084702/" target="_blank"&gt;&lt;a href="http://www.meetup.com/Big-Data-Science/events/71084702/" target="_blank"&gt;http://www.meetup.com/Big-Data-Science/events/71084702/&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/50904582262</link><guid>http://blog.mortardata.com/post/50904582262</guid><pubDate>Mon, 20 May 2013 09:10:07 -0400</pubDate><category>Hadoop Weekly</category><category>hadoop</category></item><item><title>Hadoop Weekly - May 13, 2013</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;span&gt;&lt;/span&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;&lt;em&gt;Hadoop Weekly is a recurring guest post by &lt;a href="http://www.crobak.org/" target="_blank"&gt;Joe Crobak&lt;/a&gt;&lt;/em&gt;&lt;em&gt;. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at &lt;a href="https://twitter.com/joecrobak" target="_blank"&gt;@joecrobak&lt;/a&gt;.&lt;/em&gt;&lt;br/&gt;&lt;br/&gt; This week&amp;#8217;s newsletter is a little lighter than normal in technical news (some fascinating articles, though!), but there are quite a few interesting releases and upcoming events. Hope you enjoy, and please let me know if you find anything that I missed! Also, thanks to everyone who has been spreading the word about this newsletter &amp;#8212; the number of new subscribers each week has been really encouraging.&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Technical&lt;/strong&gt;&lt;br/&gt; LinkedIn has open-sourced a number of big data projects built on or to coexist with Hadoop. In celebration of LinkedIn&amp;#8217;s 10th anniversary, this post covers 10 of those projects (such as Voldemort and DataFu), including a brief overview of each.&lt;br/&gt;&lt;a href="http://www.hadoopsphere.com/2013/05/hadoops-10-in-linkedins-10.html" target="_blank"&gt;&lt;a href="http://www.hadoopsphere.com/2013/05/hadoops-10-in-linkedins-10.html" target="_blank"&gt;http://www.hadoopsphere.com/2013/05/hadoops-10-in-linkedins-10.html&lt;/a&gt;&lt;/a&gt;&lt;br/&gt; &lt;!-- more --&gt;&lt;br/&gt; Following the release of version 0.2.0 of the Cloudera Development Kit, the Cloudera blog has a new post with an overview of the project, an FAQ, and a list of future plans. They plan on having monthly releases and focusing on documentation in addition to software libraries. 
It should be an interesting project to watch.&lt;br/&gt;&lt;a href="http://blog.cloudera.com/blog/2013/05/cloudera-development-kit-cdk/" target="_blank"&gt;&lt;a href="http://blog.cloudera.com/blog/2013/05/cloudera-development-kit-cdk/" target="_blank"&gt;http://blog.cloudera.com/blog/2013/05/cloudera-development-kit-cdk/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; An overview of eBay&amp;#8217;s data warehouse, which ingests as much as 100TB/day and stores over 90PB. To power internal analytics, they use a combination of Hadoop, Teradata, and a custom-built system as data stores, plus front-end tools such as Tableau, Excel, and MicroStrategy.&lt;br/&gt;&lt;a href="http://www.itnews.com.au/News/342615,inside-ebay8217s-90pb-data-warehouse.aspx" target="_blank"&gt;&lt;a href="http://www.itnews.com.au/News/342615,inside-ebay8217s-90pb-data-warehouse.aspx" target="_blank"&gt;http://www.itnews.com.au/News/342615,inside-ebay8217s-90pb-data-warehouse.aspx&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; RCFile is a columnar format that&amp;#8217;s part of the Hive project. This post describes the motivation for RCFile as well as the benefits. In a follow-up post, the author will talk about the successor to RCFile &amp;#8212; ORCFile, which has similar features to the Parquet format.&lt;br/&gt;&lt;a href="http://www.bigdatarepublic.com/author.asp?section_id=2840&amp;amp;doc_id=262756" target="_blank"&gt;&lt;a href="http://www.bigdatarepublic.com/author.asp?section_id=2840&amp;amp;doc_id=262756" target="_blank"&gt;http://www.bigdatarepublic.com/author.asp?section_id=2840&amp;amp;doc_id=262756&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; An honest review of a new Hadoop book, the &amp;#8220;Hadoop Beginner&amp;#8217;s Guide&amp;#8221;, covers both the good and the bad in the book. 
Overall, the review is positive but notes that there are a few technical issues that could be improved.&lt;br/&gt;&lt;a href="http://architects.dzone.com/articles/review-hadoop-beginners-guide" target="_blank"&gt;&lt;a href="http://architects.dzone.com/articles/review-hadoop-beginners-guide" target="_blank"&gt;http://architects.dzone.com/articles/review-hadoop-beginners-guide&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Russell Jurney, the author of Agile Data, has posted slides to accompany his book. The slides cover a number of principles for Agile big data as well as a bunch of example code covering everything from data analysis with Pig to visualization with Bootstrap and D3.&lt;br/&gt;&lt;a href="http://www.slideshare.net/rjurney/agile-analytics-applications-on-hadoop-20839095" target="_blank"&gt;&lt;a href="http://www.slideshare.net/rjurney/agile-analytics-applications-on-hadoop-20839095" target="_blank"&gt;http://www.slideshare.net/rjurney/agile-analytics-applications-on-hadoop-20839095&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; When an HBase RegionServer fails, it can take a few seconds or minutes for the regions owned by that RegionServer to recover. Reducing this time, known as the Mean Time to Recover (MTTR), has been the subject of a lot of work on both the HBase and HDFS projects. This post has a good overview of the technical challenges and their solutions.&lt;br/&gt;&lt;a href="http://hortonworks.com/blog/introduction-to-hbase-mean-time-to-recover-mttr/" target="_blank"&gt;&lt;a href="http://hortonworks.com/blog/introduction-to-hbase-mean-time-to-recover-mttr/" target="_blank"&gt;http://hortonworks.com/blog/introduction-to-hbase-mean-time-to-recover-mttr/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt; Releases&lt;/strong&gt;&lt;br/&gt; Snakebite is a new python project that makes use of protocol buffers to talk to HDFS without going through the JVM. 
In addition to an API, it supplies a command line utility with similar functionality to &amp;#8220;hadoop fs&amp;#8221; (without the JVM startup overhead, it&amp;#8217;s a lot faster) and a script to start up a mini HDFS cluster, which it uses for testing.&lt;br/&gt;&lt;a href="http://labs.spotify.com/2013/05/07/snakebite/" target="_blank"&gt;&lt;a href="http://labs.spotify.com/2013/05/07/snakebite/" target="_blank"&gt;http://labs.spotify.com/2013/05/07/snakebite/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Azkaban 2.1 was released. This is the first point release to the Hadoop workflow management software since the version 2 rewrite. It has a bunch of new features, like JMX support, auto-retries, and SLA-aware notifications.&lt;br/&gt;&lt;a href="https://groups.google.com/d/msg/azkaban-dev/WN5LWbqtsxE/PhUmmcO2lZIJ" target="_blank"&gt;&lt;a href="https://groups.google.com/d/msg/azkaban-dev/WN5LWbqtsxE/PhUmmcO2lZIJ" target="_blank"&gt;https://groups.google.com/d/msg/azkaban-dev/WN5LWbqtsxE/PhUmmcO2lZIJ&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; DataStax has made their ODBC driver for Hadoop/Hive free and has announced a new ODBC driver for Cassandra, currently in beta.&lt;br/&gt;&lt;a href="http://www.datastax.com/dev/blog/free-odbc-drivers-for-cassandra-and-hadoop-now-available" target="_blank"&gt;&lt;a href="http://www.datastax.com/dev/blog/free-odbc-drivers-for-cassandra-and-hadoop-now-available" target="_blank"&gt;http://www.datastax.com/dev/blog/free-odbc-drivers-for-cassandra-and-hadoop-now-available&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Apache Giraph, the computation framework for bulk synchronous parallel programming (often used for network graph algorithms), had its version 1.0 release, the first since graduating from the Apache Incubator. 
This release has a bunch of features, including support for running within YARN, support for accessing Hive tables, and improved performance and memory efficiency.&lt;br/&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/giraph-user/201305.mbox/%3C51888B3D.5040100@apache.org%3E" target="_blank"&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/giraph-user/201305.mbox/%3C51888B3D.5040100@apache.org%3E" target="_blank"&gt;http://mail-archives.apache.org/mod_mbox/giraph-user/201305.mbox/%3C51888B3D.5040100@apache.org%3E&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Apache Curator (incubating) is a set of Java libraries for Apache ZooKeeper. The project was originally started at, and open sourced by, Netflix. This week they had their 2.0.0-incubating release, the first since joining the Apache Incubator.&lt;br/&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/incubator-curator-user/201305.mbox/%3CEDA26D30-171C-481F-B740-4284A1C3B417%40jordanzimmerman.com%3E" target="_blank"&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/incubator-curator-user/201305.mbox/%3CEDA26D30-171C-481F-B740-4284A1C3B417%40jordanzimmerman.com%3E" target="_blank"&gt;http://mail-archives.apache.org/mod_mbox/incubator-curator-user/201305.mbox/%3CEDA26D30-171C-481F-B740-4284A1C3B417%40jordanzimmerman.com%3E&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; GigaOm notes that IBM is also an entrant into the SQL-on-Hadoop field, with a preview open to a limited number of participants.&lt;br/&gt;&lt;a href="http://gigaom.com/2013/05/06/look-ibm-is-doing-sql-on-hadoop-too/" target="_blank"&gt;&lt;a href="http://gigaom.com/2013/05/06/look-ibm-is-doing-sql-on-hadoop-too/" target="_blank"&gt;http://gigaom.com/2013/05/06/look-ibm-is-doing-sql-on-hadoop-too/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Events (curated by Mortar Data)&lt;/strong&gt;&lt;br/&gt; Monday, May 13&lt;br/&gt; Hackathon: Develop/MapReduce Your Dream Predictive Analytics/BigData App (Cambridge, 
MA)&lt;br/&gt;&lt;a href="http://www.meetup.com/Predictive-Analytics/events/109954442/" target="_blank"&gt;&lt;a href="http://www.meetup.com/Predictive-Analytics/events/109954442/" target="_blank"&gt;http://www.meetup.com/Predictive-Analytics/events/109954442/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Tuesday, May 14&lt;br/&gt; Storm, Brickhouse, ElasticSearch (San Francisco, CA)&lt;br/&gt;&lt;a href="http://www.meetup.com/Real-time-Big-Data/events/113867452/" target="_blank"&gt;&lt;a href="http://www.meetup.com/Real-time-Big-Data/events/113867452/" target="_blank"&gt;http://www.meetup.com/Real-time-Big-Data/events/113867452/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Tuesday, May 14&lt;br/&gt; Big Data is Not About The Data! (Chestnut Hill, MA)&lt;br/&gt;&lt;a href="http://www.meetup.com/intelligence/events/110894352/" target="_blank"&gt;&lt;a href="http://www.meetup.com/intelligence/events/110894352/" target="_blank"&gt;http://www.meetup.com/intelligence/events/110894352/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Tuesday, May 14&lt;br/&gt; Smarter Big Data Integration for Hadoop (London, UK)&lt;br/&gt;&lt;a href="http://www.meetup.com/hadoop-users-group-uk/events/115450432/" target="_blank"&gt;&lt;a href="http://www.meetup.com/hadoop-users-group-uk/events/115450432/" target="_blank"&gt;http://www.meetup.com/hadoop-users-group-uk/events/115450432/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Tuesday, May 14&lt;br/&gt; Introduction to Distributed Search using Cassandra with Solr and Lightning Talks (Columbia, MD)&lt;br/&gt;&lt;a href="http://www.meetup.com/Hadoop-DC/events/117691352/" target="_blank"&gt;&lt;a href="http://www.meetup.com/Hadoop-DC/events/117691352/" target="_blank"&gt;http://www.meetup.com/Hadoop-DC/events/117691352/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Wednesday, May 15&lt;br/&gt; R and Hadoop (Free Webinar) (Internet)&lt;br/&gt;&lt;a href="http://www.meetup.com/OC-HUG/events/113336802/" target="_blank"&gt;&lt;a 
href="http://www.meetup.com/OC-HUG/events/113336802/" target="_blank"&gt;http://www.meetup.com/OC-HUG/events/113336802/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Wednesday, May 15&lt;br/&gt; 37th Bay Area Hadoop User Group (HUG) Monthly Meetup (Sunnyvale, CA)&lt;br/&gt;&lt;a href="http://www.meetup.com/hadoop/events/69997352/" target="_blank"&gt;&lt;a href="http://www.meetup.com/hadoop/events/69997352/" target="_blank"&gt;http://www.meetup.com/hadoop/events/69997352/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Thursday, May 16&lt;br/&gt; Hackathon for Hadoop Administrators (Hyderabad, India)&lt;br/&gt;&lt;a href="http://www.meetup.com/hyderabad-hadoop/events/118712352/" target="_blank"&gt;&lt;a href="http://www.meetup.com/hyderabad-hadoop/events/118712352/" target="_blank"&gt;http://www.meetup.com/hyderabad-hadoop/events/118712352/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Thursday, May 16&lt;br/&gt; STORM for streaming analytics at scale (London, UK)&lt;br/&gt;&lt;a href="http://www.meetup.com/storm-london/events/118169452/" target="_blank"&gt;&lt;a href="http://www.meetup.com/storm-london/events/118169452/" target="_blank"&gt;http://www.meetup.com/storm-london/events/118169452/&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/50334790871</link><guid>http://blog.mortardata.com/post/50334790871</guid><pubDate>Mon, 13 May 2013 06:58:11 -0400</pubDate><category>Hadoop Weekly</category><category>hadoop</category></item><item><title>How to get Hilary Mason to build your recommender for free</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;span&gt;&lt;/span&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;If you want &lt;a href="http://www.hilarymason.com" target="_blank"&gt;Hilary Mason&lt;/a&gt;, &lt;a href="http://www.drewconway.com" target="_blank"&gt;Drew Conway&lt;/a&gt;, or &lt;a href="http://www.shron.net" target="_blank"&gt;Max Shron&lt;/a&gt; to build your recommender for free, &lt;a href="http://www.mortardata.com" target="_blank"&gt;enter your email address here&lt;/a&gt;.&lt;br/&gt;&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;div class="center"&gt;&lt;img alt="Recommender system" height="200" src="http://static.tumblr.com/8g3qwvr/tGtmmg6d5/recommended_stamp.jpg" width="auto"/&gt;&lt;/div&gt;
&lt;p&gt;As a platform for working with data, we’ve seen users tackle lots of interesting use-cases: log analysis, natural language processing, pattern detection, and many more.&lt;/p&gt;
&lt;p&gt;However, perhaps no use-case is in greater demand than recommender systems.  If you have more “inventory” than your users can easily find (whether it&amp;#8217;s news, jobs, videos, restaurants, vacations, recipes, apps, etc.), a great recommender is crucial to driving engagement.&lt;/p&gt;
&lt;p&gt;The problem is that recommender systems are really hard to implement, so most companies either don&amp;#8217;t have one or aren&amp;#8217;t happy with what they have.&lt;/p&gt;
&lt;p&gt;What makes recommenders so tough?&lt;!-- more --&gt;&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;&lt;strong&gt;It’s painful to Obtain, Scrub, and Explore.&lt;/strong&gt;  These systems often require a lot of data, and the &amp;#8220;OSE&amp;#8221; steps from &lt;a href="http://www.dataists.com/2010/09/a-taxonomy-of-data-science/" target="_blank"&gt;“A Taxonomy of Data Science”&lt;/a&gt; can take 75% of the effort in building a recommender system.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Processing data at scale is difficult (at best).&lt;/strong&gt;  Those who have ever attempted setting up and maintaining a Hadoop cluster know how challenging it can be.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Data scientists are nearly impossible to find.&lt;/strong&gt; Looking at every user and everything that could be recommended to them typically requires a very good data scientist.  Unfortunately, &lt;a href="http://online.wsj.com/article/SB10001424052702304723304577365700368073674.html" target="_blank"&gt;there is a massive data scientist shortage&lt;/a&gt;, and the problem is only getting worse.&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;Although our users were finding Mortar hugely helpful for OSE and processing their data at scale, they still found they needed extra help to build the recommenders they wanted due to the scarcity of data science talent available to them.&lt;/p&gt;
&lt;p&gt;To remedy this problem, we’re custom-building recommenders for 10 companies by partnering with three world-class data scientists:&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;&lt;a href="http://www.hilarymason.com" target="_blank"&gt;Hilary Mason&lt;/a&gt; – Chief Scientist at Bitly&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.drewconway.com" target="_blank"&gt;Drew Conway&lt;/a&gt; – Author of Machine Learning for Hackers, Scientist in Residence at IA Ventures and Co-founder of DataKind&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.shron.net" target="_blank"&gt;Max Shron&lt;/a&gt; – Data strategy consultant (clients include Amazon.com, the Guardian, and Warby Parker), formerly lead data scientist at OkCupid&lt;/li&gt;
&lt;/ul&gt;&lt;p&gt;We’re going to select 10 companies to work with, building each of them a custom recommender system created with open technologies (Pig, Python, Java), for free.  We are open sourcing the generic components and giving the custom pieces to their respective companies to keep.&lt;/p&gt;
&lt;p&gt;Why would we do this?&lt;/p&gt;
&lt;p&gt;We want to provide future users reusable, open source components that work on real problems, at scale.  So while 10 companies will get custom-built solutions, all of our customers benefit by being able to more easily build recommenders on Mortar.&lt;/p&gt;
&lt;p&gt;We&amp;#8217;ve already accepted two companies to our program during our beta rollout, and now we’re looking for eight more.  Want a free recommender custom-built for you?  &lt;a href="http://www.mortardata.com" target="_blank"&gt;Enter your email address here.&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/49934459499</link><guid>http://blog.mortardata.com/post/49934459499</guid><pubDate>Wed, 08 May 2013 10:29:00 -0400</pubDate><category>recommender systems</category><category>data science</category></item><item><title>Hadoop Weekly - May 6, 2013</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;span&gt;&lt;/span&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;&lt;em&gt;Hadoop Weekly is a recurring guest post by &lt;a href="http://www.crobak.org/" target="_blank"&gt;Joe Crobak&lt;/a&gt;&lt;/em&gt;&lt;em&gt;. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at &lt;a href="https://twitter.com/joecrobak" target="_blank"&gt;@joecrobak&lt;/a&gt;.&lt;/em&gt;&lt;br/&gt;&lt;br/&gt; There were two big and exciting releases this week from Hadoop vendors &amp;#8212; Cloudera with Impala and MapR with M7. In addition, this week marks the 500th subscriber to Hadoop Weekly! Thanks everyone for subscribing, and please send anything my way that you think might make a good addition to this newsletter.&lt;br/&gt;&lt;br/&gt;&lt;strong&gt; Technical&lt;/strong&gt;&lt;br/&gt; In the third and final part of his &amp;#8220;Introduction to Hadoop&amp;#8221; series, Tom White covers higher-level frameworks, anatomy of a Hadoop cluster, and data application pipelines. In terms of frameworks, he covers Pig, Hive, and Crunch (there&amp;#8217;s a nice example of computing top-K with Crunch).&lt;br/&gt;&lt;a href="http://www.drdobbs.com/database/introduction-to-hadoop-real-world-hadoop/240153375/" target="_blank"&gt;&lt;a href="http://www.drdobbs.com/database/introduction-to-hadoop-real-world-hadoop/240153375/" target="_blank"&gt;http://www.drdobbs.com/database/introduction-to-hadoop-real-world-hadoop/240153375/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt; &lt;!-- more --&gt;&lt;br/&gt; HBase uses Hadoop&amp;#8217;s metrics framework to expose metrics via JMX, HTTP, etc. This post covers the migration from the old metrics framework (&amp;#8220;metrics1&amp;#8221;) to the new framework (&amp;#8220;metrics2&amp;#8221;). 
Like most things that make use of a Hadoop API, there are extra complications because the hadoop-1 and hadoop-2 branches have diverged significantly.&lt;br/&gt;&lt;a href="https://blogs.apache.org/hbase/entry/migration_to_the_new_metrics" target="_blank"&gt;&lt;a href="https://blogs.apache.org/hbase/entry/migration_to_the_new_metrics" target="_blank"&gt;https://blogs.apache.org/hbase/entry/migration_to_the_new_metrics&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Cloudera Impala (low-latency SQL on HDFS) hit GA this week (more below), and some of the first benchmarks are starting to come out. Unsurprisingly, this presentation shows that Snappy-compressed Parquet files performed the best on their 11-node cluster, with average performance 12x that of Hive.&lt;br/&gt;&lt;a href="http://www.slideshare.net/sudabon/performance-evaluation-of-cloudera-impala-ga" target="_blank"&gt;&lt;a href="http://www.slideshare.net/sudabon/performance-evaluation-of-cloudera-impala-ga" target="_blank"&gt;http://www.slideshare.net/sudabon/performance-evaluation-of-cloudera-impala-ga&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; An interesting presentation about some real-world experience using Hadoop and Vertica for offline analytics at AdTech. They have some useful information about using Avro with Flume and Pig and the interaction between Hadoop and Vertica.&lt;br/&gt;&lt;a href="http://prezi.com/6r0rsx6hluu1/mapreduce-in-action-large-scale-reporting-based-on-hadoop-and-vertica/" target="_blank"&gt;&lt;a href="http://prezi.com/6r0rsx6hluu1/mapreduce-in-action-large-scale-reporting-based-on-hadoop-and-vertica/" target="_blank"&gt;http://prezi.com/6r0rsx6hluu1/mapreduce-in-action-large-scale-reporting-based-on-hadoop-and-vertica/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Andy Feng and Robert Evans recently presented on using Storm (a realtime distributed computation system) at Yahoo. 
In particular, their presentation covers collocating Storm and Hadoop on the same cluster using YARN.&lt;br/&gt;&lt;a href="http://www.slideshare.net/ydn/april-2013-hug" target="_blank"&gt;&lt;a href="http://www.slideshare.net/ydn/april-2013-hug" target="_blank"&gt;http://www.slideshare.net/ydn/april-2013-hug&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; A case-study of a Hive query that started out taking over 2 hours but was optimized to take under 10 minutes. They used a bunch of tricks to speed it up, including enabling map-side joins and optimizing the number of reducers.&lt;br/&gt;&lt;a href="http://dennyglee.com/2013/04/26/optimizing-joins-running-on-hdinsight-hive-on-azure-at-gfs/" target="_blank"&gt;&lt;a href="http://dennyglee.com/2013/04/26/optimizing-joins-running-on-hdinsight-hive-on-azure-at-gfs/" target="_blank"&gt;http://dennyglee.com/2013/04/26/optimizing-joins-running-on-hdinsight-hive-on-azure-at-gfs/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Ambari is an Apache Incubator project for provisioning, configuring, and monitoring Hadoop clusters (and related services). Hortonworks posted a walkthrough for setting up your first Ambari cluster. Tools like this (another option is Cloudera Manager) are a must for managing a Hadoop cluster.&lt;br/&gt;&lt;a href="http://hortonworks.com/kb/get-started-setting-up-ambari/" target="_blank"&gt;&lt;a href="http://hortonworks.com/kb/get-started-setting-up-ambari/" target="_blank"&gt;http://hortonworks.com/kb/get-started-setting-up-ambari/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;News&lt;/strong&gt;&lt;br/&gt; Cloudera announced Impala 1.0&amp;#160;GA this week. It supports many of the same commands and datasets as Hive, but with much lower latency. Cloudera is reporting performance gains anywhere from 6x to 68x vs. Hive, and also better-than-linear scaling for multi-tenant queries. 
Given that Cloudera announced Impala less than a year ago, the fact that Impala has already reached version 1.0 is quite impressive and exciting (even if not all the features are there) &amp;#8212; especially so since Impala is open-source.&lt;br/&gt;&lt;a href="http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/" target="_blank"&gt;&lt;a href="http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/" target="_blank"&gt;http://blog.cloudera.com/blog/2013/05/cloudera-impala-1-0-its-here-its-real-its-already-the-standard-for-sql-on-hadoop/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; MapR announced MapR M7 Edition. This version touts five-nines availability and other features targeting the NoSQL market. Main highlights include &amp;#8220;no manual administrative tasks such as table merges or splits,&amp;#8221; data snapshots, and data mirroring.&lt;br/&gt;&lt;a href="http://www.mapr.com/products/mapr-editions/m7-edition" target="_blank"&gt;&lt;a href="http://www.mapr.com/products/mapr-editions/m7-edition" target="_blank"&gt;http://www.mapr.com/products/mapr-editions/m7-edition&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; SAS, the business analytics software company, has partnered with Cloudera to integrate their software with Hadoop.&lt;br/&gt;&lt;a href="http://www.marketwire.com/press-release/cloudera-announces-strategic-alliance-with-sas-1783697.htm" target="_blank"&gt;&lt;a href="http://www.marketwire.com/press-release/cloudera-announces-strategic-alliance-with-sas-1783697.htm" target="_blank"&gt;http://www.marketwire.com/press-release/cloudera-announces-strategic-alliance-with-sas-1783697.htm&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; MapR and LucidWorks announced a partnership to include LucidWorks Search with MapR&amp;#8217;s distributions. LucidWorks Search provides a number of advanced features for Lucene/Solr. 
It&amp;#8217;s a little unclear what the technical details of this integration look like, but it sounds like they are providing indexing of data stored in MapR&amp;#8217;s FileSystem.&lt;br/&gt;&lt;a href="http://www.mapr.com/press-release/mapr-technologies-distributes-enterprise-grade-search-with-hadoop-platform" target="_blank"&gt;&lt;a href="http://www.mapr.com/press-release/mapr-technologies-distributes-enterprise-grade-search-with-hadoop-platform" target="_blank"&gt;http://www.mapr.com/press-release/mapr-technologies-distributes-enterprise-grade-search-with-hadoop-platform&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Hadoop Summit is coming up in less than 2 months in San Jose. The Hadoop Summit organizers, Hortonworks, have unveiled the full schedule for the conference.&lt;br/&gt;&lt;a href="http://hortonworks.com/blog/hadoop-summit-schedule-is-now-available/" target="_blank"&gt;&lt;a href="http://hortonworks.com/blog/hadoop-summit-schedule-is-now-available/" target="_blank"&gt;http://hortonworks.com/blog/hadoop-summit-schedule-is-now-available/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; While many business analytics solutions focus on structured data accessible via SQL (over JDBC or ODBC), Precog is focusing on providing the same capabilities on unstructured data sitting in HDFS or elsewhere. 
They announced this week that their product is leaving beta.&lt;br/&gt;&lt;a href="http://gigaom.com/2013/05/01/precog-launches-with-a-plan-to-simplify-analytics-on-unstructured-data/" target="_blank"&gt;&lt;a href="http://gigaom.com/2013/05/01/precog-launches-with-a-plan-to-simplify-analytics-on-unstructured-data/" target="_blank"&gt;http://gigaom.com/2013/05/01/precog-launches-with-a-plan-to-simplify-analytics-on-unstructured-data/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Releases&lt;/strong&gt;&lt;br/&gt; Cloudera Development Kit 0.2.0 was released with experimental support for the new Parquet columnar file format as well as integration with Hive/HCatalog for metadata management.&lt;br/&gt;&lt;a href="https://groups.google.com/a/cloudera.org/d/msg/cdh-user/Q9teffdaJzY/rmtMpXoBMA4J" target="_blank"&gt;&lt;a href="https://groups.google.com/a/cloudera.org/d/msg/cdh-user/Q9teffdaJzY/rmtMpXoBMA4J" target="_blank"&gt;https://groups.google.com/a/cloudera.org/d/msg/cdh-user/Q9teffdaJzY/rmtMpXoBMA4J&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Amazon Elastic MapReduce now has support for S3&amp;#8217;s server-side encryption. This includes S3DistCp, the EMR-optimized distcp implementation.&lt;br/&gt;&lt;a href="http://aws.amazon.com/about-aws/whats-new/2013/05/01/amazon-elastic-mapreduce-now-supports-S3-server-side-encryption/" target="_blank"&gt;&lt;a href="http://aws.amazon.com/about-aws/whats-new/2013/05/01/amazon-elastic-mapreduce-now-supports-S3-server-side-encryption/" target="_blank"&gt;http://aws.amazon.com/about-aws/whats-new/2013/05/01/amazon-elastic-mapreduce-now-supports-S3-server-side-encryption/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Events (curated by Mortar Data)&lt;/strong&gt;&lt;br/&gt; Monday, May 6&lt;br/&gt; Deep Dive into Cloudera Impala! 
(New York, NY)&lt;br/&gt;&lt;a href="http://www.meetup.com/Hadoop-NYC/" target="_blank"&gt;&lt;a href="http://www.meetup.com/Hadoop-NYC/" target="_blank"&gt;http://www.meetup.com/Hadoop-NYC/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Tuesday, May 7&lt;br/&gt; NoSQL &amp;amp; Hadoop with Couchbase server (Tel Aviv, Israel)&lt;br/&gt;&lt;a href="http://www.meetup.com/Big-Data-Israel/events/115467802/" target="_blank"&gt;&lt;a href="http://www.meetup.com/Big-Data-Israel/events/115467802/" target="_blank"&gt;http://www.meetup.com/Big-Data-Israel/events/115467802/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Wednesday, May 8&lt;br/&gt; San Francisco Hadoop Meetup (San Francisco, CA)&lt;br/&gt;&lt;a href="http://www.meetup.com/hadoopsf/events/114514922/" target="_blank"&gt;&lt;a href="http://www.meetup.com/hadoopsf/events/114514922/" target="_blank"&gt;http://www.meetup.com/hadoopsf/events/114514922/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Thursday, May 9&lt;br/&gt; Hadoop At Spotify (Warsaw, Poland)&lt;br/&gt;&lt;a href="http://www.meetup.com/warsaw-hug/events/111492272/" target="_blank"&gt;&lt;a href="http://www.meetup.com/warsaw-hug/events/111492272/" target="_blank"&gt;http://www.meetup.com/warsaw-hug/events/111492272/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Thursday, May 9&lt;br/&gt; Analyzing Twitter: An End-to-End Data Pipeline (Baltimore, MD)&lt;br/&gt;&lt;a href="http://www.meetup.com/Data-Science-MD/events/111081282/" target="_blank"&gt;&lt;a href="http://www.meetup.com/Data-Science-MD/events/111081282/" target="_blank"&gt;http://www.meetup.com/Data-Science-MD/events/111081282/&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/49766847606</link><guid>http://blog.mortardata.com/post/49766847606</guid><pubDate>Mon, 06 May 2013 07:00:00 -0400</pubDate><category>Hadoop Weekly</category><category>hadoop</category></item><item><title>Hadoop Weekly - April 29, 2013</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;span&gt;&lt;/span&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;&lt;em&gt;Hadoop Weekly is a recurring guest post by &lt;a href="http://www.crobak.org/" target="_blank"&gt;Joe Crobak&lt;/a&gt;&lt;/em&gt;&lt;em&gt;. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at &lt;a href="https://twitter.com/joecrobak" target="_blank"&gt;@joecrobak&lt;/a&gt;.&lt;/em&gt;&lt;br/&gt;&lt;br/&gt;The last full week of April was pretty busy for the Hadoop ecosystem &amp;#8212; two core projects (Hadoop and HBase) saw releases, there was also some exciting funding news (congrats to Qubole!), and there were plenty of interesting technical articles.&lt;br/&gt;&lt;strong&gt;&lt;br/&gt;Technical&lt;/strong&gt;&lt;br/&gt;The naming of components in Hadoop-related projects has often caused confusion (e.g. HDFS&amp;#8217; secondary namenode). Apache HBase is no exception &amp;#8212; the HMaster is often misunderstood, because contrary to what its name suggests, not all writes go through the HMaster. This article elaborates on the role of the HMaster in HBase.&lt;br/&gt;&lt;a href="https://blogs.apache.org/hbase/entry/hbase_who_needs_a_master" target="_blank"&gt;https://blogs.apache.org/hbase/entry/hbase_who_needs_a_master&lt;/a&gt;&lt;br/&gt;&lt;!-- more --&gt;&lt;br/&gt;Tachyon is the in-memory distributed file system from Berkeley&amp;#8217;s AMPLab that recently had its initial release. This article provides more details about the system, including how it might be good for MapReduce jobs and fit into the Hadoop ecosystem.&lt;br/&gt;&lt;a href="http://strata.oreilly.com/2013/04/tachyon-open-source-distributed-fault-tolerant-in-memory-file-system.html" target="_blank"&gt;http://strata.oreilly.com/2013/04/tachyon-open-source-distributed-fault-tolerant-in-memory-file-system.html&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&amp;#8220;Meet the Project Founder&amp;#8221; is a new blog series from Cloudera. 
Their first story features Doug Cutting, the founder of Hadoop and Cloudera&amp;#8217;s Chief Architect. Doug is incredibly prolific in open-source &amp;#8212; he&amp;#8217;s started the Apache Lucene, Nutch, Hadoop, and Avro projects.&lt;br/&gt;&lt;a href="http://blog.cloudera.com/blog/2013/04/meet-the-project-founder-doug-cutting-first-in-a-series/" target="_blank"&gt;http://blog.cloudera.com/blog/2013/04/meet-the-project-founder-doug-cutting-first-in-a-series/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Gravity migrated from MySQL to HBase as their primary data store. In addition to talking about their use-case of both online and batch processing via MapReduce, this article speaks to the recent improvements in the ease of deployment of HBase and the Hadoop stack, and how it&amp;#8217;s changing the data storage landscape.&lt;br/&gt;&lt;a href="http://gigaom.com/2013/04/22/how-hbase-converted-myspaces-mysql-champion-and-is-driving-hadoop-mainstream/" target="_blank"&gt;http://gigaom.com/2013/04/22/how-hbase-converted-myspaces-mysql-champion-and-is-driving-hadoop-mainstream/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Apache MRQL (incubating) is another low-latency analytics solution on HDFS. Unlike other systems, it&amp;#8217;s SQL-like but not SQL and can take advantage of Hama&amp;#8217;s Bulk Synchronous Parallel (BSP). These differences make it possible to do iterative processing, e.g. computing k-means (of which there is an example in this post).&lt;br/&gt;&lt;a href="http://www.hadoopsphere.com/2013/04/mrql-sql-on-hadoop-miracle.html" target="_blank"&gt;http://www.hadoopsphere.com/2013/04/mrql-sql-on-hadoop-miracle.html&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;In the second part of his series on Dr. Dobbs, Tom White gives a thorough walkthrough of writing your first MapReduce job. 
He covers the classic &amp;#8220;hello world&amp;#8221; of MapReduce &amp;#8212; word count.&lt;br/&gt;&lt;a href="http://www.drdobbs.com/database/hadoop-writing-and-running-your-first-pr/240153197/" target="_blank"&gt;http://www.drdobbs.com/database/hadoop-writing-and-running-your-first-pr/240153197/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Since entering the Hadoop-game, VMware has been improving Hadoop on virtualized hardware. This article covers 7 myths related to Hadoop &amp;#8212; some related to virtualization (and probably controversial) and others more generally applicable.&lt;br/&gt;&lt;a href="http://blogs.vmware.com/vfabric/2013/04/myths-about-running-hadoop-in-a-virtualized-environment.html" target="_blank"&gt;http://blogs.vmware.com/vfabric/2013/04/myths-about-running-hadoop-in-a-virtualized-environment.html&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;It can be difficult to keep up with all the SQL-on-Hadoop solutions (it seems like there is a new one each week!). This article covers four of them &amp;#8212; Impala, Hadapt, Hawq, and Berkeley Data Analytics Stack (BDAS) &amp;#8212; including the trade-offs you make when selecting one or the other (and importantly, the maturity of the product).&lt;br/&gt;&lt;a href="http://www.openbi.com/content/sql-hadoop" target="_blank"&gt;http://www.openbi.com/content/sql-hadoop&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;This article covers converting the Hortonworks Sandbox virtual machine image to a Rackspace instance. 
It&amp;#8217;s a pretty interesting idea, and the process appears to be quite easy.&lt;br/&gt;&lt;a href="http://devops.rackspace.com/getting-started-with-hadoop-using-hortonworks-sandbox.html" target="_blank"&gt;http://devops.rackspace.com/getting-started-with-hadoop-using-hortonworks-sandbox.html&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Based upon a Dell Whitepaper, Hortonworks has highlighted six important hardware decisions for designing a Hadoop cluster &amp;#8212; from the operating system to storage to network.&lt;br/&gt;&lt;a href="http://hortonworks.com/blog/6-key-hardware-considerations-for-deploying-hadoop-in-your-environment/" target="_blank"&gt;http://hortonworks.com/blog/6-key-hardware-considerations-for-deploying-hadoop-in-your-environment/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;These slides give a good technical overview of the HAWQ (Greenplum&amp;#8217;s SQL-on-HDFS solution) architecture (starting on page 21) as well as the features of Spring&amp;#8217;s Hadoop integration (starting on page 37).&lt;br/&gt;&lt;a href="http://www.slideshare.net/marklpollack/pivotal-hd-and-spring" target="_blank"&gt;http://www.slideshare.net/marklpollack/pivotal-hd-and-spring&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;On the WANdisco blog, Konstantin Boudnik provides an interesting analysis of the Hadoop 2.0-alpha series, which as he notes is on its 5th release in the past 11 months.&lt;br/&gt;&lt;a href="http://blogs.wandisco.com/2013/04/22/hadoop-2-alpha-elephant-or-not/" target="_blank"&gt;http://blogs.wandisco.com/2013/04/22/hadoop-2-alpha-elephant-or-not/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;In the paper, &amp;#8220;Nobody ever got fired for using Hadoop on a cluster&amp;#8221; (no link, due to copyright restrictions), the authors observe that while MapReduce is great for many tasks, there are a growing number of situations (mostly due to the dropping price of memory) in which data can fit in RAM on a single machine.&lt;br/&gt;&lt;br/&gt;When getting started with Hadoop, it can be a challenge just to 
decide which distribution to use &amp;#8212; it seems like each week a new vendor is announcing their new distribution. This fragmentation has been a source of discussion recently, and this article speaks a bit about that and also celebrates Apache BigTop as the system making all of these releases possible.&lt;br/&gt;&lt;a href="http://blogs.wandisco.com/2013/04/22/on-coming-fragmentation-of-hadoop-platform/" target="_blank"&gt;http://blogs.wandisco.com/2013/04/22/on-coming-fragmentation-of-hadoop-platform/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;News&lt;/strong&gt;&lt;br/&gt;Qubole, the Hadoop-as-a-Service startup from a team of former Facebook employees, has raised $7 million in Series A financing. It&amp;#8217;s fantastic to see a vote of confidence in a company that&amp;#8217;s lowering the barrier to entry of Hadoop.&lt;br/&gt;&lt;a href="http://gigaom.com/2013/04/23/hadoop-startup-qubole-raises-7m-for-hive-as-a-service/" target="_blank"&gt;http://gigaom.com/2013/04/23/hadoop-startup-qubole-raises-7m-for-hive-as-a-service/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Cloudera&amp;#8217;s HUE has some interesting new features in the recent 2.3 release. They include Oozie improvements and a new Pig Editor.&lt;br/&gt;&lt;a href="http://blog.cloudera.com/blog/2013/04/whats-new-in-hue-2-3/" target="_blank"&gt;http://blog.cloudera.com/blog/2013/04/whats-new-in-hue-2-3/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Spring XD is a new project from the folks at SpringSource focusing on tools for data ingestion, real-time analytics, workflow management, and data export.&lt;br/&gt;&lt;a href="http://blog.springsource.org/2013/04/23/introducing-spring-xd/" target="_blank"&gt;http://blog.springsource.org/2013/04/23/introducing-spring-xd/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Sematext&amp;#8217;s Performance Monitoring (SPM) suite is adding support for Hadoop. 
SPM is a proactive monitoring tool that can be self-hosted or used as a service.&lt;br/&gt;&lt;a href="http://blog.sematext.com/2013/04/23/sneak-peek-hadoop-monitoring-comes-to-spm/" target="_blank"&gt;http://blog.sematext.com/2013/04/23/sneak-peek-hadoop-monitoring-comes-to-spm/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Releases&lt;/strong&gt;&lt;br/&gt;HBase 0.94.7 was released and is the new stable version. The release contains performance improvements and bug fixes.&lt;br/&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/hbase-user/201304.mbox/%3C1366934770.31322.YahooMailNeo%40web140601.mail.bf1.yahoo.com%3E" target="_blank"&gt;http://mail-archives.apache.org/mod_mbox/hbase-user/201304.mbox/%3C1366934770.31322.YahooMailNeo%40web140601.mail.bf1.yahoo.com%3E&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Hadoop 2.0.4-alpha was released. This is intended to be the final alpha release, with a beta following up soon.&lt;br/&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/hadoop-general/201304.mbox/%3C5B011941-90CA-4EEF-BAB9-39A6BFE99B1D%40hortonworks.com%3E" target="_blank"&gt;http://mail-archives.apache.org/mod_mbox/hadoop-general/201304.mbox/%3C5B011941-90CA-4EEF-BAB9-39A6BFE99B1D%40hortonworks.com%3E&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;KijiSchema and KijiMR, the HBase schema management software and accompanying MapReduce libraries, were updated to versions 1.0.2 and rc61. These releases include bug fixes and improvements.&lt;br/&gt;&lt;a href="http://www.kiji.org/2013/04/22/announcing-kijischema-1-0-2-and-kijimr-rc61/" target="_blank"&gt;http://www.kiji.org/2013/04/22/announcing-kijischema-1-0-2-and-kijimr-rc61/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;The MySQL project has announced a new product (currently in the MySQL labs) called the MySQL Hadoop Applier for replaying the MySQL binlog onto a file in HDFS. 
Notably, it currently only supports INSERT commands, but data can be inserted into HDFS in near-real time.&lt;br/&gt;&lt;a href="http://dev.mysql.com/tech-resources/articles/mysql-hadoop-applier.html" target="_blank"&gt;http://dev.mysql.com/tech-resources/articles/mysql-hadoop-applier.html&lt;/a&gt;&lt;br/&gt;&lt;a href="http://innovating-technology.blogspot.com/2013/04/mysql-hadoop-applier-part-1.html" target="_blank"&gt;http://innovating-technology.blogspot.com/2013/04/mysql-hadoop-applier-part-1.html&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Cloudera released CDH 4.2.1, which includes a number of improvements and bug fixes.&lt;br/&gt;&lt;a href="https://groups.google.com/a/cloudera.org/d/msg/cdh-user/yM2tng8-kqI/LinM89A4vhUJ" target="_blank"&gt;https://groups.google.com/a/cloudera.org/d/msg/cdh-user/yM2tng8-kqI/LinM89A4vhUJ&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Events &lt;/strong&gt;(curated by Mortar Data)&lt;br/&gt;Tuesday, April 30&lt;br/&gt;Big Data Jobs in London Meetup (London, UK)&lt;br/&gt;&lt;a href="http://www.meetup.com/Big-Data-Jobs-in-London/events/110496712/" target="_blank"&gt;http://www.meetup.com/Big-Data-Jobs-in-London/events/110496712/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Improving Hive; MapR Hbase M7 (Washington D.C.)&lt;br/&gt;&lt;a href="http://www.meetup.com/Hadoop-DC/events/114264532/" target="_blank"&gt;http://www.meetup.com/Hadoop-DC/events/114264532/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;Cloudera - Enterprise Big Data Platform (Hamilton Township, NJ)&lt;br/&gt;&lt;a href="http://www.meetup.com/nj-hadoop/events/113996532/" target="_blank"&gt;http://www.meetup.com/nj-hadoop/events/113996532/&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;br/&gt;Wednesday, May 1&lt;br/&gt;Hadoop Hackathon! (Houston, TX)&lt;br/&gt;&lt;a href="http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/102954462/" target="_blank"&gt;http://www.meetup.com/Houston-Hadoop-Meetup-Group/events/102954462/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;Thursday, May 2&lt;br/&gt;Big Data, Data Science, and Hadoop (San Francisco, CA)&lt;br/&gt;&lt;a href="http://www.meetup.com/San-Francisco-Bay-Area-Microsoft-BI-User-Group/events/114347422/" target="_blank"&gt;http://www.meetup.com/San-Francisco-Bay-Area-Microsoft-BI-User-Group/events/114347422/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;Thursday, May 2&lt;br/&gt;Data Science for Sustainability (Redwood City, CA)&lt;br/&gt;&lt;a href="http://www.meetup.com/Data-Science-for-Sustainability/events/113231972/" target="_blank"&gt;http://www.meetup.com/Data-Science-for-Sustainability/events/113231972/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;Saturday, May 4&lt;br/&gt;Accumulo Hackathon (Washington D.C.)&lt;br/&gt;&lt;a href="http://www.meetup.com/Hadoop-DC/events/112435332/" target="_blank"&gt;http://www.meetup.com/Hadoop-DC/events/112435332/&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/49171067948</link><guid>http://blog.mortardata.com/post/49171067948</guid><pubDate>Mon, 29 Apr 2013 06:53:00 -0400</pubDate><category>Hadoop Weekly</category><category>hadoop</category></item><item><title>Hadoop Weekly - April 22, 2013</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;span&gt;&lt;/span&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;&lt;em&gt;Hadoop Weekly is a recurring guest post by &lt;a href="http://www.crobak.org/" target="_blank"&gt;Joe Crobak&lt;/a&gt;&lt;/em&gt;&lt;em&gt;. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at &lt;a href="https://twitter.com/joecrobak" target="_blank"&gt;@joecrobak&lt;/a&gt;.&lt;/em&gt;&lt;br/&gt;&lt;br/&gt;There were a number of exciting announcements and releases this week (e.g. Hadoop on OpenStack, Impala 0.7) as well as some fantastic technical articles and tutorials. It&amp;#8217;s great to see more technical articles about how folks are doing things with Hadoop &amp;#8212; this week covering Hadoop internals, data formats, and MapReduce-based mobile UI customization. A big thanks to those that share their insights and experiences for making this newsletter possible!&lt;br/&gt; &lt;!-- more --&gt;&lt;br/&gt;&lt;strong&gt; News&lt;/strong&gt;&lt;br/&gt; Cloudera has announced the Cloudera Academic Partnership program, with seven universities as part of the initial program. 
Cloudera cites the need for Hadoop-related expertise as the main motivation for the program.&lt;br/&gt;&lt;a href="http://siliconangle.com/blog/2013/04/19/closing-the-gap-on-big-data-education-cloudera-teams-with-top-universities-around-the-world/" target="_blank"&gt;&lt;a href="http://siliconangle.com/blog/2013/04/19/closing-the-gap-on-big-data-education-cloudera-teams-with-top-universities-around-the-world/" target="_blank"&gt;http://siliconangle.com/blog/2013/04/19/closing-the-gap-on-big-data-education-cloudera-teams-with-top-universities-around-the-world/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="http://blog.cloudera.com/blog/2013/04/cloudera-academic-partnership-program-creating-hadoop-lovers-in-universities-worldwide/" target="_blank"&gt;&lt;a href="http://blog.cloudera.com/blog/2013/04/cloudera-academic-partnership-program-creating-hadoop-lovers-in-universities-worldwide/" target="_blank"&gt;http://blog.cloudera.com/blog/2013/04/cloudera-academic-partnership-program-creating-hadoop-lovers-in-universities-worldwide/&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;/a&gt;Apache BigTop is a system for testing the components of the Hadoop stack in conjunction with one another. This post highlights why it&amp;#8217;s becoming popular (particularly with vendors) even though it&amp;#8217;s not in the spotlight.&lt;br/&gt;&lt;a href="https://blogs.apache.org/bigtop/entry/bigtop_the_way_to_grow" target="_blank"&gt;&lt;a href="https://blogs.apache.org/bigtop/entry/bigtop_the_way_to_grow" target="_blank"&gt;https://blogs.apache.org/bigtop/entry/bigtop_the_way_to_grow&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; We&amp;#8217;ve been seeing a lot of new products focusing on running SQL on HDFS. Most of these products distribute worker-nodes alongside the datanodes. Teradata has taken a different approach (they kind of have to since they ship an appliance). 
This week, in addition to announcing a new set of hardware, they announced SQL-H, which gives Teradata access to data stored on HDFS by using HCatalog to get metadata about the files in HDFS.&lt;br/&gt;&lt;a href="http://www.asterdata.com/blog/2013/04/the-enterprise-approach-to-interactive-sql-on-hadoop-data-teradata-sql-h/" target="_blank"&gt;&lt;a href="http://www.asterdata.com/blog/2013/04/the-enterprise-approach-to-interactive-sql-on-hadoop-data-teradata-sql-h/" target="_blank"&gt;http://www.asterdata.com/blog/2013/04/the-enterprise-approach-to-interactive-sql-on-hadoop-data-teradata-sql-h/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="http://www.dbms2.com/2013/04/15/teradata-sql-h/" target="_blank"&gt;&lt;a href="http://www.dbms2.com/2013/04/15/teradata-sql-h/" target="_blank"&gt;http://www.dbms2.com/2013/04/15/teradata-sql-h/&lt;/a&gt;&lt;/a&gt; (see the first comment)&lt;br/&gt;&lt;br/&gt; Mirantis, Hortonworks, and Red Hat are working on project Savanna to bring Hadoop support to OpenStack (OpenStack is software for managing cloud computing infrastructure). 
It sounds like they&amp;#8217;re targeting an initial release for June in time for Hadoop Summit, and that they have some grand plans &amp;#8212; everything from provisioning bare-metal hardware to enabling something like Amazon&amp;#8217;s Elastic MapReduce.&lt;br/&gt;&lt;a href="http://www.theregister.co.uk/2013/04/18/project_savanna_hadoop_on_openstack/" target="_blank"&gt;&lt;a href="http://www.theregister.co.uk/2013/04/18/project_savanna_hadoop_on_openstack/" target="_blank"&gt;http://www.theregister.co.uk/2013/04/18/project_savanna_hadoop_on_openstack/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="http://hortonworks.com/blog/hadoop-perect-app-for-openstack/" target="_blank"&gt;&lt;a href="http://hortonworks.com/blog/hadoop-perect-app-for-openstack/" target="_blank"&gt;http://hortonworks.com/blog/hadoop-perect-app-for-openstack/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="https://wiki.openstack.org/wiki/Savanna" target="_blank"&gt;&lt;a href="https://wiki.openstack.org/wiki/Savanna" target="_blank"&gt;https://wiki.openstack.org/wiki/Savanna&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Technical&lt;/strong&gt;&lt;br/&gt; This is a great overview of the components in the Hadoop stack other than HDFS and MapReduce &amp;#8212; in particular, HBase, Cassandra, Pig, Hive, and Impala. It also discusses a few other SQL-on-Hadoop solutions.&lt;br/&gt;&lt;a href="http://binalytics.wordpress.com/2013/04/20/quick-hadoop-overview/" target="_blank"&gt;&lt;a href="http://binalytics.wordpress.com/2013/04/20/quick-hadoop-overview/" target="_blank"&gt;http://binalytics.wordpress.com/2013/04/20/quick-hadoop-overview/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; While its architecture is easy to understand, HDFS is a complex piece of software that oftentimes seems to work as if by magic.  
This article discusses the architecture and starts diving into the software stack &amp;#8212; providing a map for someone trying to navigate the source code.&lt;br/&gt;&lt;a href="http://www.javacodegeeks.com/2013/04/how-hadoop-works-hdfs-case-study.html" target="_blank"&gt;&lt;a href="http://www.javacodegeeks.com/2013/04/how-hadoop-works-hdfs-case-study.html" target="_blank"&gt;http://www.javacodegeeks.com/2013/04/how-hadoop-works-hdfs-case-study.html&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Tom White, the author of Hadoop, The Definitive Guide, is writing a series of posts for Dr. Dobb&amp;#8217;s about Hadoop. The first article has an overview of HDFS and MapReduce as well as an introduction to various other systems in the Hadoop stack like Flume, Pig, Hive, and HBase.&lt;br/&gt;&lt;a href="http://www.drdobbs.com/database/hadoop-the-lay-of-the-land/240150854" target="_blank"&gt;&lt;a href="http://www.drdobbs.com/database/hadoop-the-lay-of-the-land/240150854" target="_blank"&gt;http://www.drdobbs.com/database/hadoop-the-lay-of-the-land/240150854&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; LinkedIn is using Hadoop-based algorithms to customize the UI on their mobile apps, where real estate is limited. Their infrastructure includes Kafka for data ingestion, a Hadoop workflow for building recommendations (which they describe in some detail), and Voldemort for serving the data in real-time.&lt;br/&gt;&lt;a href="http://engineering.linkedin.com/mobile/linkedin-mobile-introducing-personalized-navigation" target="_blank"&gt;&lt;a href="http://engineering.linkedin.com/mobile/linkedin-mobile-introducing-personalized-navigation" target="_blank"&gt;http://engineering.linkedin.com/mobile/linkedin-mobile-introducing-personalized-navigation&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; At the Twitter Seattle Open House, Julien Le Dem presented on Parquet, the new columnar storage format that Twitter is building in collaboration with Cloudera. 
The slides include a great overview of the use-case, the file format, and some initial benchmarks.&lt;br/&gt;&lt;a href="http://www.slideshare.net/julienledem/parquet-twitter-seattle-open-house" target="_blank"&gt;&lt;a href="http://www.slideshare.net/julienledem/parquet-twitter-seattle-open-house" target="_blank"&gt;http://www.slideshare.net/julienledem/parquet-twitter-seattle-open-house&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Tutorials&lt;/strong&gt;&lt;br/&gt; Cloudera HUE provides a web interface to interact with Hadoop to upload and browse data as well as run Hive and MapReduce jobs. In this tutorial, you&amp;#8217;ll load a dataset from the Yelp challenge into Hive, run some SQL queries on it, and then run a python streaming MapReduce job using MrJob.&lt;br/&gt;&lt;a href="http://blog.cloudera.com/blog/2013/04/demo-analyzing-data-with-hue-and-hive/" target="_blank"&gt;&lt;a href="http://blog.cloudera.com/blog/2013/04/demo-analyzing-data-with-hue-and-hive/" target="_blank"&gt;http://blog.cloudera.com/blog/2013/04/demo-analyzing-data-with-hue-and-hive/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Redis is a key-value store that supports various data structures such as lists, sets, strings, and more. This tutorial covers getting data in and out of Redis from MapReduce, including the code for custom input formats, record readers, and output formats.&lt;br/&gt;&lt;a href="http://www.greenplum.com/blog/topics/hadoop/making-hadoop-mapreduce-work-with-a-redis-cluster" target="_blank"&gt;&lt;a href="http://www.greenplum.com/blog/topics/hadoop/making-hadoop-mapreduce-work-with-a-redis-cluster" target="_blank"&gt;http://www.greenplum.com/blog/topics/hadoop/making-hadoop-mapreduce-work-with-a-redis-cluster&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; A tutorial covering running Apache Mahout on HDInsight (HDInsight is the Hadoop Distribution running on Windows Azure). 
Covers install, setup, and running a Mahout MapReduce job.&lt;br/&gt;&lt;a href="http://bluewatersql.wordpress.com/2013/04/12/installing-mahout-for-hdinsight-on-windows-server/" target="_blank"&gt;&lt;a href="http://bluewatersql.wordpress.com/2013/04/12/installing-mahout-for-hdinsight-on-windows-server/" target="_blank"&gt;http://bluewatersql.wordpress.com/2013/04/12/installing-mahout-for-hdinsight-on-windows-server/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt; Releases&lt;/strong&gt;&lt;br/&gt; Cloudera Impala 0.7 was released (and a few days later the 0.7.1 release with some critical bug fixes was announced). Version 0.7.1 has a bunch of new features, including support for the Parquet columnar file format and Avro, plus distributed aggregations and top-n computations. This release supports CDH4.1 and 4.2 as well as a number of different Linux distributions.&lt;br/&gt;&lt;a href="https://groups.google.com/a/cloudera.org/d/msg/impala-user/2EUowODoRIk/44S7Q044-_UJ" target="_blank"&gt;&lt;a href="https://groups.google.com/a/cloudera.org/d/msg/impala-user/2EUowODoRIk/44S7Q044-_UJ" target="_blank"&gt;https://groups.google.com/a/cloudera.org/d/msg/impala-user/2EUowODoRIk/44S7Q044-_UJ&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="https://groups.google.com/a/cloudera.org/d/msg/impala-user/LuWvbjUY0EU/mHKJYc2Blm4J" target="_blank"&gt;&lt;a href="https://groups.google.com/a/cloudera.org/d/msg/impala-user/LuWvbjUY0EU/mHKJYc2Blm4J" target="_blank"&gt;https://groups.google.com/a/cloudera.org/d/msg/impala-user/LuWvbjUY0EU/mHKJYc2Blm4J&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Apache MRUnit, the MapReduce unit-testing library, reached version 1.0.0. 
It supports both Hadoop 1 and Hadoop 2.&lt;br/&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201304.mbox/%3CCAFZSZPsXMx6taV-%2B3SC7GKaV1M%2BkRE3TMLk%3DcDZ-5GAn8UceZQ%40mail.gmail.com%3E" target="_blank"&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201304.mbox/%3CCAFZSZPsXMx6taV-%2B3SC7GKaV1M%2BkRE3TMLk%3DcDZ-5GAn8UceZQ%40mail.gmail.com%3E" target="_blank"&gt;http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201304.mbox/%3CCAFZSZPsXMx6taV-%2B3SC7GKaV1M%2BkRE3TMLk%3DcDZ-5GAn8UceZQ%40mail.gmail.com%3E&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Last week, Amazon announced support for Elastic MapReduce on their GovCloud service.&lt;br/&gt;&lt;a href="http://www.theregister.co.uk/2013/04/09/amazon_adds_hadoop_to_govcloud/" target="_blank"&gt;&lt;a href="http://www.theregister.co.uk/2013/04/09/amazon_adds_hadoop_to_govcloud/" target="_blank"&gt;http://www.theregister.co.uk/2013/04/09/amazon_adds_hadoop_to_govcloud/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; UC Berkeley&amp;#8217;s AMPLab, the same lab that develops Spark, has announced the Tachyon Project. 
Tachyon is a distributed file system that can cache some datasets in memory, but it checkpoints data to an underlying file system (it currently supports HDFS or a single-node local file system).&lt;br/&gt;&lt;a href="http://tachyon-project.org/" target="_blank"&gt;&lt;a href="http://tachyon-project.org/" target="_blank"&gt;http://tachyon-project.org/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Events&lt;/strong&gt;&lt;br/&gt; Monday, April 22&lt;br/&gt; Cloudera Sessions (Toronto, Canada)&lt;br/&gt;&lt;a href="http://www.meetup.com/TorontoHUG/events/114398482/" target="_blank"&gt;&lt;a href="http://www.meetup.com/TorontoHUG/events/114398482/" target="_blank"&gt;http://www.meetup.com/TorontoHUG/events/114398482/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Tuesday, April 23&lt;br/&gt; Natural Language Processing and Big Data (Washington, DC)&lt;br/&gt;&lt;a href="http://www.meetup.com/Data-Science-DC/events/109386702/" target="_blank"&gt;&lt;a href="http://www.meetup.com/Data-Science-DC/events/109386702/" target="_blank"&gt;http://www.meetup.com/Data-Science-DC/events/109386702/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Wednesday, April 24&lt;br/&gt; Big Data @ Yelp &amp;#8212; taming the reviews &amp;amp; recommendations (San Jose, CA)&lt;br/&gt;&lt;a href="http://www.meetup.com/BigDataGurus/events/114645332/" target="_blank"&gt;&lt;a href="http://www.meetup.com/BigDataGurus/events/114645332/" target="_blank"&gt;http://www.meetup.com/BigDataGurus/events/114645332/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Thursday, April 25&lt;br/&gt; Data in the Big City (New York, NY)&lt;br/&gt;&lt;a href="http://www.meetup.com/DataKind-NYC/events/112727792/" target="_blank"&gt;&lt;a href="http://www.meetup.com/DataKind-NYC/events/112727792/" target="_blank"&gt;http://www.meetup.com/DataKind-NYC/events/112727792/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Thursday, April 25&lt;br/&gt; Power in Numbers: Growing Atlanta&amp;#8217;s Data Science Talent (Atlanta, 
Georgia)&lt;br/&gt;&lt;a href="http://www.meetup.com/Data-Science-ATL/events/109289502/" target="_blank"&gt;&lt;a href="http://www.meetup.com/Data-Science-ATL/events/109289502/" target="_blank"&gt;http://www.meetup.com/Data-Science-ATL/events/109289502/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Thursday, April 25&lt;br/&gt; Bigvis: visualising 100,000,000 observations in R with Hadley Wickham (New York, NY)&lt;br/&gt;&lt;a href="http://www.meetup.com/nyhackr/events/112271042/" target="_blank"&gt;&lt;a href="http://www.meetup.com/nyhackr/events/112271042/" target="_blank"&gt;http://www.meetup.com/nyhackr/events/112271042/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Saturday, April 27&lt;br/&gt; Map Reduce Programming - Deep Dive (Santa Clara, CA)&lt;br/&gt;&lt;a href="http://mapreduce-deep-dive-es2005.eventbrite.com/?rank=6" target="_blank"&gt;&lt;a href="http://mapreduce-deep-dive-es2005.eventbrite.com/?rank=6" target="_blank"&gt;http://mapreduce-deep-dive-es2005.eventbrite.com/?rank=6&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/48604832497</link><guid>http://blog.mortardata.com/post/48604832497</guid><pubDate>Mon, 22 Apr 2013 06:50:51 -0400</pubDate><category>Hadoop Weekly</category><category>hadoop</category></item><item><title> Hadoop Weekly - April 15, 2013 </title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;span&gt;&lt;/span&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;div class="copy"&gt;
&lt;p&gt;&lt;em&gt;Hadoop Weekly is a recurring guest post by &lt;a href="http://www.crobak.org/" target="_blank"&gt;Joe Crobak&lt;/a&gt;&lt;/em&gt;&lt;em&gt;. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at &lt;a href="https://twitter.com/joecrobak" target="_blank"&gt;@joecrobak&lt;/a&gt;.&lt;/em&gt;&lt;br/&gt;&lt;br/&gt; This week&amp;#8217;s newsletter features fewer releases than normal (let me know if I missed something!) but has a lot of interesting technical articles. In addition, I&amp;#8217;m pleased to announce the return of an events section. Thanks to the folks at Mortar Data for curating this list! They&amp;#8217;ve found a number of great Hadoop-related events taking place all over the world this week.&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Technical&lt;/strong&gt;&lt;br/&gt; Apache Pig provides support for expressive SQL-like join operations. In this post, Matthew Rathbone shows how to implement a left-outer join in Pig and write a unit test to check for correctness. This is his third article that demos a framework &amp;#8212; he previously covered MapReduce and Hive. This trifecta is quite an interesting comparison, so be sure to read all three if you missed the previous articles.&lt;br/&gt;&lt;a href="http://blog.matthewrathbone.com/2013/04/07/real-world-hadoop---implementing-a-left-outer-join-in-pig.html" target="_blank"&gt;&lt;a href="http://blog.matthewrathbone.com/2013/04/07/real-world-hadoop---implementing-a-left-outer-join-in-pig.html" target="_blank"&gt;http://blog.matthewrathbone.com/2013/04/07/real-world-hadoop---implementing-a-left-outer-join-in-pig.html&lt;/a&gt;&lt;br/&gt; &lt;!-- more --&gt;&lt;br/&gt;&lt;/a&gt;If you&amp;#8217;re reading this newsletter, you probably don&amp;#8217;t need convincing, but Ofer Mendelevitch from Hortonworks offers some compelling reasons to use Hadoop for Data Science. 
Each reason (such as &amp;#8220;Data exploration with full datasets&amp;#8221;) includes a discussion, and some reasons include a discussion of the tools available to aid a data scientist.&lt;br/&gt;&lt;a href="http://hortonworks.com/blog/4-reasons-to-use-hadoop-for-data-science/" target="_blank"&gt;&lt;a href="http://hortonworks.com/blog/4-reasons-to-use-hadoop-for-data-science/" target="_blank"&gt;http://hortonworks.com/blog/4-reasons-to-use-hadoop-for-data-science/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Apache Ambari is a system for managing and configuring Hadoop and related projects such as Apache ZooKeeper. This tutorial covers configuring a 6-node test cluster on EC2 with HDFS, MapReduce, Nagios, Ganglia, HBase, ZooKeeper, Hive, and HCatalog.&lt;br/&gt;&lt;a href="http://hortonworks.com/kb/ambari-on-ec2/" target="_blank"&gt;&lt;a href="http://hortonworks.com/kb/ambari-on-ec2/" target="_blank"&gt;http://hortonworks.com/kb/ambari-on-ec2/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Vagrant provides a command line interface and tools to spin up and configure virtual machines with VirtualBox, VMware, or a cloud provider. This post explains using Vagrant to build and configure a (virtual machine) Hadoop cluster. The recipe from the post lets one build a 6-node cluster with a single command, &amp;#8216;vagrant up&amp;#8217;.&lt;br/&gt;&lt;a href="http://blog.cloudera.com/blog/2013/04/how-to-use-vagrant-to-set-up-a-virtual-hadoop-cluster/" target="_blank"&gt;&lt;a href="http://blog.cloudera.com/blog/2013/04/how-to-use-vagrant-to-set-up-a-virtual-hadoop-cluster/" target="_blank"&gt;http://blog.cloudera.com/blog/2013/04/how-to-use-vagrant-to-set-up-a-virtual-hadoop-cluster/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Krishnan Raman of Twitter presented on using Scalding (a Scala DSL for Cascading) and Algebird (Twitter&amp;#8217;s open-source abstract algebra framework) at BigData TechCon in Boston. 
In addition to the slides, the code and materials for the presentation have been posted to github.&lt;br/&gt;&lt;a href="https://github.com/krishnanraman/bigdata/blob/master/ProgrammingScaldingAlgebird.pdf?raw=true" target="_blank"&gt;&lt;a href="https://github.com/krishnanraman/bigdata/blob/master/ProgrammingScaldingAlgebird.pdf?raw=true" target="_blank"&gt;https://github.com/krishnanraman/bigdata/blob/master/ProgrammingScaldingAlgebird.pdf?raw=true&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="https://github.com/krishnanraman/bigdata" target="_blank"&gt;&lt;a href="https://github.com/krishnanraman/bigdata" target="_blank"&gt;https://github.com/krishnanraman/bigdata&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Splout is a SQL data store that includes tight-coupling with Hadoop and is suitable for serving real-time, web-scale traffic. This post is the 3rd in a series (the first two covered Hive and Cascading), and it covers loading data into Splout from Pig.&lt;br/&gt;&lt;a href="http://www.datasalt.com/2013/04/pig-splout-sql-for-a-retail-coupon-generator-a-big-data-love-story/" target="_blank"&gt;&lt;a href="http://www.datasalt.com/2013/04/pig-splout-sql-for-a-retail-coupon-generator-a-big-data-love-story/" target="_blank"&gt;http://www.datasalt.com/2013/04/pig-splout-sql-for-a-retail-coupon-generator-a-big-data-love-story/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; HCatalog provides access to Hive&amp;#8217;s metadata to other portions of the Hadoop stack (e.g. MapReduce and Pig) as well as via REST. This allows HCatalog to act as a glue between many different components in the stack. 
This blog post has a great overview of HCatalog&amp;#8217;s features and benefits.&lt;br/&gt;&lt;a href="http://hortonworks.com/blog/hivehcatalog-data-geeks-big-data-glue/" target="_blank"&gt;&lt;a href="http://hortonworks.com/blog/hivehcatalog-data-geeks-big-data-glue/" target="_blank"&gt;http://hortonworks.com/blog/hivehcatalog-data-geeks-big-data-glue/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; AirBnB has followed up their recent post about Chronos, their workflow and scheduling software, with an overview of their big data stack. They&amp;#8217;re using Storm and Hadoop, in addition to Chronos, on a single Mesos cluster. They have some information about each and promise a follow up post with more details.&lt;br/&gt;&lt;a href="http://nerds.airbnb.com/distributed-computing-at-airbnb" target="_blank"&gt;&lt;a href="http://nerds.airbnb.com/distributed-computing-at-airbnb" target="_blank"&gt;http://nerds.airbnb.com/distributed-computing-at-airbnb&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;News&lt;/strong&gt;&lt;br/&gt; The Hadoop Summit selection committees have created the initial program for Hadoop Summit taking place this June in San Jose. More sessions will be posted over the coming weeks.&lt;br/&gt;&lt;a href="http://hadoopsummit.org/san-jose/program/" target="_blank"&gt;&lt;a href="http://hadoopsummit.org/san-jose/program/" target="_blank"&gt;http://hadoopsummit.org/san-jose/program/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; The Call for Proposals is open through May 16th for Strata + Hadoop World. The conference takes place in New York in October, and covers big data, data science, and pervasive computing. 
Proposals for 40-minute sessions as well as 3-hour tutorials are accepted for one of these topics.&lt;br/&gt;&lt;a href="http://strataconf.com/stratany2013/public/cfp/264" target="_blank"&gt;&lt;a href="http://strataconf.com/stratany2013/public/cfp/264" target="_blank"&gt;http://strataconf.com/stratany2013/public/cfp/264&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Releases&lt;/strong&gt;&lt;br/&gt; A few weeks ago, Wibidata announced version 1.0.0 of KijiSchema. KijiSchema is a data management system built atop Apache HBase and focused on real-time retrieval of diverse datasets. The 1.0.0 version marks a commitment to maintaining API compatibility going forward.&lt;br/&gt;&lt;a href="http://www.kiji.org/2013/04/02/announcing-kijischema-1-0-0/" target="_blank"&gt;&lt;a href="http://www.kiji.org/2013/04/02/announcing-kijischema-1-0-0/" target="_blank"&gt;http://www.kiji.org/2013/04/02/announcing-kijischema-1-0-0/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Also announced a few weeks ago was the general availability of Platfora. Unlike other solutions that either provide a SQL interface or focus on BI tools, Platfora is trying to do both. A user asks a question via a web UI, and Platfora imports and caches data via MapReduce jobs in order to find an answer.&lt;br/&gt;&lt;a href="http://www.platfora.com/hadoop-ecosystem-blog/" target="_blank"&gt;&lt;a href="http://www.platfora.com/hadoop-ecosystem-blog/" target="_blank"&gt;http://www.platfora.com/hadoop-ecosystem-blog/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Events &lt;/strong&gt;(curated by Mortar Data)&lt;br/&gt; Tuesday, April 16&lt;br/&gt; Amazon Elastic Map Reduce - Hadoop Cloud Service (Hamilton Township, NJ)&lt;br/&gt;&lt;a href="http://www.meetup.com/nj-hadoop/" target="_blank"&gt;&lt;a href="http://www.meetup.com/nj-hadoop/" target="_blank"&gt;http://www.meetup.com/nj-hadoop/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Tuesday, April 16&lt;br/&gt; St. 
Louis Hadoop Users Group Meetup (Saint Louis, MO)&lt;br/&gt;&lt;a href="http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/110159842/" target="_blank"&gt;&lt;a href="http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/110159842/" target="_blank"&gt;http://www.meetup.com/St-Louis-Hadoop-Users-Group/events/110159842/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Wednesday, April 17&lt;br/&gt; Automating the Hadoop Stack with Chef (San Diego, CA)&lt;br/&gt;&lt;a href="http://www.meetup.com/sd-hug/events/112475312/" target="_blank"&gt;&lt;a href="http://www.meetup.com/sd-hug/events/112475312/" target="_blank"&gt;http://www.meetup.com/sd-hug/events/112475312/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Thursday, April 18&lt;br/&gt; A recommendation system and MapReduce (New York, NY)&lt;br/&gt;&lt;a href="http://www.meetup.com/NYC-Machine-Learning/" target="_blank"&gt;&lt;a href="http://www.meetup.com/NYC-Machine-Learning/" target="_blank"&gt;http://www.meetup.com/NYC-Machine-Learning/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Thursday, April 18&lt;br/&gt; Hadoop 2.0: What&amp;#8217;s coming? 
(Toronto)&lt;br/&gt;&lt;a href="http://www.meetup.com/TorontoHUG/events/112153292/" target="_blank"&gt;&lt;a href="http://www.meetup.com/TorontoHUG/events/112153292/" target="_blank"&gt;http://www.meetup.com/TorontoHUG/events/112153292/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Thursday, April 18&lt;br/&gt; Big Data in the AWS Cloud + More (Norwich, UK)&lt;br/&gt;&lt;a href="http://www.syncnorwich.com/events/110574642/?eventId=110574642&amp;amp;action=detail" target="_blank"&gt;&lt;a href="http://www.syncnorwich.com/events/110574642/?eventId=110574642&amp;amp;action=detail" target="_blank"&gt;http://www.syncnorwich.com/events/110574642/?eventId=110574642&amp;amp;action=detail&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Friday, April 19&lt;br/&gt; Big Hadoop Jobs on AWS (Munich, Germany)&lt;br/&gt;&lt;a href="http://www.meetup.com/Hadoop-User-Group-Munich/events/102940592/" target="_blank"&gt;&lt;a href="http://www.meetup.com/Hadoop-User-Group-Munich/events/102940592/" target="_blank"&gt;http://www.meetup.com/Hadoop-User-Group-Munich/events/102940592/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Friday, April 19&lt;br/&gt; Hadoop At Spotify (Kraków, Poland)&lt;br/&gt;&lt;a href="http://www.meetup.com/datakrk/events/113175722/" target="_blank"&gt;&lt;a href="http://www.meetup.com/datakrk/events/113175722/" target="_blank"&gt;http://www.meetup.com/datakrk/events/113175722/&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/48035373999</link><guid>http://blog.mortardata.com/post/48035373999</guid><pubDate>Mon, 15 Apr 2013 07:57:00 -0400</pubDate><category>Hadoop Weekly</category><category>hadoop</category><category>apache pig</category></item><item><title>Data Science at Tumblr</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;span&gt;&lt;/span&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p class="p1"&gt;Our second &lt;a href="http://www.meetup.com/NYC-Data-Science/" target="_blank"&gt;NYC Data Science Meetup&lt;/a&gt; featured Tumblr data scientist &lt;a href="https://twitter.com/adamlaiacano" target="_blank"&gt;Adam Laiacano&lt;/a&gt;, who discussed the analytics stack at Tumblr and the tools he and his team use to organize and analyze data. &lt;/p&gt;
&lt;p class="p1"&gt;Here are the video and slides from Adam&amp;#8217;s talk, which cover Tumblr&amp;#8217;s use of &lt;span&gt;Scribe, Hive &amp;amp; Pig, Hue, and Vowpal Wabbit&lt;/span&gt;:&lt;/p&gt;
&lt;p class="p1"&gt;&lt;!-- more --&gt;&lt;/p&gt;
&lt;p&gt;&lt;iframe frameborder="0" height="281" src="http://player.vimeo.com/video/63656541" width="500"&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;&lt;iframe frameborder="0" height="400" marginheight="0" marginwidth="0" scrolling="no" src="http://www.slideshare.net/slideshow/embed_code/18411779" width="476"&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p class="p2"&gt;&lt;span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p class="p2"&gt;&lt;span&gt;Big thanks to Adam for his talk and to the great group of data enthusiasts who attended.&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/47549853491</link><guid>http://blog.mortardata.com/post/47549853491</guid><pubDate>Tue, 09 Apr 2013 13:41:22 -0400</pubDate><category>NYC Data Science Meetup</category><category>data science</category><category>pig</category><category>hive</category></item><item><title>Hadoop Weekly - April 8, 2013</title><description>&lt;p&gt;&lt;div class="author"&gt;
&lt;h3&gt;&lt;span&gt;&lt;/span&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;div class="copy"&gt;&lt;em&gt;Hadoop Weekly is a recurring guest post by &lt;a href="http://www.crobak.org/" target="_blank"&gt;Joe Crobak&lt;/a&gt;&lt;/em&gt;&lt;em&gt;. Joe is a software engineer on Foursquare’s big data team, where he focuses on Hadoop and analytics. You can follow Joe on Twitter at &lt;a href="https://twitter.com/joecrobak" target="_blank"&gt;@joecrobak&lt;/a&gt;.&lt;/em&gt;&lt;/div&gt;
&lt;div class="copy"&gt;&lt;span&gt; &lt;/span&gt;&lt;/div&gt;
&lt;div class="copy"&gt;&lt;span&gt;Happy 7th birthday to Apache Hadoop! The first release of Hadoop was made in April 2006. This week&amp;#8217;s newsletter caps that anniversary by representing many parts of the Hadoop ecosystem. It&amp;#8217;s quite impressive how far the project and the ecosystem have come in those 7 short years!&lt;/span&gt;&lt;/div&gt;
&lt;div class="copy"&gt;&lt;br/&gt;&lt;strong&gt;News&lt;/strong&gt;&lt;br/&gt; April 2nd marked the 7-year anniversary of the first release of Apache Hadoop. In this post, Doug Cutting (the founder of Hadoop) provides 7 thoughts and predictions for Hadoop. He touches on everything from open source to the name of the project to where he sees Hadoop heading in the next 7 years.&lt;br/&gt;&lt;a href="http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/" target="_blank"&gt;&lt;a href="http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/" target="_blank"&gt;http://blog.cloudera.com/blog/2013/04/seven-thoughts-on-hadoops-seventh-birthday/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;!-- more --&gt;&lt;br/&gt; The folks at LiveRamp have come up with a clever technique to speed up joins/cogroups by filtering map-side using Bloom filters. If you haven&amp;#8217;t seen Bloom filters before, the post explains their usefulness in this context. With this technique, they see performance improvements of 2x for a large job. They have open-sourced an implementation of this technique for Cascading.&lt;br/&gt;&lt;a href="http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/" target="_blank"&gt;&lt;a href="http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/" target="_blank"&gt;http://blog.liveramp.com/2013/04/03/bloomjoin-bloomfilter-cogroup/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Videos and slides from Hadoop Summit EU are beginning to arrive online. Hortonworks highlights the keynotes from the event, which include presentations from 451 Research, Hortonworks, and a panel featuring HSBC, eBay and others. 
You can find many more talks (and more being added every week) on the Hadoop Summit YouTube page, too.&lt;br/&gt;&lt;a href="http://hortonworks.com/blog/keynotes-from-hadoop-summit-amsterdam-2013/" target="_blank"&gt;&lt;a href="http://hortonworks.com/blog/keynotes-from-hadoop-summit-amsterdam-2013/" target="_blank"&gt;http://hortonworks.com/blog/keynotes-from-hadoop-summit-amsterdam-2013/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="http://www.youtube.com/user/HadoopSummit?feature=watch" target="_blank"&gt;&lt;a href="http://www.youtube.com/user/HadoopSummit?feature=watch" target="_blank"&gt;http://www.youtube.com/user/HadoopSummit?feature=watch&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; YARN is the new resource scheduler in Hadoop 2.0 for building applications other than vanilla MapReduce on a Hadoop cluster. Josh Patterson has started a new open-source project called Metronome built upon YARN. The software is based upon former projects IterativeReduce and Knitting Boar, and it provides an implementation of parallel linear regression.&lt;br/&gt;&lt;a href="http://www.slideshare.net/jpatanooga/hadoop-summit-eu-2013-parallel-linear-regression-iterativereduce-and-yarn" target="_blank"&gt;&lt;a href="http://www.slideshare.net/jpatanooga/hadoop-summit-eu-2013-parallel-linear-regression-iterativereduce-and-yarn" target="_blank"&gt;http://www.slideshare.net/jpatanooga/hadoop-summit-eu-2013-parallel-linear-regression-iterativereduce-and-yarn&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="https://github.com/jpatanooga/Metronome" target="_blank"&gt;&lt;a href="https://github.com/jpatanooga/Metronome" target="_blank"&gt;https://github.com/jpatanooga/Metronome&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; VMWare has released version 0.8.0 of Serengeti, their open-source initiative to improve Hadoop for virtualization. 
This release includes support for CDH4 and MapR&amp;#8217;s distributions as well as improved support for HBase.&lt;br/&gt;&lt;a href="http://blogs.vmware.com/vfabric/2013/04/new-serengeti-release-extends-cloud-computing-support-for-hadoop-community.html" target="_blank"&gt;&lt;a href="http://blogs.vmware.com/vfabric/2013/04/new-serengeti-release-extends-cloud-computing-support-for-hadoop-community.html" target="_blank"&gt;http://blogs.vmware.com/vfabric/2013/04/new-serengeti-release-extends-cloud-computing-support-for-hadoop-community.html&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Luigi is an open-source Hadoop framework from the folks at Spotify. We&amp;#8217;ve been using it at Foursquare for a few months and really like it. In this presentation, Elias gives an overview of Luigi as well as the evolution of Spotify&amp;#8217;s thinking about workflow management which explains how they arrived at Luigi.&lt;br/&gt;&lt;a href="http://www.slideshare.net/EliasFreider/luigi-pydata-presentation" target="_blank"&gt;&lt;a href="http://www.slideshare.net/EliasFreider/luigi-pydata-presentation" target="_blank"&gt;http://www.slideshare.net/EliasFreider/luigi-pydata-presentation&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="https://github.com/spotify/luigi" target="_blank"&gt;&lt;a href="https://github.com/spotify/luigi" target="_blank"&gt;https://github.com/spotify/luigi&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Datanami has a good summary of Eric Baldeschwieler&amp;#8217;s (aka Eric14) keynote from Hadoop Summit. 
The synopsis includes Eric&amp;#8217;s views on the future of Hadoop, from scaling to 10,000 nodes to lots of younger projects in the Hadoop ecosystem like HCatalog, Ambari, Tez, and more.&lt;br/&gt;&lt;a href="http://www.datanami.com/datanami/2013-04-03/baldeschwieler:_looking_at_the_future_of_hadoop.html" target="_blank"&gt;&lt;a href="http://www.datanami.com/datanami/2013-04-03/baldeschwieler:_looking_at_the_future_of_hadoop.html" target="_blank"&gt;http://www.datanami.com/datanami/2013-04-03/baldeschwieler:_looking_at_the_future_of_hadoop.html&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Microsoft has posted an in-depth analysis of how they use Hadoop on Azure (HDInsight) with Halo 4. The case study includes everything from analytics and BI to email targeting. It&amp;#8217;s a pretty interesting and impressive analysis considering that just a few months ago Hadoop didn&amp;#8217;t run on Windows at all.&lt;br/&gt;&lt;a href="http://www.microsoft.com/enterprise/it-trends/big-data/articles/Changing-the-Game-Halo-4-Team-Gets-New-User-Insights-from-Big-Data-in-the-Cloud.aspx#fbid=OAmTkNNsaBu" target="_blank"&gt;&lt;a href="http://www.microsoft.com/enterprise/it-trends/big-data/articles/Changing-the-Game-Halo-4-Team-Gets-New-User-Insights-from-Big-Data-in-the-Cloud.aspx#fbid=OAmTkNNsaBu" target="_blank"&gt;http://www.microsoft.com/enterprise/it-trends/big-data/articles/Changing-the-Game-Halo-4-Team-Gets-New-User-Insights-from-Big-Data-in-the-Cloud.aspx#fbid=OAmTkNNsaBu&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Falcon is a new Apache Incubator project from the folks at InMobi and Hortonworks focussing on ETL. It has a number of use cases, such as disaster recovery, multi-cluster management, and SLA management. It seems to have some overlap with existing projects (e.g. 
Oozie or Sqoop) but is focused on just ETL within or between Hadoop clusters so far.&lt;br/&gt;&lt;a href="http://hortonworks.com/blog/project-falcon-tackling-hadoop-data-lifecycle-management-via-community-driven-open-source/" target="_blank"&gt;&lt;a href="http://hortonworks.com/blog/project-falcon-tackling-hadoop-data-lifecycle-management-via-community-driven-open-source/" target="_blank"&gt;http://hortonworks.com/blog/project-falcon-tackling-hadoop-data-lifecycle-management-via-community-driven-open-source/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="http://www.inmobi.com/inmobiblog/2013/04/02/inmobi-works-with-hortonworks-to-incubate-falcon-with-apache-software-foundation-to-provide-huge-benefits-to-the-big-data-community/" target="_blank"&gt;&lt;a href="http://www.inmobi.com/inmobiblog/2013/04/02/inmobi-works-with-hortonworks-to-incubate-falcon-with-apache-software-foundation-to-provide-huge-benefits-to-the-big-data-community/" target="_blank"&gt;http://www.inmobi.com/inmobiblog/2013/04/02/inmobi-works-with-hortonworks-to-incubate-falcon-with-apache-software-foundation-to-provide-huge-benefits-to-the-big-data-community/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Did you know that Windows Azure can run Linux VMs? This tutorial shows how to build a Linux (CentOS) Hadoop cluster in Windows Azure. 
After booting a Windows Server for DNS, the rest of the tutorial focuses on Hadoop (they use HDP 1.2.2) on Linux.&lt;br/&gt;&lt;a href="http://blogs.msdn.com/b/benjguin/archive/2013/04/05/how-to-install-hadoop-on-windows-azure-linux-virtual-machines.aspx" target="_blank"&gt;&lt;a href="http://blogs.msdn.com/b/benjguin/archive/2013/04/05/how-to-install-hadoop-on-windows-azure-linux-virtual-machines.aspx" target="_blank"&gt;http://blogs.msdn.com/b/benjguin/archive/2013/04/05/how-to-install-hadoop-on-windows-azure-linux-virtual-machines.aspx&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; April Fools&amp;#8217; Day was this week, and there were a few fake Hadoop-related product announcements. Here are a couple in case you missed them.&lt;br/&gt;&lt;a href="http://www.hadoopsphere.com/2013/04/yas-1000x-faster-sql-on-hadoop-engine.html" target="_blank"&gt;&lt;a href="http://www.hadoopsphere.com/2013/04/yas-1000x-faster-sql-on-hadoop-engine.html" target="_blank"&gt;http://www.hadoopsphere.com/2013/04/yas-1000x-faster-sql-on-hadoop-engine.html&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="http://www.wibidata.com/blog/real-time-is-reckless-slow-and-steady-wins-the-race" target="_blank"&gt;&lt;a href="http://www.wibidata.com/blog/real-time-is-reckless-slow-and-steady-wins-the-race" target="_blank"&gt;http://www.wibidata.com/blog/real-time-is-reckless-slow-and-steady-wins-the-race&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;strong&gt;Releases&lt;/strong&gt;&lt;br/&gt;Apache Pig 0.11.1 was released. 
This update includes fixes to Avro, HCatalog, and HBase integrations (and more) as well as improvements including documentation polish.&lt;br/&gt;&lt;a href="http://pig.apache.org/releases.html#1+April%2C+2013%3A+release+0.11.1+available" target="_blank"&gt;&lt;a href="http://pig.apache.org/releases.html#1+April%2C+2013%3A+release+0.11.1+available" target="_blank"&gt;http://pig.apache.org/releases.html#1+April%2C+2013%3A+release+0.11.1+available&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; KairosDB is a rewrite of OpenTSDB (the time series database and visualization system from StumbleUpon) with a pluggable backend (defaults to Cassandra but also supports HBase and H2). KairosDB uses Flot for visualization and provides REST APIs for retrieving data. The release on their website is 1.0.0-alpha-4a, so I assume it&amp;#8217;s still considered alpha quality.&lt;br/&gt;&lt;a href="http://nosql.mypopescu.com/post/47102531877/kairosdb-fast-scalable-time-series-database" target="_blank"&gt;&lt;a href="http://nosql.mypopescu.com/post/47102531877/kairosdb-fast-scalable-time-series-database" target="_blank"&gt;http://nosql.mypopescu.com/post/47102531877/kairosdb-fast-scalable-time-series-database&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="https://code.google.com/p/kairosdb/" target="_blank"&gt;&lt;a href="https://code.google.com/p/kairosdb/" target="_blank"&gt;https://code.google.com/p/kairosdb/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Hama is a computing system on top of HDFS that is specialized for matrix and graph problems. 
Version 0.6.1, which includes improvements, bug fixes and new features, was released this week.&lt;br/&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/hadoop-general/201304.mbox/%3CCAGQgZQQ1x3w5tRB3eVs-ZNdsBKGz5Qdwy%3DhW5JOOjcmadOUC6Q%40mail.gmail.com%3E" target="_blank"&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/hadoop-general/201304.mbox/%3CCAGQgZQQ1x3w5tRB3eVs-ZNdsBKGz5Qdwy%3DhW5JOOjcmadOUC6Q%40mail.gmail.com%3E" target="_blank"&gt;http://mail-archives.apache.org/mod_mbox/hadoop-general/201304.mbox/%3CCAGQgZQQ1x3w5tRB3eVs-ZNdsBKGz5Qdwy%3DhW5JOOjcmadOUC6Q%40mail.gmail.com%3E&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%220.6.1%22%20AND%20project%20%3D%20HAMA" target="_blank"&gt;&lt;a href="https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%220.6.1%22%20AND%20project%20%3D%20HAMA" target="_blank"&gt;https://issues.apache.org/jira/issues/?jql=fixVersion%20%3D%20%220.6.1%22%20AND%20project%20%3D%20HAMA&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Cloudera Manager 4.5.1 was released. 
This bug fix release contains fixes for HDFS, MapReduce, and Hive.&lt;br/&gt;&lt;a href="https://groups.google.com/a/cloudera.org/d/msg/cdh-user/qYgMASROJWQ/ctFfmzE1TF0J" target="_blank"&gt;&lt;a href="https://groups.google.com/a/cloudera.org/d/msg/cdh-user/qYgMASROJWQ/ctFfmzE1TF0J" target="_blank"&gt;https://groups.google.com/a/cloudera.org/d/msg/cdh-user/qYgMASROJWQ/ctFfmzE1TF0J&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.5.1/Cloudera-Manager-Enterprise-Edition-4.5.x-Release-Notes/Cloudera-Manager-Enterprise-Edition-4.html" target="_blank"&gt;&lt;a href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.5.1/Cloudera-Manager-Enterprise-Edition-4.5.x-Release-Notes/Cloudera-Manager-Enterprise-Edition-4.html" target="_blank"&gt;http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM4Ent/4.5.1/Cloudera-Manager-Enterprise-Edition-4.5.x-Release-Notes/Cloudera-Manager-Enterprise-Edition-4.html&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Cloudera announced the Cloudera Developer Kit. The goal of the project is to make it easier to write applications on top of Hadoop. Unlike other high-level frameworks for Hadoop, CDK is focusing on all layers of the stack, not just MapReduce. 
For example, the data API is one of the first under development, and it focuses on easing the burden of implementing data integration services which would normally have to muck with the nuances of the HDFS APIs.&lt;br/&gt;&lt;a href="https://groups.google.com/a/cloudera.org/d/msg/cdh-user/xJV77baI4Ss/5oPZzcaIe7wJ" target="_blank"&gt;&lt;a href="https://groups.google.com/a/cloudera.org/d/msg/cdh-user/xJV77baI4Ss/5oPZzcaIe7wJ" target="_blank"&gt;https://groups.google.com/a/cloudera.org/d/msg/cdh-user/xJV77baI4Ss/5oPZzcaIe7wJ&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="https://github.com/cloudera/cdk" target="_blank"&gt;&lt;a href="https://github.com/cloudera/cdk" target="_blank"&gt;https://github.com/cloudera/cdk&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; ElasticSearch Hadoop is a set of libraries for MapReduce, Pig, Hive, and Cascading from the folks at ElasticSearch. They haven&amp;#8217;t yet made a release, so you have to build it yourself, but it includes a bunch of features.  The drivers support read from and write to ElasticSearch over REST/JSON, and they&amp;#8217;ve made the binary small and independent for easy integration (just add a single jar!).&lt;br/&gt;&lt;a href="https://github.com/elasticsearch/elasticsearch-hadoop" target="_blank"&gt;&lt;a href="https://github.com/elasticsearch/elasticsearch-hadoop" target="_blank"&gt;https://github.com/elasticsearch/elasticsearch-hadoop&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; CDH3u6 was released. 
This version includes a handful of fixes in MapReduce, Flume, and HBase.&lt;br/&gt;&lt;a href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH3/CDH3u6/CDH3-Release-Notes/CDH3-Release-Notes.html" target="_blank"&gt;&lt;a href="http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH3/CDH3u6/CDH3-Release-Notes/CDH3-Release-Notes.html" target="_blank"&gt;http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH3/CDH3u6/CDH3-Release-Notes/CDH3-Release-Notes.html&lt;/a&gt;&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;&lt;/p&gt;</description><link>http://blog.mortardata.com/post/47449248771</link><guid>http://blog.mortardata.com/post/47449248771</guid><pubDate>Mon, 08 Apr 2013 07:03:00 -0400</pubDate><category>Hadoop Weekly</category><category>hadoop</category></item><item><title>Hadoop Weekly - April 1, 2013</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;span&gt;&lt;/span&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;&lt;em&gt;Hadoop Weekly is a new (recurring) guest post by &lt;a href="http://www.crobak.org/" target="_blank"&gt;Joe Crobak&lt;/a&gt;&lt;/em&gt;&lt;em&gt;.  Joe is a software engineer on Foursquare&amp;#8217;s big data team, where he focuses on Hadoop and analytics.  You can follow Joe on Twitter at &lt;a href="https://twitter.com/joecrobak" target="_blank"&gt;@joecrobak&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;&lt;strong&gt;News&lt;/strong&gt;&lt;/span&gt;&lt;br/&gt;Apache Hadoop&amp;#8217;s Distributed File System and MapReduce were originally based upon research papers written by Google. Google owns a number of patents in these spaces, including 10 related to MapReduce. This week, they pledged &amp;#8220;not to sue any user, distributor or developer of open-source software on specified patents, unless first attacked.&amp;#8221;&lt;br/&gt;&lt;a href="http://google-opensource.blogspot.com/2013/03/taking-stand-on-open-source-and-patents.html" target="_blank"&gt;&lt;a href="http://google-opensource.blogspot.com/2013/03/taking-stand-on-open-source-and-patents.html" target="_blank"&gt;http://google-opensource.blogspot.com/2013/03/taking-stand-on-open-source-and-patents.html&lt;/a&gt;&lt;/a&gt;&lt;br/&gt; &lt;!-- more --&gt;&lt;br/&gt; MapR and Canonical announced that MapR&amp;#8217;s M3 Hadoop distribution will be an integrated offering in Ubuntu 12.04 LTS and 12.10 via the Ubuntu Partner Portal.&lt;br/&gt;&lt;a href="http://www.businesswire.com/news/home/20130328005139/en/MapR-Teams-Canonical-Deliver-Hadoop-Ubuntu" target="_blank"&gt;&lt;a href="http://www.businesswire.com/news/home/20130328005139/en/MapR-Teams-Canonical-Deliver-Hadoop-Ubuntu" target="_blank"&gt;http://www.businesswire.com/news/home/20130328005139/en/MapR-Teams-Canonical-Deliver-Hadoop-Ubuntu&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;/a&gt;MapR also made some news this week by open-sourcing their forks of a number of projects in the Hadoop stack (but not MapR FS). The list includes sqoop, pig, mahout, hive, hbase, oozie, opentsdb, scribe and more. 
Some of these projects haven&amp;#8217;t been updated in a year (scribe) but the majority were updated in the past month.&lt;br/&gt;&lt;a href="http://www.businesswire.com/news/home/20130328005137/en/MapR-Source-Code-GitHub" target="_blank"&gt;&lt;a href="http://www.businesswire.com/news/home/20130328005137/en/MapR-Source-Code-GitHub" target="_blank"&gt;http://www.businesswire.com/news/home/20130328005137/en/MapR-Source-Code-GitHub&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="https://github.com/mapr/" target="_blank"&gt;&lt;a href="https://github.com/mapr/" target="_blank"&gt;https://github.com/mapr/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Netflix has an in-depth blog post about building recommendation systems. They discuss the three types of systems that they use &amp;#8212; offline, nearline, and online, as well as the trade-offs and design decisions you have to consider for each. The post contains a number of detailed system diagrams with thorough explanations about the systems that they use from hadoop to cassandra to mysql.&lt;br/&gt;&lt;a href="http://techblog.netflix.com/2013/03/system-architectures-for.html" target="_blank"&gt;&lt;a href="http://techblog.netflix.com/2013/03/system-architectures-for.html" target="_blank"&gt;http://techblog.netflix.com/2013/03/system-architectures-for.html&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; GIS Tools is a set of open-source tools for spatial data analytics on Hadoop from the folks at Esri. 
It includes java libraries for integration into MapReduce as well as lots of Hive UDFs for spatial and geometric processing.&lt;br/&gt;&lt;a href="http://blogs.esri.com/esri/arcgis/2013/03/25/gis-tools-for-hadoop/" target="_blank"&gt;&lt;a href="http://blogs.esri.com/esri/arcgis/2013/03/25/gis-tools-for-hadoop/" target="_blank"&gt;http://blogs.esri.com/esri/arcgis/2013/03/25/gis-tools-for-hadoop/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;a href="http://esri.github.com/gis-tools-for-hadoop/" target="_blank"&gt;&lt;a href="http://esri.github.com/gis-tools-for-hadoop/" target="_blank"&gt;http://esri.github.com/gis-tools-for-hadoop/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; HUE is an open-source web interface to Hadoop, which includes a number of applications such as an HDFS browser and a web-based interface into Hive called Beeswax. This tutorial describes loading tweet data into HDFS, creating a table for that data in Hive, and running an analysis of the data using Hive.&lt;br/&gt;&lt;a href="http://blog.cloudera.com/blog/2013/03/how-to-analyze-twitter-data-with-hue/" target="_blank"&gt;&lt;a href="http://blog.cloudera.com/blog/2013/03/how-to-analyze-twitter-data-with-hue/" target="_blank"&gt;http://blog.cloudera.com/blog/2013/03/how-to-analyze-twitter-data-with-hue/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Microsoft&amp;#8217;s HDInsight platform includes a developer edition for local development and testing. 
In this post, the author describes the process of setting up a local development environment on Windows, writing a map reduce job using C#, running your job on the cluster, and loading the data into Hive.&lt;br/&gt;&lt;a href="http://www.amazedsaint.com/2013/03/taming-big-data-with-c-using-hadoop-on.html" target="_blank"&gt;&lt;a href="http://www.amazedsaint.com/2013/03/taming-big-data-with-c-using-hadoop-on.html" target="_blank"&gt;http://www.amazedsaint.com/2013/03/taming-big-data-with-c-using-hadoop-on.html&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Matt Walker presented on Etsy&amp;#8217;s data stack at Data Day Texas. He covers the evolution of their offline infrastructure over several years, systems that they&amp;#8217;ve built, and the tools &amp;amp; frameworks that they use for offline analytics.&lt;br/&gt;&lt;a href="http://www.slideshare.net/mwalkerinfo/data-daytexas2013" target="_blank"&gt;&lt;a href="http://www.slideshare.net/mwalkerinfo/data-daytexas2013" target="_blank"&gt;http://www.slideshare.net/mwalkerinfo/data-daytexas2013&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; The Call for Speakers for HBaseCon 2013 ends today (4/1). Cloudera interviewed some members of the HBaseCon Program Committee about HBase and HBaseCon.&lt;br/&gt;&lt;a href="http://blog.cloudera.com/blog/2013/03/meet-the-hbasecon-2013-program-committee/" target="_blank"&gt;&lt;a href="http://blog.cloudera.com/blog/2013/03/meet-the-hbasecon-2013-program-committee/" target="_blank"&gt;http://blog.cloudera.com/blog/2013/03/meet-the-hbasecon-2013-program-committee/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt; Debugging MapReduce is usually a matter of adding counters or System.out.println statements to determine what is happening. If you&amp;#8217;re developing on a local box, though, a simpler solution is to attach a debugger. 
This walkthrough has all you need to know about debugging a MapReduce job in IntelliJ.&lt;br/&gt;&lt;a href="http://vichargrave.com/debugging-hadoop-applications-with-intellij/" target="_blank"&gt;&lt;a href="http://vichargrave.com/debugging-hadoop-applications-with-intellij/" target="_blank"&gt;http://vichargrave.com/debugging-hadoop-applications-with-intellij/&lt;/a&gt;&lt;/a&gt;&lt;br/&gt;&lt;br/&gt;&lt;span&gt;&lt;strong&gt; Releases&lt;/strong&gt;&lt;/span&gt;&lt;br/&gt; Apache Oozie 3.3.2 was released with a number of improvements and bug fixes. Among the highlights are: uberjar support for MapReduce actions, improvements to the Oozie web UI, and improvements to the command line for coordinators.&lt;br/&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/oozie-user/201303.mbox/%3CCAHz+ZFd2dmk4U8Z8M41QQS-4o8nQea6CfZ5uzLpuR3oz9XjAdA@mail.gmail.com%3E" target="_blank"&gt;&lt;a href="http://mail-archives.apache.org/mod_mbox/oozie-user/201303.mbox/%3CCAHz+ZFd2dmk4U8Z8M41QQS-4o8nQea6CfZ5uzLpuR3oz9XjAdA@mail.gmail.com%3E" target="_blank"&gt;http://mail-archives.apache.org/mod_mbox/oozie-user/201303.mbox/%3CCAHz+ZFd2dmk4U8Z8M41QQS-4o8nQea6CfZ5uzLpuR3oz9XjAdA@mail.gmail.com%3E&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/46842031224</link><guid>http://blog.mortardata.com/post/46842031224</guid><pubDate>Mon, 01 Apr 2013 08:16:00 -0400</pubDate><category>Hadoop Weekly</category><category>hadoop</category></item><item><title>Dirty Secrets of Data Science</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;span&gt;&lt;/span&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;Thanks to everyone who came out to our inaugural NYC Data Science Meetup.  For those who couldn&amp;#8217;t attend, &lt;a href="http://hilarymason.com/" target="_blank"&gt;Hilary Mason&lt;/a&gt; fought off jetlag and a tough cold to give a great presentation.&lt;/p&gt;
&lt;p class="mcePageBreak"&gt;Below is a 12-minute clip from Hilary&amp;#8217;s talk, which she called &amp;#8220;Dirty Secrets of Data Science.&amp;#8221;&lt;/p&gt;
&lt;!-- more --&gt;

&lt;p&gt;&lt;strong&gt;&lt;iframe frameborder="0" height="281" src="http://player.vimeo.com/video/60661618" width="500"&gt;&lt;/iframe&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;P.S. - If you&amp;#8217;re in NYC and haven&amp;#8217;t yet, make sure you &lt;a href="http://www.meetup.com/NYC-Data-Science/" target="_blank"&gt;join the NYC Data Science Meetup Group&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/45192413994</link><guid>http://blog.mortardata.com/post/45192413994</guid><pubDate>Tue, 12 Mar 2013 11:35:00 -0400</pubDate><category>data science</category><category>NYC Data Science Meetup</category></item><item><title>Pig Eye for the SQL Guy</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;span&gt;&lt;/span&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Cat Miller&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Cat_Miller.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;For anyone who came of programming age before cloud computing burst its way into the technology scene, data analysis has long been synonymous with SQL. A slightly awkward, declarative language whose production can more resemble logic puzzle solving than coding, SQL and the relational databases it builds on have been the pervasive standard for how to deal with data.&lt;/p&gt;
&lt;p&gt;As the world has changed, so too has our data; an ever-increasing amount of data is now stored without a rigorous schema, or must be joined to outside data sets to be useful. Compounding this problem, often the amounts of data are so large that working with them on a traditional SQL database is so non-performant as to be impractical.&lt;/p&gt;
&lt;p&gt;Enter Pig, a SQL-like language that gracefully tolerates inconsistent schemas, and that runs on Hadoop. (Hadoop is a massively parallel platform for processing the largest of data sets in reasonable amounts of time. Hadoop powers Facebook, Yahoo, Twitter, and LinkedIn, to name a few in a growing list.)&lt;/p&gt;
&lt;p&gt;This then is a brief guide for the SQL developer diving into the waters of Pig Latin for the first time. Pig is similar enough to SQL to be familiar, but divergent enough to be disorienting to newcomers. The goal of this guide is to ease the friction in adding Pig to an existing SQL skillset.&lt;!-- more --&gt;&lt;/p&gt;
&lt;h3&gt;WHAT&amp;#8217;S SIMILAR?&lt;/h3&gt;
&lt;p&gt;&lt;span&gt;The basic concepts in SQL map pretty well onto Pig. There are analogues for the major SQL keywords, and as a result you can write a query in your head as SQL and then translate it into Pig Latin without undue mental gymnastics.&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;span&gt;WHERE → FILTER&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;The syntax is different, but conceptually this is still putting your data into a funnel to create a smaller dataset.&lt;/span&gt;&lt;/p&gt;
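&lt;p&gt;As a rough sketch of the mapping, assume a grades relation with fields student, course, and value (the field names and the threshold are illustrative, not taken from a real data set):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- SQL equivalent: SELECT * FROM grades WHERE value &amp;gt;= 90;
-- 'grades' is assumed to have been loaded earlier with fields (student, course, value)
high_scores = FILTER grades BY value &amp;gt;= 90;&lt;/code&gt;&lt;/pre&gt;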
&lt;p&gt;&lt;strong&gt;&lt;span&gt;HAVING → FILTER&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Because a FILTER is done in a separate step from a GROUP or an aggregation, the distinction between HAVING and WHERE doesn’t exist in Pig.&lt;/p&gt;
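&lt;p&gt;A minimal sketch of what replaces HAVING, again assuming a hypothetical grades relation with fields student, course, and value. The aggregate is computed in its own step, and the FILTER that follows is an ordinary one:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- SQL equivalent:
--   SELECT student, AVG(value) FROM grades GROUP BY student HAVING AVG(value) &amp;lt; 70;
by_student = GROUP grades BY student;
averages = FOREACH by_student GENERATE group AS student, AVG(grades.value) AS avg_value;
-- the same FILTER keyword that stands in for WHERE, applied after the aggregation
struggling = FILTER averages BY avg_value &amp;lt; 70.0;&lt;/code&gt;&lt;/pre&gt;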
&lt;p&gt;&lt;strong&gt;&lt;span&gt;ORDER BY → ORDER&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This keyword behaves pretty much the same in Pig as in SQL.&lt;/p&gt;
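&lt;p&gt;For illustration, assuming the familiar emp table with a sal field, a sort followed by a LIMIT gives a simple top-N (the alias names are made up for this sketch):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- SQL equivalent: SELECT * FROM emp ORDER BY sal DESC;
by_salary = ORDER emp BY sal DESC;
-- keep only the ten highest-paid rows
top_paid = LIMIT by_salary 10;&lt;/code&gt;&lt;/pre&gt;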
&lt;p&gt;&lt;strong&gt;&lt;span&gt;JOIN&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In Pig, joins can have their execution specified, and they look a little different, but in essence these are the same joins you know from SQL, and you can think about them in the same way. There are INNER and OUTER joins, RIGHT and LEFT specifications, and even CROSS for those rare moments that you actually want a Cartesian product.&lt;/p&gt;
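&lt;p&gt;A short sketch of the join flavors mentioned above, assuming emp and dept relations that share a deptno field (the alias names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- inner join (the default)
emp_dept = JOIN emp BY deptno, dept BY deptno;
-- left outer join: keeps employees whose deptno has no match in dept
emp_dept_all = JOIN emp BY deptno LEFT OUTER, dept BY deptno;
-- Cartesian product, for the rare case where you really want one
every_pair = CROSS emp, dept;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;After a join, a field that exists in both inputs is disambiguated with the relation name, for example emp::deptno.&lt;/p&gt;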
&lt;p&gt;Because Pig is most appropriately used for data pipelines, there are often fewer distinct relations or tables than you would expect to see in a traditional normalized relational database.&lt;/p&gt;

&lt;h3&gt;CONTROL OVER EXECUTION&lt;/h3&gt;
&lt;p&gt;&lt;span&gt;SQL performance tuning generally involves some fiddling with indexes, punctuated by the occasional yelling at an explain plan that has inexplicably decided to join the two largest tables first. It can mean getting a different plan the second time you run a query, or having the plan suddenly change after several weeks of use because the statistics have evolved, throwing your query’s performance into the proverbial toilet.&lt;/span&gt;&lt;/p&gt;
&lt;div class="entry"&gt;
&lt;p&gt;Various SQL implementations offer hints to combat this problem—you can use a hint to tell your SQL optimizer that it should use an index, or to force a given table to be first in the join order. Unfortunately, because hints are dependent on the particular SQL implementation, what you actually have at your disposal varies by platform.&lt;/p&gt;
&lt;p&gt;Pig offers a few different ways to control the execution plan. The first is just the explicit ordering of operations. You can write your FILTER before your JOIN (the reverse of SQL’s order) and be clever about eliminating unused fields along the way, and have confidence that the executed order will not be worse.&lt;/p&gt;
&lt;p&gt;Second, the philosophy of Pig is to allow users to choose implementations where multiple ones are possible. As a result, there are three specialized joins that can be used when the features of the data are known and make a regular join less appropriate. For regular joins, the order of the arguments dictates execution: the larger data set should appear last in this type of join.&lt;/p&gt;
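&lt;p&gt;The specialized joins are selected with a USING clause. The sketch below assumes they are Pig&amp;#8217;s replicated, skewed, and merge joins; every relation and field name in it is hypothetical:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- replicated: the relations after the first must be small enough to fit in memory
j_replicated = JOIN big_table BY id, small_table BY id USING 'replicated';
-- skewed: for keys whose values are very unevenly distributed
j_skewed = JOIN clicks BY user_id, users BY user_id USING 'skewed';
-- merge: both inputs are assumed to be already sorted on the join key
j_merge = JOIN sorted_a BY id, sorted_b BY id USING 'merge';&lt;/code&gt;&lt;/pre&gt;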
&lt;p&gt;As with SQL, in Pig you can pretty much ignore the performance tweaks until you can’t. Because of the explicit control of ordering, it can be useful to have a general sense of the “good” order to do things in, though Pig’s optimizer will also try to push up FILTERs and LIMITs, taking some of the pressure off.&lt;/p&gt;
&lt;/div&gt;

&lt;h3&gt;WHAT&amp;#8217;S DIFFERENT?&lt;/h3&gt;
&lt;h5&gt;A Row By Any Other Name&lt;/h5&gt;
&lt;p&gt;The SQL paradigm is very straightforward—there are tables, and tables contain rows. Every select statement yields a set of rows, and each field in a row is a basic data type. Conceptually, the result of any SQL select statement can be imported into Excel without loss of information.&lt;/p&gt;
&lt;p&gt;Pig introduces a mature nesting notion in tuples and data bags that changes the game significantly. Pig consists of data sets called relations (sometimes called aliases for the names they are given), and those contain records that are data tuples, which can in turn recursively contain data bags, data tuples, or data items. There is a distinct lack of flatness in Pig, and the best way to see it is to explore how GROUP works.&lt;/p&gt;
&lt;h5&gt;Not Your Grandmother’s GROUP&lt;/h5&gt;
&lt;p&gt;In the handling of GROUP, SQL and Pig diverge significantly. SQL’s GROUP doesn’t exist outside of the aggregation performed on it; you would never SELECT * GROUP by field1–it just doesn’t make sense. Because everything in SQL is a row, the grouping created isn’t persistent—only the data produced aggregating over it remains.&lt;/p&gt;
&lt;p&gt;Pig’s GROUP is an entirely different beast, albeit used for the same purpose. It is a persistent relation that can be used again, independent of what aggregations you might choose to perform on it later.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;student_grades = GROUP grades by student;&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This student_grades relation has two fields: one called group and populated with the value of student, the other called grades populated with a data bag containing tuples for all the entries with the same value of student.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;group	grades
alyssa 	{&amp;lt; hacking101, 95 &amp;gt;, &amp;lt; english, 60 &amp;gt;}
ben	{&amp;lt; math, 90 &amp;gt;}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now to do an aggregation, perform it on the student_grades alias.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;average_grades = FOREACH student_grades GENERATE group, AVG(grades.value);&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Procedural Paradise&lt;/h5&gt;
&lt;p&gt;This is the first thing any Google search on Pig will tell you, and it is the most glaringly obvious change from SQL. After having taught your brain for years how to turn an idea inside-out and mash all of its pieces into one query, Pig makes query writing feel like writing Java or C++. In addition to the obvious potential cognitive benefits, this has some technical ones as well.&lt;/p&gt;
&lt;h5&gt;Subquery Reuse&lt;/h5&gt;
&lt;p&gt;Ever write a query with a subselect, and then realize you actually needed to use that subtable twice in the query? Did you feel absolutely awful as you cut and pasted that subtable? Did you wonder whether your SQL plan would successfully manage to not calculate it twice? (Note: the WITH clause mitigates this pain in a lot of cases, but isn’t available in all flavors of SQL.)&lt;/p&gt;
&lt;p&gt;Because in Pig Latin every step has a declared alias, reusing “subquery” tables is natural and intuitive, and generally does not involve building them twice.&lt;/p&gt;
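&lt;p&gt;As a sketch of that reuse, assuming an emp relation with sal and deptno fields, the filtered alias below plays the role of the subselect and is referenced twice (all alias names are illustrative):&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- the "subquery": defined once, under a name
high_earners = FILTER emp BY sal &amp;gt; 3000;

-- first use: count high earners per department
by_dept = GROUP high_earners BY deptno;
dept_counts = FOREACH by_dept GENERATE group AS deptno, COUNT(high_earners) AS num_high_earners;

-- second use: attach the per-department count back onto each high earner
with_counts = JOIN high_earners BY deptno, dept_counts BY deptno;&lt;/code&gt;&lt;/pre&gt;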
&lt;h5&gt;Getting multiple queries out of one pipeline&lt;/h5&gt;
&lt;p&gt;In SQL you can find yourself in a place where you want to use the data, do some manipulation on it, and then take it in a few different directions. To do this in one query requires profligate use of JOIN, and enough parens to intimidate a LISP hacker.&lt;/p&gt;
&lt;p&gt;In a Pig pipeline, any and all aliases produced along the way can be stored, and all it takes is adding a new STORE statement to the script.&lt;/p&gt;
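&lt;p&gt;For example, a pipeline might keep both an intermediate result and the final aggregate. This is a sketch only; the output paths and alias names are placeholders:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;filtered_emp = FILTER emp BY sal &amp;gt; 3000;
by_dept = GROUP filtered_emp BY deptno;
avg_by_dept = FOREACH by_dept GENERATE group AS deptno, AVG(filtered_emp.sal) AS avg_sal;

-- one STORE per result you want to keep; the paths are hypothetical
STORE filtered_emp INTO 'output/filtered_emp';
STORE avg_by_dept INTO 'output/avg_by_dept';&lt;/code&gt;&lt;/pre&gt;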
&lt;h5&gt;User-Defined Functions&lt;/h5&gt;
&lt;p&gt;SQL has had decades for people to figure out what analytic functions they need for arbitrary data analysis, and so when you find yourself suddenly interested in extracting the day of the week from a date, that function is ready and waiting.&lt;/p&gt;
&lt;p&gt;Pig’s list of built-in functions is growing, but is still dwarfed by what Oracle or MySQL provides. What turns this into a tolerable constraint is that Pig allows the user to define aggregate or analytic functions in other languages (Java, Python, and others) and then apply them in Pig quickly and without fuss.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;REGISTER udf.jar;

new_data = FOREACH my_data GENERATE udf.ImportantFunction(field1);&lt;/code&gt;&lt;/pre&gt;
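&lt;p&gt;A Python UDF is registered in much the same way. The sketch below uses Pig’s Jython support; the file name my_udfs.py and the function important_function are hypothetical, and the Python function would need to declare its output schema (typically with the outputSchema decorator) so Pig knows what type it returns:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- my_udfs.py and important_function are placeholder names
REGISTER 'my_udfs.py' USING jython AS my_udfs;

new_data = FOREACH my_data GENERATE my_udfs.important_function(field1);&lt;/code&gt;&lt;/pre&gt;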
&lt;h3&gt;The Well-Disguised SQLer&lt;/h3&gt;
&lt;p&gt;In general, if you can think about it in SQL, you can do it in Pig. Be aware of the nested data structures, have a cheat sheet for syntax, and relish the ability to write queries the way your brain thinks them, and not the way SQL demands.&lt;/p&gt;
&lt;p&gt;As a final thought, let’s resurrect our old friend the emp table, and take a look at some SQL to Pig Latin examples.&lt;/p&gt;
&lt;h5&gt;Average Salary by Location&lt;/h5&gt;
&lt;h5&gt;SQL&lt;/h5&gt;
&lt;pre&gt;&lt;code&gt;SELECT loc, AVG(sal) FROM emp JOIN dept USING(deptno) WHERE sal &amp;gt; 3000 GROUP BY loc;&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Pig Latin&lt;/h5&gt;
&lt;pre&gt;&lt;code&gt;filtered_emp = FILTER emp BY sal &amp;gt; 3000;

emp_join_dept = JOIN filtered_emp BY deptno, dept BY deptno;

grouped_by_loc = GROUP emp_join_dept BY loc;

avg_salary = FOREACH grouped_by_loc GENERATE group, AVG(emp_join_dept.sal);&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Ordered Average Salary by Location&lt;/h5&gt;
&lt;p&gt;Suppose now that the following is true.&lt;/p&gt;
&lt;p&gt;● The ‘loc’ field is a string/varchar field, and we have two pieces of software that automatically populate it. One stores values as lowercase, one as uppercase. (If this seems like a contrived example to you, you have chosen your employers and software vendors well.)&lt;/p&gt;
&lt;p&gt;The new parts of the queries appear in bold.&lt;/p&gt;
&lt;h5&gt;SQL&lt;/h5&gt;
&lt;pre&gt;&lt;code&gt;SELECT standard_loc, AVG(sal) avg_salary FROM
(SELECT UPPER(loc) standard_loc, sal FROM emp JOIN dept USING(deptno) WHERE sal &amp;gt; 3000) std_table GROUP BY standard_loc;&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Pig Latin&lt;/h5&gt;
&lt;pre&gt;&lt;code&gt;filtered_emp = FILTER emp BY sal &amp;gt; 3000;

emp_join_dept = JOIN filtered_emp BY deptno, dept BY deptno;

grouped_by_loc = GROUP emp_join_dept BY UPPER(loc);

avg_salary = FOREACH grouped_by_loc GENERATE group, AVG(emp_join_dept.sal);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This kind of change is friendlier to Pig because of the limitations in SQL’s GROUP BY clause; the trade-off of verboseness for clarity increases in value the more complex your query gets.&lt;/p&gt;
&lt;h5&gt;Above Average Salary for Location&lt;/h5&gt;
&lt;p&gt;Now suppose instead of arbitrarily selecting 3,000 as a threshold, which is going to overselect people living in large expensive cities, we want to select those employees who make more than twice the average for their location.&lt;/p&gt;
&lt;h5&gt;SQL&lt;/h5&gt;
&lt;p&gt;There are several ways to accomplish this, of course, but for illustration purposes the most vanilla version is shown here.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;SELECT empno FROM 
(SELECT empno, UPPER(loc) standard_loc, sal FROM emp JOIN dept USING(deptno)) std_table1 JOIN
(SELECT standard_loc, AVG(sal) avg_salary FROM
(SELECT UPPER(loc) standard_loc, sal FROM emp JOIN dept USING(deptno)) std_table2 GROUP BY standard_loc) grp_table2
USING(standard_loc) WHERE sal &amp;gt; (2 * avg_salary);&lt;/code&gt;&lt;/pre&gt;
&lt;h5&gt;Pig Latin&lt;/h5&gt;
&lt;pre&gt;&lt;code&gt;emp_join_dept = JOIN emp BY deptno, dept BY deptno;

grouped_by_loc = GROUP emp_join_dept BY UPPER(loc);

loc_avg_salary = FOREACH grouped_by_loc GENERATE AVG(emp_join_dept.sal) as avg_salary, FLATTEN(emp_join_dept);

highly_paid = FILTER loc_avg_salary BY sal &amp;gt; (2 * avg_salary);&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Here the dreaded subquery reuse rears its unattractive head in SQL and leads to a bit of a frankenquery. In contrast, the Pig script stays the same length, because we can effectively just shift the FILTER from the beginning to the end of our data flow. FLATTEN is a new entry to our function arena, for which there is no SQL analogue. What FLATTEN does is unnest tuples and data bags; in this example it’s taking the data bag ‘emp_join_dept’ created by the GROUP function, and removing the nesting so that the fields within it will be at the same level as avg_salary.&lt;/p&gt;
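&lt;p&gt;If you want to see the nesting for yourself, DESCRIBE prints the schema of any alias. The lines below are a sketch that could be appended to the script above; the last statement (with the illustrative alias highly_paid_empno) trims the result to just the employee number, which is what the SQL version selects:&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;-- before FLATTEN: one row per location, holding a bag of joined rows
DESCRIBE grouped_by_loc;

-- after FLATTEN: one row per employee, next to that location's avg_salary
DESCRIBE loc_avg_salary;

-- keep only the employee number, as the SQL query does
highly_paid_empno = FOREACH highly_paid GENERATE empno;&lt;/code&gt;&lt;/pre&gt;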
&lt;h3&gt;Run It Yourself for the Swine-Curious&lt;/h3&gt;
&lt;p&gt;If you want to try out some Pig examples hands-on, you can get a free Mortar account &lt;a href="https://app.mortardata.com/signup" target="_blank"&gt;here&lt;/a&gt;. We’ve generated a one million row emp table data set, so to run these examples on your own all you need to do is:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;Sign up at app.mortardata.com for a free account.&lt;/li&gt;
&lt;li&gt;Go to Web Projects&lt;/li&gt;
&lt;li&gt;Select My Web Projects -&amp;gt; New Blank project&lt;/li&gt;
&lt;li&gt;To load the two data files, you need LOAD statements, and to store the resulting data you need a STORE statement, so in the end your pig script should look like this:
&lt;pre&gt;dept = LOAD 's3n://mortar-example-data/employee/dept.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
    AS (deptno:int, dname:chararray, loc:chararray);

emp = LOAD 's3n://mortar-example-data/employee/emp.csv' USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'NOCHANGE', 'SKIP_INPUT_HEADER')
    AS (empno:int, ename:chararray, job:chararray, mgr:int, hiredate:chararray, sal:float, deptno:int);

-- Pig example code here

STORE avg_salary INTO 's3n://mortar-example-output-data/$MORTAR_EMAIL_S3_ESCAPED/emp_example' USING PigStorage('\t');
&lt;/pre&gt;
&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Click Run and select a 2-node cluster. Even with one million rows, it runs in under three minutes. Sit back and watch the power of modern computing tackle the problems of the 1980s.&lt;/span&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;</description><link>http://blog.mortardata.com/post/44311197563</link><guid>http://blog.mortardata.com/post/44311197563</guid><pubDate>Fri, 01 Mar 2013 15:16:00 -0500</pubDate><category>hadoop</category><category>sql</category><category>pig</category><category>piglatin</category></item><item><title>MongoDB + Hadoop: Why &amp; How</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Jeremy Karn&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt; &lt;img alt="image" src="http://mortardata.com/assets/team-blog/Jeremy_Karn.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;You have MongoDB, so you have this tremendously scalable database. You’re collecting a ton of data, but you know you need to do more with it (okay, a lot more). You think you want to use Hadoop, but it doesn’t sound easy.&lt;/p&gt;
&lt;p&gt;To keep it simple, we’ve divided the article into three parts:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;&lt;a href="#14022013-WHY" target="_self"&gt;&lt;strong&gt;&amp;#8220;WHY&amp;#8221;&lt;/strong&gt;&lt;/a&gt; explains the reasons for using Hadoop to process data stored in MongoDB&lt;/li&gt;
&lt;li&gt;&lt;a href="#14022013-HOW" target="_self"&gt;&lt;strong&gt;&amp;#8220;HOW&amp;#8221;&lt;/strong&gt;&lt;/a&gt; helps you get get set up&lt;/li&gt;
&lt;li&gt;&lt;a href="#14022013-DEMO" target="_self"&gt;&lt;strong&gt;&amp;#8220;DEMO&amp;#8221;&lt;/strong&gt;&lt;/a&gt; shows you MongoDB and Hadoop working together. If you’re a tldr; type, you’ll want to start with this section.&lt;/li&gt;
&lt;/ol&gt;&lt;h3&gt;&lt;!-- more --&gt;&lt;/h3&gt;
&lt;h3 id="14022013-WHY"&gt;WHY&lt;/h3&gt;
&lt;p&gt;Mongo was built for data storage and retrieval, and Hadoop was written for data processing. So naturally, data processing is often better offloaded to Hadoop. Here’s why:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Easier, more expressive languages&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MongoDB supports native MapReduce, but MapReduce is a pain in the ass. The Hadoop community has created Pig, Hive, and the Cascading family of languages—all of which compile to MapReduce, but are expressive and high-level.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Libraries to build on&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;MongoDB’s native MapReduce jobs are written in JavaScript, and very few popular data processing libraries are written in JavaScript, so you’ll often find yourself without access to the libraries you need, such as NumPy, NLTK, etc.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Big performance improvements&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Hadoop is purpose-built for fully distributed, multi-threaded execution of data processing, so it performs much, much better for this sort of work.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Separate workloads mean less load&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you’re doing significant data processing on MongoDB, it can add substantial load. You may need an order of magnitude more power to process data than to store it, so it works really well to separate those concerns and separate those workloads.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;p&gt;So if you want to easily write distributed jobs that perform well and don’t add load to your primary storage system, Hadoop is probably the way to go.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Bonus:&lt;/strong&gt;&lt;/em&gt; We’ve included ready-to-use Hadoop code to extract the “schema” of your MongoDB, and characterize how that schema is used in the demo section.&lt;/p&gt;
&lt;h3 id="14022013-HOW"&gt;HOW&lt;/h3&gt;
&lt;p&gt;We used Mortar for this demo because it’s free for this purpose, and you won’t need to set up any infrastructure. Mortar is an open source framework for easily writing/developing your Hadoop jobs coupled with Hadoop-as-a-Service for running Mortar jobs on Hadoop clusters. If you’re going to use Mortar, &lt;a href="#14022013-DEMO" target="_self"&gt;skip to the DEMO section.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Otherwise, you need:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Hadoop Cluster&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;There are many ways to run Hadoop; here are a couple:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Set up your own cluster:&lt;/strong&gt; You&amp;#8217;ll need some machines and you&amp;#8217;ll need to &lt;a href="http://hadoop.apache.org/docs/stable/cluster_setup.html" target="_blank"&gt;follow these instructions&lt;/a&gt;. &lt;em&gt;&lt;strong&gt;Warning:&lt;/strong&gt;&lt;/em&gt; this is not for the faint of heart, and probably should be reserved for companies with substantial resources and serious sys admin chops.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Use Amazon Elastic MapReduce (EMR):&lt;/strong&gt; EMR is an offering by Amazon Web Services (AWS) that allows you to run a Hadoop-based job on a Hadoop cluster in the cloud. Aside from all of the typical cloud benefits that you get from doing this, you also get to skip the setup and configuration of a Hadoop cluster. There’s a step-by-step guide to setting up EMR &lt;a href="http://www.zdnet.com/big-data-on-amazon-elastic-mapreduce-step-by-step-7000009361/" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Input Connector&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;To load your data from MongoDB into Pig, you’ll need the &lt;a href="https://github.com/mortardata/mongo-hadoop" target="_blank"&gt;Pig loader&lt;/a&gt;. Here&amp;#8217;s &lt;a href="http://help.mortardata.com/#!/mongodb" target="_blank"&gt;documentation on how you can use the loader&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Data Processing Language&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;In this demo you’ll process your data on Hadoop with &lt;a href="http://pig.apache.org/" target="_blank"&gt;Apache Pig&lt;/a&gt;, a high-level data flow language that compiles down into Hadoop MapReduce jobs. It was designed to be easy to learn and simple to write. If you’ve written SQL, Pig will feel familiar—it is like procedural SQL. For more details on Pig, check out &lt;a href="http://blog.mortardata.com/post/33711299619/8-reasons-you-should-be-using-apache-pig" target="_blank"&gt;“8 reasons you should be using Apache Pig”&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If you don&amp;#8217;t want to write a Pig script and would prefer to stick with raw Hadoop MapReduce jobs, the &lt;a href="https://github.com/mortardata/mongo-hadoop" target="_blank"&gt;Mongo-Hadoop project&lt;/a&gt; will support that, but we won’t cover raw MapReduce in this article since it requires 10x more code without much benefit.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Output Connector&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;OK, now it’s time to choose an output destination for your processed data. You have a lot of options here:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;
&lt;p&gt;&lt;strong&gt;MongoDB&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you want to write your final results back to MongoDB, the Mongo-Hadoop project also contains support for this with MongoStorage.&lt;/p&gt;
&lt;p&gt;For documentation on how to do this read the “Storing Data to MongoDB” section under &lt;a href="http://help.mortardata.com/#!/mongodb" target="_blank"&gt;“Using MongoDB Data”&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Amazon’s S3&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;This is one of the most useful output locations. If you&amp;#8217;re running your Hadoop job on a cluster that&amp;#8217;s running in AWS, then your data transfer will be extremely fast and you get all of the benefits of storing your data in the cloud.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;HDFS&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;If you&amp;#8217;re running your own long-lived cluster, you can write the results to the cluster’s distributed file system. This can be useful if you&amp;#8217;re going to be doing more processing on the data later.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;/li&gt;
&lt;/ol&gt;&lt;h3 id="14022013-DEMO"&gt;DEMO&lt;/h3&gt;
&lt;p&gt;The fun part! Here’s a quick step-by-step example that should take just a few minutes.&lt;/p&gt;
&lt;p&gt;To get started, we&amp;#8217;ve already set up a small MongoDB instance on &lt;a href="https://mongolab.com/welcome/" target="_blank"&gt;MongoLab&lt;/a&gt;, populated it with a random sampling of Twitter data from a single day (around 120,000 tweets), and created a read-only user for you.&lt;/p&gt;
&lt;p&gt;We&amp;#8217;ve also set up a public GitHub repo with a Mortar project that has three Pig scripts ready to run. Here&amp;#8217;s what you need to do:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;
&lt;p&gt;If you don’t already have a free GitHub account, &lt;a href="https://github.com/" target="_blank"&gt;create&lt;/a&gt; one. You’ll need a GitHub username in step 4.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Sign into (or &lt;a href="https://app.mortardata.com/signup" target="_blank"&gt;create&lt;/a&gt;) your free Mortar account.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;After you receive the confirmation email, log into Mortar at &lt;a href="https://app.mortardata.com" target="_blank"&gt;https://app.mortardata.com&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Install the Mortar Development Framework:&lt;/p&gt;
&lt;pre&gt;    $ gem install mortar
&lt;/pre&gt;
&lt;p&gt;(full installation details &lt;a href="http://help.mortardata.com/#!/install_mortar_development_framework" target="_blank"&gt;here&lt;/a&gt;)&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Clone the example git project and register it as a mortar project:&lt;/p&gt;
&lt;pre&gt;    $ git clone git@github.com:mortardata/mongo-pig-examples.git
    
    $ cd mongo-pig-examples
    
    $ mortar register mongo-pig-examples
&lt;/pre&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;h4&gt;Script 1 - Characterize Collection&lt;/h4&gt;
&lt;p&gt;If you’re like most MongoDB users, you may not have a great sense of the different fields, data types, or values in your collection. We built characterize_collection.pig to deeply inspect your collection to extract that information.&lt;/p&gt;
&lt;p&gt;From the base directory of the mongo-pig-examples project you just cloned, take a look at pigscripts/characterize_collection.pig. It loads all the data in the collection as a map, sends the map to Python (udfs/python/mongo_util.py) to gather a bunch of metadata, calculates some basic information about the collection, and then writes the results out to an S3 bucket.&lt;/p&gt;
&lt;p&gt;To see this script in action, let&amp;#8217;s run it on a 4-node Hadoop cluster. In your terminal (from the base directory of your mongo-pig-examples project) run:&lt;/p&gt;
&lt;pre&gt;    $ mortar run characterize_collection --clustersize 4
&lt;/pre&gt;
&lt;p&gt;This job will take about 10 minutes to finish. You can monitor the job&amp;#8217;s status on the command line or by going to &lt;a href="https://app.mortardata.com/jobs" target="_blank"&gt;https://app.mortardata.com/jobs&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Once the job has finished, you&amp;#8217;ll receive an email with a link to your job results. Clicking on this link will bring you into the Mortar web app, where you can download the results from S3. The output format is described at the top of the characterize_collection script, but as an example you can scroll down the output and find:&lt;/p&gt;
&lt;pre&gt;    …

    user.is_translator      2     false unicode     118806

    user.is_translator      2     true  unicode     31

    user.lang   26    en    unicode     114108

    user.lang   26    es    unicode     3462

    user.lang   26    fr    unicode     532

    user.lang   26    pt    unicode     281

    user.lang   26    ja    unicode     79

    user.listed_count 398   0     int   73757

    user.listed_count 398   1     int   18518

    …
&lt;/pre&gt;
&lt;p&gt;Looking at the values for user.lang, we see that there are 26 unique values for the field in our dataset. The most common was “en” with 114108 occurrences, the next most common was “es” with 3462 occurrences, and so on. To see the full results without running the job you can view the output file &lt;a href="https://github.com/mortardata/mongo-pig-examples/blob/master/example_output/Characterize%20collection%20result%20for%20Mortar%20Twitter%20sample.tsv" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Script 2 - Mongo Schema Generator&lt;/h4&gt;
&lt;p&gt;It can be tricky to properly declare Mongo’s highly nested schemas in Pig. Now, Pig is graceful: it can roll without a schema, or with an inconsistent or incorrect one. But it’s easier to read and write your Pig code if you have a schema because it allows you (and the Pig optimizer) to focus on just the relevant data.&lt;/p&gt;
&lt;p&gt;So this next script automatically generates a Pig schema by examining your MongoDB collection. If you don’t need the whole schema, you can easily edit it to keep just the fields you want.&lt;/p&gt;
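&lt;p&gt;To give a feel for what a nested schema declaration looks like, here is a hand-written sketch (not the generator’s actual output). It assumes the loader class is com.mongodb.hadoop.pig.MongoLoader and that it accepts a Pig schema string; the connection string values are placeholders, and the field names are taken from the Characterize Collection output above. Outside of Mortar you would also need to REGISTER the Mongo-Hadoop jars first.&lt;/p&gt;
&lt;pre&gt;    -- USER, PASSWORD, HOST, PORT, DATABASE and COLLECTION are placeholders
    tweets = LOAD 'mongodb://USER:PASSWORD@HOST:PORT/DATABASE.COLLECTION'
             USING com.mongodb.hadoop.pig.MongoLoader('text:chararray, user:tuple(lang:chararray, listed_count:int)');

    -- nested fields are dereferenced with a dot
    langs = FOREACH tweets GENERATE user.lang AS lang, user.listed_count AS listed_count;
&lt;/pre&gt;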
&lt;p&gt;Running this script is similar to running the previous one. If you ran the Characterize Collection script in the past hour, the same cluster you used for that job should still be running. In that case, you can just run:&lt;/p&gt;
&lt;pre&gt;    $ mortar run mongo_schema_generator
&lt;/pre&gt;
&lt;p&gt;If you don&amp;#8217;t have a cluster that’s still running, just run the job on a new 4-node cluster like this:&lt;/p&gt;
&lt;pre&gt;    $ mortar run mongo_schema_generator --clustersize 4
&lt;/pre&gt;
&lt;h4&gt;Script 3 – Twitter Hourly Coffee Tweets&lt;/h4&gt;
&lt;p&gt;Using the pigscripts/hourly_coffee_tweets.pig script, we&amp;#8217;re going to demonstrate how we can use a small subset of the fields in our MongoDB collection. For our example, we’ll look at how often the word “coffee” is tweeted throughout the day. As with the Mongo Schema Generator script, you can run this job on an existing cluster or start up a new one.&lt;/p&gt;
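&lt;p&gt;The general shape of such a job can be sketched as follows. This is an illustration rather than the contents of hourly_coffee_tweets.pig: it assumes the com.mongodb.hadoop.pig.MongoLoader class with a schema string, placeholder connection values, and tweets that carry Twitter’s text and created_at fields, with created_at stored as a string in the format shown in the comment.&lt;/p&gt;
&lt;pre&gt;    -- load only the two fields we need; connection values are placeholders
    tweets = LOAD 'mongodb://USER:PASSWORD@HOST:PORT/DATABASE.COLLECTION'
             USING com.mongodb.hadoop.pig.MongoLoader('text:chararray, created_at:chararray');

    -- (?s) lets the pattern match tweets that contain line breaks
    coffee_tweets = FILTER tweets BY LOWER(text) MATCHES '(?s).*coffee.*';

    -- assumes created_at looks like 'Thu Feb 14 16:08:00 +0000 2013',
    -- so characters 11-12 hold the two-digit hour
    with_hour = FOREACH coffee_tweets GENERATE SUBSTRING(created_at, 11, 13) AS tweet_hour;

    by_hour = GROUP with_hour BY tweet_hour;

    hourly_counts = FOREACH by_hour GENERATE group AS tweet_hour, COUNT(with_hour) AS num_tweets;
&lt;/pre&gt;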
&lt;h4&gt;Next Steps&lt;/h4&gt;
&lt;p&gt;If you already have a MongoDB instance or cluster based in US-East EC2, the first two example scripts should run on one of your collections with only minor modifications. You&amp;#8217;ll just need to:&lt;/p&gt;
&lt;ol&gt;&lt;li&gt;
&lt;p&gt;Update the MongoLoader connection strings in the Pig scripts to connect to your MongoDB collections with one of your own users. If your MongoDB instance is on a non-standard port (any port other than 27017), just email us at &lt;a href="mailto:support@mortardata.com" target="_blank"&gt;support@mortardata.com&lt;/a&gt; to allow your Mortar account to access that port.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you&amp;#8217;d like your jobs to write to one of your own S3 buckets, you can update the AWS keys associated with your Mortar account by following these instructions &lt;a href="http://help.mortardata.com/#!/create_a_new_web_project" target="_blank"&gt;to enable S3 access&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you run out of free cluster hours with Mortar, you can &lt;a href="https://app.mortardata.com/account#!/plans" target="_blank"&gt;upgrade your account&lt;/a&gt; to get additional free hours each month.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;You can find more resources for learning Pig &lt;a href="http://help.mortardata.com/#!/pig_help_and_resources" target="_blank"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;If you have any questions or feedback, please contact us at &lt;a href="mailto:support@mortardata.com" target="_blank"&gt;support@mortardata.com&lt;/a&gt; or, if you&amp;#8217;re a Mortar user, ping us on in-app chat at &lt;a href="http://app.mortardata.com" target="_blank"&gt;app.mortardata.com&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;&lt;/div&gt;</description><link>http://blog.mortardata.com/post/43080668046</link><guid>http://blog.mortardata.com/post/43080668046</guid><pubDate>Thu, 14 Feb 2013 11:08:00 -0500</pubDate><category>MongoDB</category><category>Mongo Loader</category><category>Pig</category><category>Hadoop</category></item><item><title>Why we started the NYC Data Science Meetup</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;a href="http://mortardata.com/team" target="_self"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_self"&gt; &lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;New York’s data science community has been building since long before “data science” was used to describe it.  In addition to a long history of advertising and adtech companies, the recent startup explosion here in NYC has been largely led by companies built to leverage data science (including Foursquare, Tumblr, AppNexus, and Knewton, to name just a few).&lt;!-- more --&gt;&lt;/p&gt;
&lt;p&gt;This community has been fortunate to have a handful of fantastic advocates – people like &lt;a href="http://www.hilarymason.com/about/" target="_blank"&gt;Hilary Mason&lt;/a&gt;, &lt;a href="http://www.drewconway.com/Drew_Conway/About.html" target="_blank"&gt;Drew Conway&lt;/a&gt;, &lt;a href="http://www.johnmyleswhite.com/" target="_blank"&gt;John Myles White&lt;/a&gt;, &lt;a href="http://jakeporway.com/about/" target="_blank"&gt;Jake Porway&lt;/a&gt;, and others – who have worked tirelessly to create great educational opportunities like &lt;a href="http://www.datagotham.com/" target="_blank"&gt;DataGotham&lt;/a&gt; and amazing initiatives like &lt;a href="http://datakind.org/" target="_blank"&gt;DataKind&lt;/a&gt;.  Yet our conversations with people in the data science community (and the traffic explosion resulting from &lt;a href="http://blog.mortardata.com/post/40602271238/7-books-to-supercharge-your-data-education" target="_blank"&gt;our recent data education-focused post&lt;/a&gt;) helped us realize that people wanted even more opportunities to learn and collaborate.&lt;br/&gt;&lt;br/&gt;So we started &lt;a href="http://www.meetup.com/NYC-Data-Science/" target="_blank"&gt;the NYC Data Science meetup&lt;/a&gt; to help meet this growing demand.  Even then, we were surprised when the group grew to 160 members in just 24 hours … all without a scheduled event.&lt;br/&gt;&lt;br/&gt;And it’s not just current or aspiring data scientists who are thirsty to learn more.  Software engineers, statisticians, hackers, and many others are increasingly recognizing the value of rounding out their own skills with a broader education in data science.&lt;br/&gt;&lt;br/&gt;Given its focus on education, we couldn’t think of a better speaker for the inaugural NYC Data Science meetup than Hilary Mason, bitly’s Chief Scientist and one of the most influential data scientists in New York (or anywhere else, for that matter).  
Hilary is not only a tremendous advocate for NYC’s data science community, she also has a unique gift for making challenging subject matter accessible and exciting to audiences of all experience levels.&lt;br/&gt;&lt;br/&gt;If you love working with data and want to meet other awesome people who share your passion, &lt;a href="http://www.meetup.com/NYC-Data-Science/events/100703012/" target="_blank"&gt;sign up and join us&lt;/a&gt;.&lt;/p&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/41195980748</link><guid>http://blog.mortardata.com/post/41195980748</guid><pubDate>Tue, 22 Jan 2013 09:56:00 -0500</pubDate><category>data science</category><category>NYC</category></item><item><title>7 books to supercharge your data education</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;a href="http://mortardata.com/team" target="_blank"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_blank"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Working with data is HARD.&lt;span&gt;  &lt;/span&gt;Let’s face it, you’re brave to even attempt it, let alone make it your everyday job.&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Fortunately, some incredibly talented people have taken the time to compile and share their deep knowledge for you.&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Here are 7 books we recommend for picking up some new skills in 2013:&lt;/span&gt;&lt;strong&gt;&lt;a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520" target="_blank"&gt;&lt;span&gt;&lt;span&gt;&lt;br/&gt;&lt;!-- more --&gt;&lt;br/&gt;&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;strong&gt;&lt;a href="http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520" target="_blank"&gt;&lt;span&gt;Hadoop: The Definitive Guide&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;span&gt; by &lt;/span&gt;&lt;span&gt;&lt;a href="http://www.tom-e-white.com/p/about.html" target="_blank"&gt;&lt;span&gt;Tom White&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Tom works at Cloudera and is one of the foremost experts on Hadoop, having been an Apache Hadoop committer since February 2007.&lt;span&gt;  &lt;/span&gt;He is a Hadoop PMC member and a member of the Apache Software Foundation.&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;em&gt;&lt;span&gt;What people have said:&lt;/span&gt;&lt;/em&gt;&lt;span&gt;  “A comprehensive, ‘roll up your sleeves, here&amp;#8217;s some Java’ deep dive into Hadoop…  No single book will do Hadoop justice, but this book is the best attempt so far.” (via &lt;a href="http://www.amazon.com/review/R3H2YVQKFMXAPK/ref=cm_cr_pr_perm?ie=UTF8&amp;amp;ASIN=B0082FE448&amp;amp;linkCode=&amp;amp;nodeID=&amp;amp;tag=" target="_blank"&gt;&lt;span&gt;Amazon&lt;/span&gt;&lt;/a&gt;)&lt;/span&gt;&lt;span&gt;&lt;br/&gt;&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;&lt;strong&gt;&lt;a href="http://ofps.oreilly.com/titles/9781449302641/" target="_blank"&gt;&lt;span&gt;Programming Pig&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt; by &lt;a href="http://hortonworks.com/blog/meet-the-committer-part-one-alan-gates/" target="_blank"&gt;&lt;span&gt;Alan Gates&lt;/span&gt;&lt;/a&gt; (free online version!)&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Alan open-sourced Pig while at Yahoo! and later designed HCatalog.&lt;span&gt;  &lt;/span&gt;He’s currently a co-founder at Hortonworks, where he continues his extensive work on open-source projects.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;em&gt;&lt;span&gt;What people have said:&lt;/span&gt;&lt;/em&gt;&lt;span&gt;  “[T]his is an excellent book that covers the details of using Pig, from basic to advanced features.  It saved my bacon (if you&amp;#8217;ll pardon the expression&amp;#8230;) numerous times on a recent, challenging project.” (via &lt;a href="http://shop.oreilly.com/product/0636920018087.do" target="_blank"&gt;&lt;span&gt;O’Reilly&lt;/span&gt;&lt;/a&gt;)&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;(Bonus: We’ve compiled some additional Pig resources &lt;a href="http://help.mortardata.com/#%21/pig_help_and_resources" target="_blank"&gt;here&lt;/a&gt;&lt;/span&gt;&lt;span&gt;.)&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;strong&gt;&lt;a href="http://www.amazon.com/NoSQL-Distilled-Emerging-Polyglot-Persistence/dp/0321826620" target="_blank"&gt;&lt;span&gt;NoSQL Distilled&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;span&gt; by &lt;/span&gt;&lt;span&gt;&lt;a href="http://www.sadalage.com/" target="_blank"&gt;&lt;span&gt;Pramod Sadalage&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span&gt; and &lt;/span&gt;&lt;span&gt;&lt;a href="http://martinfowler.com/" target="_blank"&gt;&lt;span&gt;Martin Fowler&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span&gt;&lt;br/&gt;&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Pramod is as a DBA and developer at ThoughtWorks, an enterprise application development and integration company.  He pioneered the practices and processes of evolutionary database design and database refactoring.&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Martin is Thoughtworks’ Chief Scientist and pioneered various topics around object-oriented technology and agile methods.&lt;span&gt;  &lt;/span&gt;He’s an active speaker and author, having written six books on software development.&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;em&gt;&lt;span&gt;What people have said:&lt;/span&gt;&lt;/em&gt;&lt;span&gt; “The authors of this book present a wonderful, accessible, product-agnostic introduction to the world of NoSQL…  This book has demystified much of NoSQL for me and made it seem quite common-sensical.” (via &lt;a href="http://www.amazon.com/review/RLDC4MHW1IM6T/ref=cm_cr_rdp_perm?ie=UTF8&amp;amp;ASIN=0321826620&amp;amp;linkCode=&amp;amp;nodeID=&amp;amp;tag=" target="_blank"&gt;&lt;span&gt;Amazon&lt;/span&gt;&lt;/a&gt;)&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;strong&gt;&lt;a href="http://www.amazon.com/Python-Data-Analysis-Wes-McKinney/dp/1449319793" target="_blank"&gt;&lt;span&gt;Python for Data Analysis&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;span&gt; by &lt;/span&gt;&lt;span&gt;&lt;a href="http://blog.wesmckinney.com/" target="_blank"&gt;&lt;span&gt;Wes McKinney&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span&gt;.&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Wes is Python’s pied piper of data analysis.&lt;span&gt;  &lt;/span&gt;The MIT math major is the main developer of pandas&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;span&gt;, a Python data analysis library, and co-founder of Lambda Foundry.&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;em&gt;&lt;span&gt;What people have said:&lt;/span&gt;&lt;/em&gt;&lt;span&gt;  “One of the best and most practical programming books I&amp;#8217;ve ever read.&lt;span&gt;  &lt;/span&gt;Amazing job at introducing tools (ipython, pandas) that aren&amp;#8217;t well covered on the web.” (via &lt;a href="http://shop.oreilly.com/product/0636920023784.do" target="_blank"&gt;&lt;span&gt;O’Reilly&lt;/span&gt;&lt;/a&gt;)&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;strong&gt;&lt;a href="http://www.amazon.com/Machine-Learning-Hackers-Drew-Conway/dp/1449303714" target="_blank"&gt;&lt;span&gt;Machine Learning for Hackers&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;span&gt; by &lt;/span&gt;&lt;span&gt;&lt;a href="http://www.drewconway.com" target="_blank"&gt;&lt;span&gt;Drew Conway&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span&gt; and &lt;/span&gt;&lt;span&gt;&lt;a href="http://www.johnmyleswhite.com/" target="_blank"&gt;&lt;span&gt;John Myles White&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Drew is kind of a big deal in NYC’s data community: in addition to being a PhD candidate at NYU, he is IA Ventures’ “Scientist-in-Residence”, a co-organizer of Data Gotham, and co-founder of DataKind.&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;John is a Ph.D. candidate in the Department of Psychology at Princeton University, where he leverages his mathematical modeling and machine learning chops to understand human decision-making. &lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;em&gt;&lt;span&gt;What people have said:&lt;/span&gt;&lt;/em&gt;&lt;span&gt;  “Drew and John have written an excellent book on presenting machine learning concepts like classification, clustering, recommendation, network graphs, and SVMs to name a few. The authors do a great job of presenting how to apply these machine learning algorithms and explain the general concepts of the algorithms.” (via &lt;a href="http://www.amazon.com/review/R1A0UJT3BIM6T7/ref=cm_cr_pr_perm?ie=UTF8&amp;amp;ASIN=1449303714&amp;amp;linkCode=&amp;amp;nodeID=&amp;amp;tag=" target="_blank"&gt;Amazon&lt;/a&gt;)&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;&lt;br/&gt;&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;strong&gt;&lt;a href="http://shop.oreilly.com/product/0636920025085.do" target="_blank"&gt;&lt;span&gt;Hadoop Operations&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;span&gt; by &lt;a href="http://www.linkedin.com/in/esammer" target="_blank"&gt;&lt;span&gt;Eric Sammer&lt;/span&gt;&lt;/a&gt;&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Eric is a principal architect at Cloudera and an active speaker on large scale data processing, integration, and system management.&lt;span&gt;  &lt;/span&gt;Prior to Cloudera, he worked at various startups for over a decade as a DBA, SysAdmin, software engineer, and system architect. &lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;em&gt;&lt;span&gt;What people have said:&lt;/span&gt;&lt;/em&gt;&lt;span&gt;  “Whether the topic is HDFS and how data is ingested and replicated, or how Map/Reduce &amp;#8220;finds&amp;#8221; the most suitable node to run it&amp;#8217;s tasks on, or what the cost and performance advantages are of adopting the shared-nothing, commodity model recommended for Hadoop clusters, etc., etc., etc., this book provides the how, what, when, where and why of Hadoop (the missing manual, of sorts).” (via &lt;a href="http://www.amazon.com/review/RU73YHDRUSQN5/ref=cm_cr_rdp_perm?ie=UTF8&amp;amp;ASIN=1449327052&amp;amp;linkCode=&amp;amp;nodeID=&amp;amp;tag=" target="_blank"&gt;&lt;span&gt;Amazon&lt;/span&gt;&lt;/a&gt;) &lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;strong&gt;&lt;a href="http://ofps.oreilly.com/titles/9781449326265/index.html" target="_blank"&gt;&lt;span&gt;Agile Data&lt;/span&gt;&lt;/a&gt;&lt;/strong&gt;&lt;span&gt; by &lt;/span&gt;&lt;a href="http://datasyndrome.com/" target="_blank"&gt;&lt;span&gt;&lt;span&gt;Russell Jurney&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;&lt;span&gt; (free online version!)&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Those who have met Russell (or followed him on Twitter) know him as a hilarious force of nature, but his data science chops are no joke.&lt;span&gt;  &lt;/span&gt;After working at Ning and LinkedIn, Russell is now Hortonworks’ Hadoop Evangelist.&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;span&gt;What people have said:&lt;/span&gt;&lt;/em&gt;&lt;span&gt;  “This is definitely the best book I&amp;#8217;ve ever written.”  (Nice review, Russell…)  (via &lt;a href="http://shop.oreilly.com/product/0636920025054.do" target="_blank"&gt;&lt;span&gt;O’Reilly&lt;/span&gt;&lt;/a&gt;)&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt; &lt;/span&gt;&lt;/p&gt;
&lt;p&gt;&lt;span&gt;&lt;em&gt;Want an easy way to test your skills for free?  &lt;a href="http://www.mortardata.com" target="_blank"&gt;Try Mortar&amp;#8217;s web projects&lt;/a&gt; and get up-and-running quickly on Hadoop (no previous experience required).&lt;/em&gt;&lt;br/&gt;&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/40602271238</link><guid>http://blog.mortardata.com/post/40602271238</guid><pubDate>Tue, 15 Jan 2013 10:04:00 -0500</pubDate><category>Pig</category><category>Hadoop</category><category>Python</category><category>Machine Learning</category><category>NoSQL</category></item><item><title>Mortar co-founder Jeremy Karn gave this talk on using MongoDB...</title><description>&lt;object name="kaltura_player_1355413120" id="kaltura_player_1355413120" type="application/x-shockwave-flash" height="330" width="400" data="http://www.kaltura.com/index.php/kwidget/wid/1_qy4clfaz/uiconf_id/48501"&gt;&lt;param name="allowScriptAccess" value="always" /&gt;&lt;param name="allowNetworking" value="all" /&gt;&lt;param name="allowFullScreen" value="true" /&gt;&lt;param name="bgcolor" value="#000000" /&gt;&lt;param name="movie" value="http://www.kaltura.com/index.php/kwidget/wid/1_qy4clfaz/uiconf_id/48501" /&gt;&lt;param name="flashVars" value="" /&gt;&lt;/object&gt;&#13;
&lt;iframe src="http://www.slideshare.net/slideshow/embed_code/15609639" width="427" height="356" frameborder="0" marginwidth="0" marginheight="0" scrolling="no" style="border:1px solid #CCC;border-width:1px 1px 0;margin-bottom:5px"&gt; &lt;/iframe&gt; &lt;br/&gt;&lt;br/&gt;&lt;p&gt;Mortar co-founder Jeremy Karn gave this talk on using MongoDB data with Hadoop (and specifically with Apache Pig) at MongoSV.&lt;/p&gt;
&lt;p&gt;Jeremy’s presentation covers the steps needed to read JSON from Mongo into Pig, process it in parallel on Hadoop with sophisticated functions, and write back to Mongo.&lt;/p&gt;
&lt;p&gt;Jeremy was a big part of our contributions to the Mongo Hadoop connector, which we extended to work with Pig. MongoDB creator (and 10gen founder) Dwight Merriman &lt;a href="http://bit.ly/TdbiNt" target="_blank"&gt;also gave Mortar a nice shout out&lt;/a&gt;.&lt;/p&gt;</description><link>http://blog.mortardata.com/post/37802590670</link><guid>http://blog.mortardata.com/post/37802590670</guid><pubDate>Wed, 12 Dec 2012 14:28:00 -0500</pubDate><category>MongoDB</category><category>MongoSV</category><category>Presentations</category><category>Pig</category></item><item><title>AWS re: Invent Startup Launch</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;a href="http://mortardata.com/team" target="_blank"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_blank"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;Here are the slides and video from our public launch at AWS re: Invent, Amazon Web Services’ first-ever user conference.&lt;/p&gt;
&lt;p&gt;We had a great time and met a lot of awesome people, including Donnie Berkholz from Redmonk, who shared his thoughts about Mortar on Twitter:&lt;/p&gt;
&lt;blockquote class="twitter-tweet"&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;div&gt;
&lt;p&gt;.@&lt;a href="https://twitter.com/mortardata" target="_blank"&gt;mortardata&lt;/a&gt; &amp;#8212; Rails approach to Big Data. Opinionated open-source framework + PaaS, Python/Java. One of the coolest things at &lt;a href="https://twitter.com/search/%23reinvent" target="_blank"&gt;#reinvent&lt;/a&gt;.&lt;/p&gt;
— Donnie Berkholz (@dberkholz) &lt;a href="https://twitter.com/dberkholz/status/274323112894533632" target="_blank"&gt;November 30, 2012&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;/blockquote&gt;
&lt;p&gt;Thanks to Donnie for the kind words and AWS for a great event.  Video and slides are below the cut:&lt;/p&gt;
&lt;p&gt;&lt;!-- more --&gt;&lt;/p&gt;
&lt;p&gt;&lt;br/&gt;       &lt;iframe frameborder="0" height="281" src="http://player.vimeo.com/video/57015411" width="500"&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;&lt;br/&gt;               &lt;iframe frameborder="0" height="356" marginheight="0" marginwidth="0" scrolling="no" src="http://www.slideshare.net/slideshow/embed_code/15590947?rel=0" width="427"&gt; &lt;/iframe&gt;&lt;/p&gt;
&lt;p&gt;
&lt;script type="text/javascript"&gt;// &lt;![CDATA[
!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0];if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src="//platform.twitter.com/widgets.js";fjs.parentNode.insertBefore(js,fjs);}}(document,"script","twitter-wjs");
// ]]&gt;&lt;/script&gt;&lt;/p&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/38258745936</link><guid>http://blog.mortardata.com/post/38258745936</guid><pubDate>Tue, 11 Dec 2012 18:21:00 -0500</pubDate><category>re: Invent</category><category>Presentations</category></item><item><title>Hadoop, Pig, and Python at PyData</title><description>&lt;div class="author"&gt;
&lt;h3&gt;&lt;a href="http://mortardata.com/team" target="_blank"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_blank"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p&gt;Our CEO, K Young, spoke at PyData NYC about using real Python with Pig, and why we integrated these two awesome languages.  The audience asked some great questions, many of which you can see at the end of the video.&lt;/p&gt;
&lt;p&gt;Here is the video (with slides just below):&lt;/p&gt;
&lt;p&gt;&lt;!-- more --&gt;&lt;/p&gt;
&lt;div class="copy"&gt;
&lt;p&gt;&lt;iframe frameborder="0" height="281" src="http://player.vimeo.com/video/53111093?badge=0" width="500"&gt;&lt;/iframe&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;div class="copy"&gt;&lt;iframe frameborder="0" height="356" marginheight="0" marginwidth="0" scrolling="no" src="http://www.slideshare.net/slideshow/embed_code/15689775?rel=0" width="427"&gt; &lt;/iframe&gt;
&lt;div&gt;&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;</description><link>http://blog.mortardata.com/post/38256250189</link><guid>http://blog.mortardata.com/post/38256250189</guid><pubDate>Thu, 06 Dec 2012 17:49:00 -0500</pubDate><category>Hadoop</category><category>Pig</category><category>Presentations</category><category>Python</category><category>PyData</category></item><item><title>MongoDB + Pig talk at MongoSV</title><description>&lt;p&gt;&lt;div class="author"&gt;
&lt;h3&gt;&lt;a href="http://mortardata.com/team" target="_blank"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_blank"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Mortar is pleased to sponsor &lt;a href="http://www.mongosv.com/?utm_source=companyblog&amp;amp;utm_medium=blog&amp;amp;utm_campaign=companyblogposts" target="_blank"&gt;&lt;span&gt;&lt;span class="il"&gt;MongoSV&lt;/span&gt;&lt;/span&gt;&lt;/a&gt; on December 4&lt;/span&gt;&lt;sup&gt;&lt;span&gt;th&lt;/span&gt;&lt;/sup&gt;&lt;span&gt;, an annual one-day conference in Silicon Valley, CA dedicated to the open source, non-relational database MongoDB.  &lt;/span&gt;&lt;/p&gt;
&lt;div class="im"&gt;&lt;/div&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Our very own lead engineer, Jeremy Karn, will deliver a talk entitled MongoDB + Pig, which will teach attendees how to process MongoDB data with Hadoop, specifically with Apache Pig.  As many of you know, we&amp;#8217;ve contributed to the MongoDB+Hadoop connector, extending it to work with Pig.  Jeremy&amp;#8217;s talk will cover the steps needed to read JSON from Mongo into Pig, process it in parallel on Hadoop with sophisticated functions, and write back to Mongo.&lt;/span&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;For more information about &lt;span class="il"&gt;MongoSV&lt;/span&gt;, check out the &lt;a href="http://www.10gen.com/events/mongosv#agenda" target="_blank"&gt;&lt;span&gt;agenda&lt;/span&gt;&lt;/a&gt; or 10gen’s blog post &lt;a href="http://blog.10gen.com/post/34700229730/get-ready-for-mongosv" target="_blank"&gt;&lt;span&gt;Get Ready for &lt;span class="il"&gt;MongoSV&lt;/span&gt;&lt;/span&gt;&lt;/a&gt;. And if you want to save a few bucks, enter the discount code “mortar20” and get 20% off.&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;&lt;/p&gt;</description><link>http://blog.mortardata.com/post/36955460440</link><guid>http://blog.mortardata.com/post/36955460440</guid><pubDate>Sat, 01 Dec 2012 10:49:00 -0500</pubDate></item><item><title>Announcing our public launch</title><description>&lt;p&gt;&lt;div class="author"&gt;
&lt;h3&gt;&lt;a href="http://mortardata.com/team" target="_blank"&gt;Scott Haylon&lt;/a&gt;&lt;/h3&gt;
&lt;a href="http://mortardata.com/team" target="_blank"&gt;&lt;img alt="image" src="http://mortardata.com/assets/team-blog/Scott_Haylon.png"/&gt;&lt;/a&gt;&lt;/div&gt;
&lt;div class="copy"&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Last week, &lt;/span&gt;&lt;a href="http://blog.mortardata.com/post/36081130124/we-raised-our-seed-round" target="_blank"&gt;&lt;span&gt;we announced our $1.8 million fundraising&lt;/span&gt;&lt;/a&gt;&lt;span&gt;. For those of you who follow big data startups, our blog post probably felt…underwhelming. Startups typically come out and make a huge publicity splash, jam-packed with buzzwords and vision galore. While we feel very fortunate to have what we need to help us grow, we know that VC funding is merely a means, and not an end.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;But now you get to see us get really excited, because Mortar’s Hadoop PaaS and open source framework for big data is now publicly available. This means if you want to try it, &lt;/span&gt;&lt;a href="http://mortardata.com/#%21/try_it" target="_blank"&gt;&lt;span&gt;you can activate your trial right now on our site&lt;/span&gt;&lt;/a&gt;&lt;span&gt; without having to talk to anyone (unless you want to!).&lt;!-- more --&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;You can get started on Mortar using Web Projects (using Mortar entirely online through the browser) or Git Projects (using Mortar locally on your own machine with the Mortar development framework). You can see more info about both &lt;/span&gt;&lt;a href="http://www.mortardata.com/#%21/in_action" target="_blank"&gt;&lt;span&gt;here&lt;/span&gt;&lt;/a&gt;&lt;span&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;All trial accounts come with our full Hadoop PaaS, unlimited use of the Mortar framework, our site, and dev tools, and 10 free Hadoop node-hours. (You can get another 15 free node-hours per month and additional support at no cost by simply adding your credit card to the account.)&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;strong&gt;&lt;span&gt;Where we started…&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Our team has been banging our heads against big data for a long time. We’ve felt the pain of cumbersome ETL systems and other tools, and we were so excited about the potential of Hadoop. However, when we started using it, we quickly realized it was really only accessible to large companies with deep pockets and big teams, and even then it took a VERY long time to get from “Hey, we want to use Hadoop!” to “Sweet! We’re using Hadoop!”&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Our first iteration of Mortar was a browser-based version that made it easy to write jobs on our Hadoop PaaS using Apache Pig and Python. It was a great way for us (and our users) to get started quickly, but we always had a bigger vision, which we’re just now bringing to the public.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;strong&gt;&lt;span&gt;…and what’s new.&lt;/span&gt;&lt;/strong&gt;&lt;span&gt;&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Mortar is an open source, code-based platform for big data.&lt;span&gt;  &lt;/span&gt;As a company, our Hadoop PaaS hosts and executes Mortar projects as a service. We’ve partnered with Amazon Web Services and built Mortar entirely on AWS, backed by their &lt;/span&gt;&lt;a href="http://aws.amazon.com/elasticmapreduce/" target="_blank"&gt;&lt;span&gt;Elastic MapReduce&lt;/span&gt;&lt;/a&gt;&lt;span&gt; offering.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;You use Mortar by writing a combination of Pig (which is like SQL) and real Python. Historically, Pig on Hadoop has only worked with Jython, but the data science community told us over and over how deeply they depended on libraries like NumPy, SciPy, and NLTK. They were dying to use Hadoop, but they were being forced to choose between one set of tools and the other. We wanted to fix this, so we made Hadoop and Python play nice together.&lt;/span&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;Our key focuses in building Mortar’s Hadoop PaaS are:&lt;/span&gt;&lt;/p&gt;
&lt;ul&gt;&lt;li&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;Ease of use – engineers and data scientists can get started quickly using the tools they love, without training&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;Collaboration – share/repeat/maintain your code using Git or other revision control&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;Open source – customers should never be locked in&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;Convention over configuration – do automated testing, find errors quickly&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;&lt;span&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;Removing all non-core elements of building data pipelines – you shouldn’t waste time worrying about infrastructure and operations&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;&lt;p class="MsoNormal"&gt;&lt;strong&gt;&lt;span&gt;Does it work for you?&lt;/span&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;p class="MsoNormal"&gt;&lt;span&gt;I could write at length about how Mortar works, but you’re better off &lt;/span&gt;&lt;a href="http://mortardata.com/#%21/in_action" target="_blank"&gt;&lt;span&gt;seeing it in action&lt;/span&gt;&lt;/a&gt;&lt;span&gt;. Of course, the easiest way to really know if any software works for you is to try it for yourself. So if you want to do ETL, natural language processing, aggregations, machine learning, regression analysis, or some other big data analysis, you can &lt;/span&gt;&lt;a href="http://mortardata.com/#%21/try_it" target="_blank"&gt;&lt;span&gt;try Mortar free&lt;/span&gt;&lt;/a&gt;&lt;span&gt;.&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;&lt;/p&gt;</description><link>http://blog.mortardata.com/post/36733594494</link><guid>http://blog.mortardata.com/post/36733594494</guid><pubDate>Wed, 28 Nov 2012 07:00:00 -0500</pubDate><category>Mortar</category></item></channel></rss>
