Now that 2013 is coming to a close, we’ve been doing a lot of reflecting. It has been an awesome year at Mortar, and we’ve truly enjoyed trying to bring you the very best Hadoop, Pig, and data science content.
We know it’s tough to keep up with every blog post we write (not to mention the dozens of other blog posts you’ve still got saved for later), so as thanks for keeping tabs on us, we wanted to share our most popular posts from 2013. If you’ve read all of these already, well, I guess we’ll have to get you something even nicer in the New Year.
Since we do a lot of experimenting with data, we’re always excited to find new datasets to use with Mortar. We’re saving bookmarks and sharing datasets with our team on a near-daily basis.
There are tons of resources throughout the web, but given our love for the data scientist community, we thought we’d pick out a few of the best dataset lists curated by data scientists.
Below is a collection of six great dataset lists, curated by well-known data scientists and lesser-known ones alike:
We’re big fans of GitHub. There are a lot of things to like about the company and the fantastic service they’ve built. However, one of the things we’ve come to admire most about GitHub is their pricing model.
If you’re giving back to the community by making your work public, you can use GitHub for free. It’s a great approach that drives tremendous benefits to the GitHub community.
Starting today, Mortar is following GitHub’s lead in supporting those who contribute to the data science community.
Soren Macbeth, Chief Scientist/Data Hacker at Yieldbot and co-founder of StockTwits, joined us for the NYC Data Science Meetup last month. Unfortunately, we had a projector issue that prevented us from showing slides, but Soren rolled with it and did the entire talk from memory.
In the video, Soren discusses how Yieldbot operationalized their data science efforts using tools like ElephantDB and Storm. He also touches on his personal philosophy when it comes to research versus production work (and how Clojure fits into that) and some of the challenges Yieldbot has run into when operationalizing machine learning systems.
Here’s a quick summary of the video:
Our CEO, K Young, was interviewed following his talk on MongoDB and Hadoop at MongoNYC. Here’s a quick rundown of the highlights:
- The difference between Mortar and the raw infrastructure of Amazon Elastic MapReduce (4:09)
- Why Mortar cares about collaborative and repeatable data science (5:49)
- Using Hadoop with MongoDB (11:24) [We’ve also written and spoken about MongoDB and Hadoop in the past few months.]
- Making the business case for Hadoop (15:42)
Here’s the embedded video:
If you want Hilary Mason, Drew Conway, Max Shron, or Eric Colson to build your recommender for free, enter your email address here.
As a platform for working with data, we’ve seen users tackle lots of interesting use-cases: log analysis, natural language processing, pattern detection, and many more.
However, perhaps no use-case is in greater demand than recommender systems. If you have more “inventory” than your users can easily find (whether it’s news, jobs, videos, restaurants, vacations, recipes, apps, etc.), a great recommender is crucial to driving engagement.
The problem is that recommender systems are really hard to implement, so most companies either don’t have one or aren’t happy with what they have.
What makes recommenders so tough?
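To make the challenge concrete, here’s a toy item-to-item collaborative filter in Python. (This is purely illustrative: the data, function names, and scoring are ours, not Mortar’s implementation.) It works fine for three users, but notice that scoring requires similarity computations across items, and real rating matrices are enormous and overwhelmingly sparse, which is exactly where tools like Hadoop and Pig earn their keep.

```python
# Toy item-to-item recommender: cosine similarity over co-ratings.
# Illustrative only -- not how any production recommender is built.
from collections import defaultdict
from math import sqrt

# user -> {item: rating}
ratings = {
    "alice": {"matrix": 5, "inception": 4, "up": 1},
    "bob":   {"matrix": 4, "inception": 5},
    "carol": {"up": 5, "inception": 2},
}

def item_vectors(ratings):
    """Invert user->item ratings into item -> {user: rating}."""
    items = defaultdict(dict)
    for user, prefs in ratings.items():
        for item, r in prefs.items():
            items[item][user] = r
    return items

def cosine(a, b):
    """Cosine similarity between two sparse {user: rating} vectors."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[u] * b[u] for u in common)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm

def recommend(user, ratings, top_n=1):
    """Score items the user hasn't rated by similarity to items they have."""
    items = item_vectors(ratings)
    seen = ratings[user]
    scores = {}
    for candidate in items:
        if candidate in seen:
            continue
        scores[candidate] = sum(
            cosine(items[candidate], items[liked]) * r
            for liked, r in seen.items()
        )
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("bob", ratings))  # -> ['up']
```

Even this sketch hints at the real pain points: sparse data (most user/item pairs have no rating), a similarity step that grows quadratically with catalog size, and no notion yet of evaluation, freshness, or cold-start users.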
Our second NYC Data Science Meetup featured Tumblr data scientist Adam Laiacano, who discussed the analytics stack at Tumblr and the tools he and his team use to organize and analyze data.
Here are the video and slides from Adam’s talk, which cover Tumblr’s use of Scribe, Hive & Pig, Hue, and Vowpal Wabbit:
Thanks to everyone who came out to our inaugural NYC Data Science Meetup. For those who couldn’t attend, Hilary Mason fought off jetlag and a tough cold to give a great presentation.
Below is a 12-minute clip from Hilary’s talk, which she called “Dirty Secrets of Data Science.”
New York’s data science community has been building since long before “data science” was used to describe it. In addition to a long history of advertising and adtech companies, the recent startup explosion here in NYC has been largely led by companies built to leverage data science (including Foursquare, Tumblr, AppNexus, and Knewton, to name just a few).
"Big data" entered our language before anyone knew what it meant. So then we spent a lot of time discussing it: "Is it really about the ‘bigness’?”, “Isn’t it about non-relational data?”, “No wait, it’s about the need for speed." This got boiled down to the three Vs (volume, variety, velocity), but then “big data” just meant three things, which didn’t clarify much at all.
So we, the tech community, are developing new vocabulary and distinctions, and in 2013, no one is going to say “big data” anymore. (Actually, given that Dilbert already skewered big data, its heyday may already be over.)
This is the life-cycle of any good buzzword. A buzzword is born when something so new and important is happening that we need to talk about it before we understand it; while it is still amorphous. It refers to a family of related concepts. Then we develop greater understanding and distinctions, and pretty soon you’re embarrassed for your colleague when he trots out last year’s buzzword (remember Web 2.0?).
So what is the crux of “big data”? Why is it so new and important that we have to talk about it with a buzzword? In short, we’re all freaking out because old bottlenecks recently got shattered, the new bottlenecks are us and our existing tools, and mad riches are visible just over the horizon. (And it’s not just about riches: there’s also massive potential for human improvement.)