Many data scientists are working with data gathered from human beings using web applications. If they’re lucky, that data was gathered intentionally and is relatively clean.
Fortunately for most of these data scientists, human behavior is fairly consistent. They also tend to share similar goals for that data, which isn’t a surprise given where it comes from. For example, they might be trying to figure out how to encourage more people to subscribe to a service, to finish the checkout process, or to buy more items on an e-commerce site. For the most part, we use data gathered from behavior in applications to make those applications better.
For a long time, data scientists and engineers had to choose between leveraging the power of Hadoop and using Python’s amazing data science libraries (like NLTK, NumPy, and SciPy). It was a painful choice, and one we thought should be eliminated.
So about a year ago, we solved this problem by extending Pig to work with CPython, allowing our users to take advantage of Hadoop with real Python (see our presentation here). To say Mortar users have loved that combination would be an understatement.
However, only Mortar users could use Pig and real Python together…until now.
Did you always want your own Twitter dataset to work with? Well, you can have one for free—our open source Twitter Gardenhose.
If you want to take advantage of the Twitter Gardenhose, you have two options:
- Read it directly from our S3 bucket
- Store it in your own S3 bucket: The README describes how to deploy on Heroku, which should take about 30 minutes to set up and get running. It’s a surprisingly simple Node.js app.