As data geeks, we get some good-n-geeky data questions.  A common one is, “What criteria should I use to decide which data to store and which to discard?”

Which is great — I love questions with straightforward answers.  Answer: If you think there’s even a slight possibility it could be valuable at some point, store it.  For most people that means STORE ALL OF YOUR DATA.

But isn’t it a waste of money to save all of that data if you’re not sure whether you’ll use it?  No — it’s probably not.

Innovation and aggressive competition are driving a rapid decline in storage costs, and new technologies are making data analysis far more accessible.  As a result, the potential opportunity cost of discarding data far outweighs what you’d spend on cloud storage to keep it.

For example, let’s take a look at a particularly popular dataset: the Twitter Firehose.  [For those who aren’t familiar, the Twitter Firehose is the real-time stream of every public tweet on Twitter.]

