As data geeks, we get some good-n-geeky data questions.  A common one is, “What criteria should I use to decide which data to store and which to discard?”

Which is great — I love questions with straightforward answers.  Answer: If you think there’s even a slight possibility it could be valuable at some point, store it.  For most people that means STORE ALL OF YOUR DATA.

But isn’t it a waste of money to save all of that data if you’re not sure whether you’ll use it?  No — it’s probably not.

Innovation and aggressive competition are driving a rapid decline in storage costs, and new technologies are making data analysis very accessible.  As a result, the potential opportunity costs far outweigh what you’d spend on cloud storage.

For example, let’s take a look at a particularly popular dataset: the Twitter Firehose.  [For those who aren’t familiar, the Twitter Firehose is a stream of all tweets on Twitter.]

If we wanted to store the entire history of Tweets on Twitter, it would cost less than $6K per year.  That’s right: you can store every tweet since the dawn of man for about what you’d pay for a couple of nicer MacBook Pros.

Chances are, you don’t have as much data as Twitter, and I’m also willing to bet that the data you do have is worth much more to your business than $6K/year.

Now, check out this sweet tweet math:

—————
MATH
Here’s a graph showing Tweets per day (from blog.twitter.com and Wikipedia):

[Image: graph of tweets per day]

How many cumulative tweets is that?  Let’s calculate the area under the graph by decomposing each time segment into two sections, as below.

  • Area A = Days in segment * “initial tweets per day”
  • Area B = Days in segment * (“final tweets per day” - “initial tweets per day”) * 0.5
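If you'd rather let a script do the adding, here is a minimal Python sketch of the same Area A + Area B calculation. The function names and the sample data points are made up for illustration; they are not the real tweets-per-day figures from the graph, so swap in your own readings.

```python
from datetime import date


def segment_tweets(days, initial_per_day, final_per_day):
    """Tweets in one time segment, assuming the daily rate changes linearly."""
    area_a = days * initial_per_day                            # rectangle
    area_b = days * (final_per_day - initial_per_day) * 0.5    # triangle on top
    return area_a + area_b


def cumulative_tweets(points):
    """Sum every segment between consecutive (date, tweets_per_day) points."""
    total = 0.0
    for (d0, r0), (d1, r1) in zip(points, points[1:]):
        total += segment_tweets((d1 - d0).days, r0, r1)
    return total


# Hypothetical placeholder readings, NOT actual Twitter numbers.
example_points = [
    (date(2020, 1, 1), 0),
    (date(2020, 1, 11), 100),
    (date(2020, 1, 21), 300),
]

print(cumulative_tweets(example_points))
```

Area A plus Area B is just the trapezoid rule, so any set of (date, tweets per day) readings sorted by date will work here.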

[Images: area decomposition of the tweets-per-day graph]

So there have been about 170B tweets tweeted forth into the world. The average tweet is about 0.3 KB compressed, so:

170 billion tweets * 0.3 KB/tweet = 51,000,000,000 KB ~= 49,000 GB
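Here is the same conversion as a quick Python sketch, in case you want to plug in your own numbers. It assumes binary units (1 GB = 1024 * 1024 KB), which is what gets you from 51 billion KB to roughly 49,000 GB; the variable names are just illustrative.

```python
TWEETS = 170_000_000_000   # about 170 billion tweets, from the estimate above
KB_PER_TWEET = 0.3         # average compressed tweet size
KB_PER_GB = 1024 ** 2      # assumption: binary units (1 GB = 1024 MB = 1024^2 KB)

total_kb = TWEETS * KB_PER_TWEET
total_gb = total_kb / KB_PER_GB

print(f"{total_kb:,.0f} KB is about {total_gb:,.0f} GB")
```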

Amazon provides two long-term durable storage options: S3 and the newly announced Glacier.  Here are their associated costs:

  • S3 = $0.11 per GB/mo ($0.083 per GB/mo for reduced redundancy storage)
  • Glacier = $0.01 per GB/mo

So here’s the ongoing annual cost of storing every tweet ever:

  • S3: 49,000 GB  *  $0.11/mo * 12 months/yr  =  $64,680 ($48,804/yr with reduced redundancy storage)
  • Glacier: 49,000 GB  *  $0.01/mo * 12 months/yr  =  $5,880
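To rerun this for your own dataset, here is a small Python sketch of the same arithmetic. The prices are the per-GB monthly rates quoted in this post and will go stale, so treat them as placeholders; the function and dictionary names are made up for illustration.

```python
# Monthly price per GB as quoted in this post; check current pricing before relying on it.
PRICE_PER_GB_MONTH = {
    "S3": 0.11,
    "S3 reduced redundancy": 0.083,
    "Glacier": 0.01,
}


def annual_cost(gigabytes, price_per_gb_month):
    """Ongoing yearly storage cost in dollars for a fixed amount of data."""
    return gigabytes * price_per_gb_month * 12


TWEET_ARCHIVE_GB = 49_000  # estimated size of every tweet ever

for option, price in PRICE_PER_GB_MONTH.items():
    print(f"{option}: ${annual_cost(TWEET_ARCHIVE_GB, price):,.0f} per year")
```

This only covers storage. Request and retrieval fees are left out, just as they are in the numbers above.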

If you’re not sure whether you need your data at all, I assume you’ll be OK with waiting a few hours to get it out of Glacier so you can pay roughly a tenth of what Amazon charges for S3 storage.

Next time you are considering throwing out potentially usable data to save money, weigh it carefully.  How much are you actually saving, and how much do you stand to lose if you want that data in the future?  If you have an example of data that isn’t worth its Glacier storage costs, I’d love to hear about it!  Write me at my first name last name @ mortardata.com.

- K Young

[BTW - If you’ve ever wanted your own Twitter dataset to experiment with, check out our newly open-sourced Twitter Gardenhose.]
