Wrangling the Twitter Firehose at 4,629 Tweets-per-second

18th June 2012

With the recent news that our friends at Salesforce.com have signed an agreement with Twitter to access the full firehose of public Tweets, we thought we’d give an insight into what it’s like being on the receiving end of this torrent of data. And of course, we’re not just doing this for Tweets; we do the same processing for public Facebook posts and for content from nearly a million blogs, forums and news sites.

But first, a refresher on the astronomical scale of Twitter:

Just over six years ago on March 21, 2006, Jack Dorsey broadcast the first public Tweet to the world.

Three years and two months later, Twitter reached the milestone of its first billion Tweets.

Now, a billion Tweets are sent every two and a half days. That’s 400 million Tweets every day… 4,629 Tweets every second. In the time you have taken to read this blog post so far, more than 50,000 Tweets have been sent!
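If you want to check that arithmetic yourself, the headline rate falls straight out of the "billion Tweets every two and a half days" figure. Here's the back-of-the-envelope version in a few lines of Python:

```python
# Back-of-the-envelope check of the rates quoted above.
tweets = 1000000000           # a billion Tweets...
days = 2.5                    # ...every two and a half days

per_day = tweets / days
per_second = per_day / (24 * 60 * 60)

print(int(per_day))       # 400000000 Tweets per day
print(int(per_second))    # 4629 Tweets per second
```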

At the DataSift data centre, this equates to an additional two Terabytes of data stored every day. But storing it isn’t all we do.

From unstructured data to insightful analysis

As every single Tweet comes to us, our Big Data platform gets to work structuring and analyzing it, then passes it on to the customers running filters on our platform.

Firstly, we structure the data we’ve received into 77 different fields – including the user’s name, ID, description, location, time zone, followers count, following count, etc.
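To give a feel for what that structuring step produces, here's a simplified Python sketch. The field names below are illustrative only, not our exact 77-field schema:

```python
# Illustrative only: pulling a handful of fields out of a raw Tweet payload.
# The real structured record has 77 fields; these names are simplified examples.
def structure_tweet(raw):
    """Flatten a raw Tweet (a parsed JSON dict) into a field/value record."""
    user = raw.get("user", {})
    return {
        "twitter.text":                 raw.get("text"),
        "twitter.created_at":           raw.get("created_at"),
        "twitter.user.name":            user.get("name"),
        "twitter.user.id":              user.get("id"),
        "twitter.user.description":     user.get("description"),
        "twitter.user.location":        user.get("location"),
        "twitter.user.time_zone":       user.get("time_zone"),
        "twitter.user.followers_count": user.get("followers_count"),
        "twitter.user.friends_count":   user.get("friends_count"),
    }
```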

If the Tweet contains a link (about 20% of them do), we’ll go and grab the full URL of the site (which means unpacking every tiny link from t.co, bit.ly and other shorteners) as well as the title of the page.
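Link unwrapping is conceptually simple: follow the shortener's redirect chain to the final URL, then read the page title. Here's a minimal sketch using the third-party requests library, with none of the error handling, caching or politeness a production crawler needs:

```python
import re
import requests  # third-party HTTP library

def expand_link(short_url):
    """Follow redirects from a shortened link; return the final URL and page title."""
    # allow_redirects=True walks the whole redirect chain; resp.url is the final URL.
    resp = requests.get(short_url, allow_redirects=True, timeout=5)
    match = re.search(r"<title[^>]*>(.*?)</title>", resp.text,
                      re.IGNORECASE | re.DOTALL)
    title = match.group(1).strip() if match else None
    return resp.url, title
```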

We then analyze the data further to add up to an additional 34 fields per Tweet including:

Klout – Both the user’s Klout score and the topics they are influential in are detected.

Twitter Trends – DataSift keeps an eye on real-time trending topics. When a Tweet matches one of these trends, it is tagged with the appropriate trending topic.

Sentiment – Through advanced text analysis software from our partners at Lexalytics, DataSift is able to detect positive and negative sentiment in Tweets. In addition, salient topics are identified and scored.

Topic Analysis – We’ll classify the Tweet into a high-level category that relates to the content – for example: fashion, technology, or finance.

Entity Extraction – We use Natural Language Processing technologies to identify each of the people, places, products or companies mentioned, along with the sentiment toward each.

Demographic – The user is tagged as male or female (or unknown).
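Put together, an augmented Tweet ends up looking something like the record below. This is purely illustrative; the field names and values are made up to show the shape of the data, not our exact output format:

```python
# Purely illustrative: the shape of a Tweet record after augmentation.
augmented = {
    # ...the structured Twitter fields from the step above, plus:
    "klout.score": 54,                      # the author's Klout score
    "klout.topics": ["skiing", "travel"],   # topics they are influential in
    "trends.topics": ["#Euro2012"],         # real-time trends the Tweet matched
    "salience.content.sentiment": 4,        # positive/negative score (via Lexalytics)
    "language.category": "sport",           # high-level topic classification
    "entities": [                           # people, places, products, companies...
        {"name": "Whistler", "type": "place", "sentiment": 3},
    ],
    "demographic.gender": "female",         # male / female / unknown
}
```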

All this means that when our customers filter from the Firehose, they’re not just receiving raw data; they’re getting insights to help them analyze trends and patterns in the data.

And what can you do with all this Twitter data?

The possibilities really are almost endless; we’re amazed by the applications people are creating using social data, from social monitoring applications to medical applications… even tracking disease outbreaks!

In building custom streams, you could search for Japanese tourists who are in Canada to ski, by filtering for user location, tweet location and relevant keywords. You could find out the most popular movie premieres in Los Angeles, New York, and London by filtering for movie titles, locations, and tagging positive and negative sentiment. You could also track the social reach of a Wall Street Journal story by filtering for the original link and tracking the number of retweets. And by combining social data with other sources, you can even use Twitter as a stock trading tool.
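To make the first of those examples concrete, here's a toy Python filter over a stream of the augmented records sketched above. The field names (including "twitter.place.country") are the same illustrative ones used earlier, not our production filtering language:

```python
# Toy filter: Japanese users Tweeting about skiing while in Canada.
# "twitter.place.country" is an assumed, illustrative field name for the
# Tweet's geo place, not a documented target.
SKI_KEYWORDS = ("ski", "skiing", "snowboard")

def is_japanese_skier_in_canada(record):
    user_location = (record.get("twitter.user.location") or "").lower()
    tweet_country = (record.get("twitter.place.country") or "").lower()
    text = (record.get("twitter.text") or "").lower()
    return ("japan" in user_location
            and "canada" in tweet_country
            and any(word in text for word in SKI_KEYWORDS))

def filter_stream(records):
    """Yield only the records that match the filter."""
    for record in records:
        if is_japanese_skier_in_canada(record):
            yield record
```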

Over the next few weeks we’ll be posting more about what you can do with social data and how the social innovators are leading their industries. To make sure you don’t miss out, subscribe to the blog via RSS or email.

If you’d like to be featured in an upcoming blog, please do let us know!

Written by Andi Caruso

Andi Caruso is our Marketing Intern. Connect with Andi on Google+