DataSift adds $7.2M: The story so far and focus for the future

2nd May 2012 | 7 Comments

by Rob Bailey, CEO

The last few months have been an amazing journey for us. It was only back in November last year that we launched DataSift. That seems like a lifetime ago when we look back at what we’ve achieved since then…

  • with the help and support of Twitter we’ve launched our Historics platform to unlock insights from 2+ years of Tweets
  • we’ve on-boarded 20+ new data sources (including YouTube, Blogs, Forums and NewsCred-powered News)
  • we’ve added around 200 new customers, from entrepreneurs building socially-intelligent applications to large enterprises that recognize that social is becoming table stakes for business.

On top of that, as passionate Big Data junkies, we held 50 events across the world last week as part of www.bigdataweek.com, including a live Twitter Q&A with Todd Park, the CTO of the United States. Huge kudos go to our whole team for getting us where we are today, but especially to our founder, Nick Halstead, who has the vision and passion that keep us all moving.

Maybe the biggest surprise to me is the breadth of use-cases that we’re seeing companies use our Social Data platform for. Social Media Monitoring and “Breaking News” are the obvious applications that companies build. But we’re seeing everything from Business Intelligence and Stock-Trading models to Public-health applications and social-TV guides being built with DataSift. A big part of Nick’s vision has always been to democratize the social data market, opening it to both small and large companies. We are starting to see this play out.

But we’ve still got a lot of work ahead of us. Our combo of Social Data + Big Data + Real-Time has the opportunity to transform the industry. With Lean Start-up as our guide, we’ve tested and seen the demand. Now we need to grow faster so that we can keep up with that demand.

That’s why we’re delighted to announce that our existing investors have invested an additional $7.2M in DataSift to accelerate this growth. And I think that’s the real story behind our story: having awesome investors that share our vision and are eager to double down with us on this market. We’re lucky to have Roger Ehrenberg (IA Ventures) and Mark Suster (GRP Partners) as part of our team. They’ve been instrumental in helping us get to where we are today.

So what’s next? We don’t want to spoil the news, but over the next few months you’ll see new services and applications from DataSift as we continue to build out our social-data platform. One of the biggest reasons that people choose DataSift is not just that we can filter and deliver raw social data at massive scale, but that we can structure it into a “ready-to-analyze” format with sentiment, Klout Topics, Natural Language Topic analysis, etc. for companies to consume. We’re unique in providing this in the market. What we’re working on next will take these capabilities further to give our customers a way not just to monitor social interactions, but to measure them.

And if you want to help shape the future of social and big-data, we’re hiring!

Building the Future

4th August 2011 | 1 Comment

I wanted to share something about how we view the future of the data ecosystem. We believe building an ecosystem of applications on top of our platform will radically change the data market and create a marketplace for businesses and developers alike.

We have spent four years learning about scale and what it costs companies to build the kind of infrastructure required to deal with the data volumes involved with something like the Twitter Firehose. Not everyone may know this, but we also run a little website called TweetMeme – it deals with 500 million API requests and consumes 6Tb of bandwidth every day. With DataSift we have built something that is truly awe-inspiring in power and flexibility. We think in scales of millions of simultaneous streams, not hundreds. We think of data processing that involves millions of complex decisions per stream, not just a few simple keywords. The future of data processing needs the power of the cloud and we ARE the cloud.

For those interested, this is a diagram of our platform at a very high level: [diagram not reproduced]

We have also spent the last six months testing our platform with corporates: Fortune 500, Retail and Media, Financial Services, Travel and Education. We have shown that we have a scalable, flexible platform that meets whatever demands they have. We are new to the market, but we know that when they take the Pepsi challenge we always win.

It is easy to misunderstand our focus on developers, but the reality is that behind every corporate is a team of developers. And we believe developers can change the world – give them the tools and they will build the future. Data aggregation and licensing is a huge technical challenge, and thousands of companies waste massive resources re-inventing the wheel to build their own. Traditional models dictate that the barrier to entry is way too high for most companies. We break the mold, giving on-demand access to a single tweet or a billion.

What makes us different?

  • Track Tweets from every person who follows Lady Gaga and also follows Barack Obama
  • Track 100,000+ Geo Locations simultaneously
  • Gender Detection, Political, Interest and Authority Segmentation
  • Pattern matching (via regular expressions) – like looking for every ISBN mentioned on Twitter
  • Real-time Sentiment Analysis and Natural Language processing (entity extraction)
  • Detect over 30 languages (in real-time)
  • Record every Tweet into what we call ‘BigStore’ for later retrieval or Map-Reduce
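
To make the pattern-matching bullet above a little more concrete, here is a minimal Python sketch of the kind of regular expression that could pick ISBN-13 numbers out of free text. It is not CSDL and not DataSift's implementation; the pattern, the function names and the decision to validate the check digit are all assumptions made for illustration.

```python
import re

# Hypothetical pattern: 13 digits starting with 978 or 979,
# with optional hyphens or spaces between the digits.
ISBN13_RE = re.compile(r"\b97[89](?:[- ]?\d){10}\b")


def isbn13_is_valid(digits: str) -> bool:
    """Check the ISBN-13 check digit (weights alternate 1, 3, 1, 3, ...)."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def find_isbns(text: str) -> list:
    """Return every valid ISBN-13 found in the text, with separators removed."""
    found = []
    for match in ISBN13_RE.finditer(text):
        digits = re.sub(r"[- ]", "", match.group())
        if isbn13_is_valid(digits):
            found.append(digits)
    return found
```

Calling `find_isbns` on a tweet-sized string returns the normalised ISBNs it contains; strings that merely look like an ISBN but fail the check digit are dropped.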

We look forward to inviting you all into DataSift very soon and building the future.

Visualising the DataSift Augmented Twitter Firehose

11th July 2011 | 0 Comments

Taking inspiration from DataSift Invaders and some other projects I’ve seen that take technology out of the ether and bring it into the real world (e.g. the light painting WiFi project), I decided to take the DataSift augmented Twitter Firehose and experiment with making it more ‘tangible’.

After digging around in my experiments/toy box I found the following bits:

  • A TS-7553 embedded ARM SBC (single board computer)
  • A Velleman K8055 USB interface board
  • Some relays and a cold cathode.

The idea was to connect to the DataSift stream and illuminate a series of LEDs based on a user’s Klout score. If their score is high enough, all LEDs and the cold cathode will light up.
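
The mapping from score to lights could be sketched roughly as follows. This is only an illustration in Python: the number of LEDs, the 0 to 100 score range and the threshold at which the cold cathode switches on are assumptions, and the code deliberately stops short of talking to the K8055 board.

```python
NUM_LEDS = 8            # assumption: one LED per digital output
CATHODE_THRESHOLD = 90  # assumption: score at which everything lights up


def leds_for_score(score: float) -> tuple:
    """Return (number of LEDs to light, whether the cold cathode is on)."""
    clamped = max(0.0, min(100.0, score))
    if clamped >= CATHODE_THRESHOLD:
        # High enough: all LEDs plus the cold cathode.
        return NUM_LEDS, True
    # Otherwise light a share of the LEDs proportional to the score.
    lit = int(clamped / CATHODE_THRESHOLD * NUM_LEDS)
    return lit, False
```

Each interaction received from the stream would then call `leds_for_score` with the author's score and switch the corresponding outputs.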

Hooking it all together and writing an interface to DataSift took the better part of a morning, but as a proof of concept I think it works quite well.

I’m looking to make something a bit more polished based on the idea and will update as I go @NetworkString

DataSift Space Invaders

What happens when you use the Twitter Firehose to power an arcade game? Well, you’d end up with DataSift Invaders! It’s a straightforward, simple-looking game that uses various parts of the DataSift platform to create something that would be incredibly difficult to build on any other platform.

So what parts of the platform are used?

  • Sentiment analysis – The Twitter Firehose is sampled and the sentiment of each tweet (on a scale of -20 to 20, the negative number meaning a negative sentiment) is used to move the DataSift ship.
  • Social authority – If any user with a Klout or PeerIndex score above 75 tweets, the DataSift ship will fire a missile
  • CSDL – Using various CSDL filters we can find in real-time when a user mentions @DSInvader or #DSInvader, run the sentiment analysis and check the social authority.
  • Tagging – Using a single stream we can collect all of this information by using tags for “position”, “shoot” and “user”
  • Streaming – The streaming API is used to get the updates in real-time
  • NodeJS Consumer – One of our open sourced tools is a DataSift NodeJS Consumer. The whole application is run in NodeJS and therefore using the consumer worked really well.
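
The two game rules described in the list above (sentiment moves the ship, high social authority fires a missile) might look something like this in outline. The game itself runs in NodeJS; this is a Python sketch for illustration only, and the screen width and the linear mapping from sentiment to position are assumptions rather than details of the real game.

```python
SCREEN_WIDTH = 640        # assumption: playfield width in pixels
AUTHORITY_THRESHOLD = 75  # from the post: scores above 75 fire a missile


def ship_x(sentiment: int) -> int:
    """Map a sentiment score in [-20, 20] onto a horizontal pixel position."""
    clamped = max(-20, min(20, sentiment))
    return (clamped + 20) * (SCREEN_WIDTH - 1) // 40


def should_fire(klout: float, peerindex: float) -> bool:
    """Fire when either authority score is above the threshold."""
    return klout > AUTHORITY_THRESHOLD or peerindex > AUTHORITY_THRESHOLD
```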

How do you play?

  1. Send a tweet mentioning @DSInvader
  2. Wait for barely a second and you will appear as a target for the DataSift ship to shoot
  3. You will receive a tweet to let you know if you got past the ship without being hit, or if you were hit by a missile.

HINT: We also run sentiment analysis on all tweets containing @DSInvader, so the more positive you are the faster your avatar will move around the screen, increasing your chances of success!

There will be a more detailed engineering post and we will also be open sourcing the code. So what are you waiting for? Have a go!

 

DataSift Status Dashboard

30th June 2011 | 0 Comments

Now you can find the latest status updates about DataSift in our Status Dashboard.

Check this section any time to get current status information, or subscribe to an RSS feed to be notified of interruptions to each individual service.

 

Banned Users

22nd June 2011 | 1 Comment

At present we are switching over our systems to our automated licensing engine. For a short period users who do not have an approved license will be banned if they try to access Twitter data. Please contact support@mediasift.com and we will un-ban you if you experience this problem.

To ensure that you do not get banned again, please add the following CSDL code to your Streams, so that you do not access Twitter data:

( your code here )
AND
interaction.type != "twitter"

Alternatively if you would like a license for Twitter data, please sign the license electronically at the “My Licenses” section of your User Dashboard. Then a member of our Sales Team will contact you, as you will need to complete a Beta Commercial Contract. This is a chargeable service.

Twitter Photo Entities

1st June 2011 | 0 Comments

Today Twitter announced that it will soon be releasing a new integrated photo sharing service. Because this service is part of Twitter, a new set of meta-data will become available for filtering in DataSift. Twitter have already documented the new entities and we will be adding these to our targets as soon as they become available in the Twitter Firehose.

We will be looking at what filtering potential these new entities give our customers but we are excited by the new offering and what it means for the future of media filtering.

Trends

19th May 2011 | 0 Comments

Trends are a built-in feature of DataSift. Trends aggregate the trending topics from external sources. At the moment we aggregate data only from Twitter, but we will try to add more sources in future. DataSift collects data about trending topics and then checks each tweet against them. DataSift also checks if a trend is part of other trending topics.

Trends return the following information:

  • Content – The keyword or phrase of the trending topic
  • Type – The type of the trend (daily, weekly or a location)
  • Source – The source where the topic was aggregated from (for example: twitter).

 

The advantage of using DataSift Trends is that it allows users to look at what is trending on Twitter from other sources.

Twitter Partnership

4th April 2011 | 1 Comment

Mediasift and Twitter have partnered to make Twitter data commercially available through DataSift. This means that developers interested in building tools to monitor and analyze Twitter data can now filter Tweets from the full Twitter Firehose, using DataSift for non-display analysis.

Companies and marketers are demanding better ways to listen to their customers on Twitter, and understand conversations about their brands and products. With so much news and data being created across multiple social networks, businesses need a way to cost-effectively and efficiently filter it all down to find the information that is valuable and meaningful to them. DataSift lets companies use extensive search queries to access only the data they value.

As a company we have been very fortunate to have access to the Twitter Firehose for quite some time. This has enabled us over the past two years to refine our thinking, leading to the incarnation of DataSift.

DataSift aggregates multiple social media feeds, augmenting additional data sets, and creating a common abstraction layer (through CSDL), to provide meaningful insight into unstructured data chaos! But we wanted to do more. We wanted to revolutionise the economics. So we built a Pay Per Use subscription model.

It has taken us nearly 18 months to complete the platform. The experience of TweetMeme and serving literally billions of requests a month means that our platform is truly scalable. By leveraging our cloud-based platform, you only pay for what you process, consume, store and analyse. This is the heavy lifting done.

Now, we are on the verge of opening the platform to consumers, businesses and most importantly developers to expand the infinite possibilities for end use applications. Since the Alpha, we have been bombarded with commercial opportunities in every sector.

We are hugely excited to continue our great relationship with Twitter, and look forward to what our community will build using the power of DataSift.

Updates to the CSDL

11th March 2011 | 2 Comments

We have been working hard over the last few weeks to improve our filtering engine, in both efficiency and with new features.

Firstly, let’s cover the differences between the last iteration of the CSDL and this new version.

Changes to CONTAINS operator

We have a new, more efficient way of searching for keywords and/or phrases. It replaces the old implementations of the CONTAINS, CONTAINS_WORD & CONTAINS_PHRASE operators and merges them into a single CONTAINS operator (CONTAINS_WORD & CONTAINS_PHRASE are retained for backwards compatibility). This change should not affect the behaviour of most rules, and in some cases it will improve the results, as CONTAINS now matches whole words and not subsections of words (i.e. the Scunthorpe problem). If, however, you were using CONTAINS to explicitly search for subsections of words, we now provide the SUBSTR operator, which retains the old behaviour of CONTAINS.
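
The difference between whole-word and substring matching can be illustrated with a small Python sketch. It only mimics the behaviour described above and is not how the filtering engine is implemented; the function names and the case-insensitive comparison are assumptions.

```python
import re


def contains_word(text: str, term: str) -> bool:
    """Whole-word (or whole-phrase) match, like the new CONTAINS."""
    # The term must not be directly preceded or followed by a word character.
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def substr(text: str, term: str) -> bool:
    """Plain substring match, like SUBSTR (the old CONTAINS behaviour)."""
    return term.lower() in text.lower()
```

With these helpers, the term `art` is found in "I love art galleries" by both functions, but in "Time to party" only by `substr`.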

Changes to text operator arguments

The other change to the existing CSDL is how we handle text arguments (i.e. “quoted” text) that are used for the operators that work on text fields. We have added escape sequences to obtain certain characters (, \ : " <newline> <carriage_return> <tab>) within an operator’s argument. This means that the CSDL compiler will no longer accept a single \. To use \ in your search, it will need to be escaped like this: \\
For most of the text-based operators, this change will not affect any existing rules. However, for the Regular Expression based operators REGEX_PARTIAL & REGEX_EXACT there is a high probability that changes will need to be made to their arguments, due to the increased likelihood of the \ character being present.
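
For anyone generating rules programmatically, the backslash change could be handled with a tiny helper like the Python sketch below. The helper names are hypothetical, and it only doubles backslashes, which is the one escape sequence spelled out above; the other special characters (comma, colon, quote and so on) are not handled because their mappings are not listed here.

```python
def escape_backslashes(argument: str) -> str:
    """Double every backslash so the CSDL compiler sees an escaped \\."""
    return argument.replace("\\", "\\\\")


def regex_rule(target: str, pattern: str) -> str:
    """Build a REGEX_PARTIAL rule string from an ordinary regex pattern.

    Assumes the pattern contains no double quotes or other characters
    that need their own escape sequence.
    """
    return f'{target} REGEX_PARTIAL "{escape_backslashes(pattern)}"'
```

For example, the regex `\d+` would be written into the rule as `\\d+`.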

We have made this change to enable users to search for terms containing some of the control characters that are used by the new operators we have added. See here for mappings.

Now, time to look at all the shiny new bits

Introducing the ANY operator

Now that our alpha users have had time to play about with the CSDL, they have started to create streams that search for ever-increasing numbers of terms, from @users & brands to an exceedingly long rule searching for rude words that we didn’t even know existed. One thing all of these rules have in common is that they are very long chains of interaction.content CONTAINS "term" connected by ORs, and they are rather cumbersome to use. Like all good developers, our team likes to do as little typing as they can get away with, and thus we came up with the ANY operator. This allows you to specify a comma-separated list of terms to search for (using the new CONTAINS implementation) and returns true as long as at least one of the items in its argument matches the target.

For example, searching for phone manufacturers used to be written like this:

interaction.content CONTAINS "HTC" OR interaction.content CONTAINS "Nokia" OR interaction.content CONTAINS "RIM" OR interaction.content CONTAINS "Apple" OR interaction.content CONTAINS "Samsung" OR interaction.content CONTAINS "Sony"

It can now be shortened to this:

interaction.content ANY "HTC,Nokia,RIM,Apple,Samsung,Sony"
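
If your term lists live in code, both forms can be generated from the same list. The following Python sketch shows the idea; the function names are made up for illustration, and terms containing a comma are rejected rather than escaped, since the escape mappings are not covered in this post.

```python
def contains_chain(target: str, terms: list) -> str:
    """Build the old-style rule: one CONTAINS per term, joined by OR."""
    return " OR ".join(f'{target} CONTAINS "{term}"' for term in terms)


def any_rule(target: str, terms: list) -> str:
    """Build the equivalent rule using the ANY operator."""
    for term in terms:
        if "," in term:
            # A comma would be read as a separator inside the ANY argument.
            raise ValueError(f"term needs escaping: {term!r}")
    joined = ",".join(terms)
    return f'{target} ANY "{joined}"'
```

Passing the six manufacturer names to `any_rule` with the target `interaction.content` reproduces the short rule shown above.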

Introducing the NEAR operator

This is another operator that was born out of our users’ feedback. When searching for multiple terms that all have to be present for a match to be successful, it sometimes helps if all of these terms are close to each other, particularly when processing a stream of blog posts, where each interaction can contain several thousand words.
By using the NEAR operator you can specify two or more words that have to be present, as well as the maximum number of words that they can be apart from each other.

interaction.content NEAR "fish,chips:1"

This will match “fish chips”, “fish and chips”, “fish n chips” and “fish & chips”.
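
The proximity check can be approximated in a few lines of Python to show what the argument means. This is a simplified sketch, not the engine's algorithm: it assumes whitespace tokenisation, case-insensitive comparison, no required word order, and that the number after the colon is the maximum count of words allowed between neighbouring terms.

```python
from itertools import product


def near(text: str, argument: str) -> bool:
    """Emulate NEAR for an argument such as "fish,chips:1"."""
    terms_part, _, distance_part = argument.rpartition(":")
    terms = [term.strip().lower() for term in terms_part.split(",")]
    max_gap = int(distance_part)

    tokens = text.lower().split()
    # For each term, every token position where it occurs.
    positions = [[i for i, tok in enumerate(tokens) if tok == term] for term in terms]
    if any(not found for found in positions):
        return False  # at least one term is missing entirely

    # Try every combination of occurrences and check the gaps between neighbours.
    for combo in product(*positions):
        ordered = sorted(combo)
        if all(b - a - 1 <= max_gap for a, b in zip(ordered, ordered[1:])):
            return True
    return False
```

Under those assumptions, `near("fish and chips", "fish,chips:1")` is true, while a sentence with two or more words between the terms is not matched.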
