It's a new year, and we are super excited about what is coming in 2011. Big data and real-time are hot topics, and we feel that DataSift is ready to take on the challenge of helping our customers find the right data and, soon, analyse it.
So what did we do in 2010?
DataSift was built as a platform for consumers and businesses to sift through the real-time web to discover content. We launched the Alpha three months ago and within a month had several thousand signups (helped in part by making it into the final six at TechCrunch Disrupt).
The development team had a lot of very challenging technical problems to solve, working with vast quantities of real-time data and building a platform designed to be future-proof. We estimate that by the end of 2011 we could be processing over 1 billion pieces of information each day. On top of the scale of sifting all that data, we have also put a lot of effort into a scalable, distributed delivery system that supports Web Sockets and HTTP Streaming (plus an old-style REST API).
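To put that estimate in perspective, here is a quick back-of-the-envelope calculation of the average rate it implies. The variable names are ours for illustration, and the result is only an average; real-time traffic tends to be bursty, so peaks would likely be higher.

```python
# Back-of-the-envelope: average throughput implied by 1 billion items per day.
# Names are illustrative; the only input taken from the estimate is ITEMS_PER_DAY.
ITEMS_PER_DAY = 1_000_000_000
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day

avg_per_second = ITEMS_PER_DAY / SECONDS_PER_DAY
print(f"{avg_per_second:,.0f} items per second on average")
```

That works out to roughly 11,500 items every second, around the clock.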
We have also put a lot of effort into integrating new sources of data. The Alpha launched with 100% of the Twitter Firehose, but we have quickly added MySpace, Google Buzz, SixApart and Digg (all of them 100% real-time). 2011 will see many more sources, from location-based services to more social networks.
The CSDL language has grown quickly in the last three months, adding list support, word and phrase matching, and a whole lot more.
And what can you expect for 2011?
1) Data Storage – We know that real-time delivery of our curated content is only one use case and that storage is important to many of you. Thanks to our experience with TweetMeme (http://tweetmeme.com) we know how to deal with large-scale data storage, but we wanted to take things one step further with DataSift. Firstly, all streams can be recorded with one click (we also allow scheduling). Once recorded (or while still recording), the data can be exported or accessed via our API.
2) Data Processing – Collection of data is only part of the story. We are also building a processing platform around Hadoop that allows recorded streams to be post-processed, with the results placed back into the Data Storage. These results can then be exported or accessed via our API. The scripts for doing the data processing (I can’t give away yet what we are using) will also be collaborative (like our stream building) so our users can share and re-use them. We are excited to see what business intelligence and analytics applications will be built upon this new platform.
3) Graphical Interface – The current CSDL language requires a reasonable amount of technical knowledge, so we are building a GUI that gives easier ways to build complex streams. This will include a way to plug in third-party tools to allow easy importing of external data.