For over two years, DataSift has been hard at work making access to Social Data easier in a number of fundamental ways. We started by approaching the market in a new way – ingesting ‘everything’ and bringing a consistent approach to the access, filtering and delivery of the data. This was no small task – we were taking on the Firehoses of all the major social networks and pushing them through a single pipe, which required new technologies to be developed and even the use of cutting-edge hardware to deal with demands for latency and bandwidth beyond most data centres’ dreams.
The advantages to our customers quickly became evident: the simplicity of being able to treat all data as one (a common framework, you could say) and, most importantly, the power of a language designed for the purpose of dealing with unstructured data. Over time we also built out ever more elegant solutions for delivery, starting with real-time (and I mean sub-200ms real-time) and then bulletproof, enterprise-grade delivery mechanisms that can transmit terabytes of data without losing a single byte.
The market quickly adapted to our newfound powers, and the ecosystem of smart companies building ever more powerful Social solutions has been amazing to watch. Development time and cost have changed radically, and customers are ever more focused on analytical problems rather than data integration problems.
A year ago we introduced historical access to Twitter (and now most other data sources). This was a labour of love for the engineering team and me. My guiding principle is always simplicity – I wanted the access to use the same language, the same powers and, importantly, the same delivery methods. No commercial (or Open Source) solution existed that did what we wanted. This meant taking a number of open source projects, throwing most of them away and building something revolutionary. The end results, we think, speak for themselves. We have delivered a consistent approach to accessing data, letting customers switch between real-time and historical so that they can focus on the questions to ask.
I always thought that the ‘Big Data’ trend was about two things. Firstly, most companies now had many data sources, in most cases in different formats and different structures (schemas to the technical), and joining them was extremely difficult. Secondly, a lot of the data was unstructured and could not be queried with traditional databases. A lot of new technologies have been developed to solve both of these problems – but something happened along the way. In reality a third problem became the focus, that of scale. Companies increasingly found that the data they were generating was beyond the capabilities of the older, traditional solutions.
The first two I have started to call the ‘elephants in the room’ – the solutions are there, but they have proved to be expensive and in most cases require lots of developers. They also require a new, rare breed of ‘data scientists’ to tackle even the simplest of problems.
Social Data is very much a Big Data problem. First, the data generated each day is beyond the reach of 99.99% of businesses, and storing even a fraction of it is a challenge. Second, the content itself is for the most part unstructured. If you look at a Tweet – there is almost nothing you can do to it in a purely analytical sense other than count it. For two years I have been wrestling with how we could both make understanding the data simple and bring it into context for business.
So today we are announcing VEDO – an extension of our core platform that brings programmable intelligence to the masses. Building upon our incredibly rich text pre-processing and parsing capabilities, we have added a whole new engine that allows customers to take advantage of advances in machine learning, statistical models, rich taxonomies and much more, all through a simple and unified approach. As with the rest of our platform, we want to reduce the cost of developing this kind of functionality for our customers and let them focus on innovation rather than infrastructure.
VEDO brings the power to understand the context and the meaning of the content itself. It can be trained to understand any subject and to contextualize it so that the data can be inherently joined to other structured data within the business. This, to me, goes to the heart of the value of Social – bringing it together with other business data to set it in context, so that customers can understand why and how Social is impacting them and make decisions off the back of it.
We look forward to seeing the next revolution in Social Applications and Social Data delivered into the Enterprise.
Check our developers blog for a deeper overview of VEDO.
One last thing – this is just the beginning. DataSift has always been about social, and we will continue to make it easier than ever to find, to understand and to integrate social data. But DataSift is a Big Data platform that understands unstructured data. To fully understand social you also need to understand other unstructured content within your business. We have already integrated private enterprise social networks, and we will soon allow any unstructured data to be delivered into the platform and used alongside social.