The Twitter/Gnip Gap: Data Licensing vs Data Processing

13th April 2015 14 Comments

The news of Twitter’s termination of our data licensing contract will mean disruption for hundreds of companies. For most businesses, this will not just be a case of switching from one supplier to another: over 80% of our customers leverage capabilities that do not exist in Gnip.

This blog post is for those of you who wish to continue using Twitter data. It highlights the main differences that need to be taken into consideration when planning a transition to Gnip. It will also help you identify the features that you will need to deprecate in your own product as there are no workarounds possible.

A fundamentally different approach: Gnip data licensing vs DataSift data processing
A fundamental question to answer is: “If DataSift and Gnip both have the Twitter firehose, how come 80% of DataSift customers use unique capabilities?”

As any developer knows, extracting insights from 500 million Tweets a day is no simple task. The basic steps in collecting and preparing data for analysis are:

Data Extraction/Filtering: Each Tweet is up to 140 characters of unstructured text, and 30% contain links to content on other sites. How do you decide which data is relevant for your analysis? Sifting the data you want from the data you don’t is a text-mining and filtering problem.

Data Enrichment/Interpretation: How do you interpret text within each Tweet to extract its meaning? For example, understanding the sentiment, topic or intent that’s expressed within a single Tweet. Without this, all you can do is count up vanity metrics on the number of times that a brand name was mentioned. Solving this is a text-analysis problem.

Data Delivery: How do you continually deliver large volumes of the enriched, filtered data into your own platform, ready for further analysis? Given the volumes of data being delivered in real time, this is a hard problem to solve. Developers want to know that data is guaranteed for delivery, buffered if there is a problem in their own infrastructure, and can easily be mapped to the target database schema they want to receive data in.

Data Licensing: Finally, you pay for the data you received from the firehose. This is the final transaction, which takes place at the end of the month: you pay for what you received, at the Twitter rate of $0.10 for every 1,000 Tweets.
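As a rough illustration of the licensing arithmetic, here is a minimal Python sketch. The only figure taken from this post is the rate of $0.10 per 1,000 Tweets; the function name `licensing_cost`, the pro-rating of partial thousands and the rounding to whole cents are assumptions made for the example, not a description of how Twitter actually invoices.

```python
from decimal import Decimal

# Rate quoted in this post: $0.10 for every 1,000 Tweets.
RATE_PER_THOUSAND = Decimal("0.10")


def licensing_cost(tweets_received: int) -> Decimal:
    """Estimate the monthly licensing fee in dollars (hypothetical helper).

    Assumes partial thousands are pro-rated and the total is rounded
    to the nearest cent; real billing rules may differ.
    """
    if tweets_received < 0:
        raise ValueError("tweets_received must be non-negative")
    cost = Decimal(tweets_received) / 1000 * RATE_PER_THOUSAND
    return cost.quantize(Decimal("0.01"))


# Example: a filter that delivered 2.5 million Tweets in a month.
monthly_fee = licensing_cost(2_500_000)
```

Under these assumptions, 2.5 million delivered Tweets works out to $250 for the month, which is why tight filtering matters: you pay for every Tweet you receive.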

The focus for DataSift has always been to provide an integrated platform to do the “heavy lifting” across all these areas, enabling developers to focus on building insights, not infrastructure. In contrast, Gnip has focused on providing simple data extraction capabilities and data licensing. As a result, 80% of our customers will have to build new infrastructure.
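To make the first three steps concrete, here is a toy Python sketch of a filter, enrich and deliver pipeline. It is not DataSift’s or Gnip’s API: every name in it (`filter_tweets`, `enrich`, `deliver`, the word lists and the sample Tweets) is hypothetical, and the word-count “sentiment” is a deliberately naive stand-in for real text analysis.

```python
from collections import deque

# Hypothetical word lists; real sentiment analysis is far more involved.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "awful", "bad"}


def filter_tweets(tweets, keywords):
    """Step 1: keep only Tweets whose text mentions one of the keywords."""
    for tweet in tweets:
        words = set(tweet["text"].lower().split())
        if words & keywords:
            yield tweet


def enrich(tweet):
    """Step 2: attach a naive sentiment label based on word counts."""
    words = tweet["text"].lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {**tweet, "sentiment": label}


def deliver(tweets, send, buffer):
    """Step 3: push each Tweet to the destination, buffering any that fail."""
    for tweet in tweets:
        try:
            send(tweet)
        except ConnectionError:
            buffer.append(tweet)  # hold for a later retry


# Made-up sample data standing in for a stream of Tweets.
sample = [
    {"text": "I love my new phone"},
    {"text": "Traffic is awful today"},
    {"text": "The phone battery is bad"},
]
received, pending = [], deque()
enriched = (enrich(t) for t in filter_tweets(sample, {"phone"}))
deliver(enriched, received.append, pending)
```

Even in this toy form, each stage is separate infrastructure that has to be built, scaled and monitored once it is no longer provided by the platform.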

Capabilities you’ll lose when transitioning
We’ve written a developer blog post that covers this in more detail, but at a summary level, here is a checklist to get you started in thinking about a transition plan. The goal here is to highlight the features in DataSift and the Gnip gap.

Historic access
More than 50% of our Twitter customers leverage DataSift to process historical Twitter data.

When accessing historic data with DataSift you have access to the same augmentation, filtering, classification and delivery features as when accessing real-time streams. Our platform provides a consistent experience across historic and real-time data access.

Accessing historic data through Gnip has the same limitations as their real-time service for augmentation, filtering and classification options. Also, results are returned as sets of raw files, leaving you to handle integration into your application rather than benefiting from the seamless delivery features provided by DataSift push connectors.

Closing the gaps
We remain committed to enabling our customers to gain insights from the universe of social data sources. However, our options to assist customers in closing these gaps are limited, especially given the short deadline of August 13th for transitioning your application to Gnip. We are evaluating options for how we can best assist and will post an update on this in the coming days.

  • Claude Gibert

    Very good summary, Tim. I would like to add Historic Preview to your list, as an inexpensive tool to estimate the cost and relevance of historic retrievals.

  • acotgreave

    Great summary, Tim. This decision is going to cost us a great deal of time and money and send us backwards in terms of our analytics. Right now my sentiment towards Twitter Inc would be easy to measure.

    • Tim Barker

      Thanks @Andy. We’ll follow up with you on the options. Tableau integration leverages a good number of these capabilities: enrichments (e.g. sentiment), classifications (tagging) and data delivery into databases that support Tableau.

  • Michael Alatortsev

    If anyone is stuck, we do a lot of similar Twitter data analysis @iTrendHQ, will be happy to help with migration or alternatives.

  • Fabio De Bernardi

    Perhaps a naive question… but wouldn’t it be possible for DataSift to buy data from Gnip at whatever the standard cost is and resell it to your customers? I would expect lower margins for DataSift, but since Twitter is not the only platform covered that would be a sort of “loss leader”. Unless Gnip’s T&Cs don’t allow that…

    • Zuzanna

      Hey Fabio! Not a naive question at all. It’s not that easy to wrap your head around exactly what it is that we do. The pain of being the only kid in the industry who does ‘that thing’. Your solution would work if DataSift was an analytics company, like our customers. But we’re not. We’re a data processing company. You know: taking in any unstructured data, normalizing it, categorizing it, filtering it, augmenting it and letting our customers find the answers they need for their products and their customers. You know how much I love my food. It doesn’t fit perfectly as an analogy, but I always think of DataSift as my LEM meat grinder. It’s an industrial-strength grinder – it can demolish an entire deer in no time. Obviously, it’s not just for deer. It will grind pig and cow and rabbit etc. DataSift will take in the meat, perhaps add some garlic and marjoram, and grind it, but we don’t make sausages. Our customers do. And their customers eat them. Boy, I’m hungry now. Does this make it any clearer?

      • Fabio De Bernardi

        It’s now clear to me that you can grind a pig, cow or rabbit with your LEM grinder, not only a deer 🙂
        On the other front, what difference would there be between Gnip data and the data you now pull from Twitter? Anyway, I don’t want to open an endless debate here 🙂

        • Zuzanna

          Data? Zero difference. Same firehose. But the companies that DataSift and Gnip are working with do not have the infrastructure to handle the firehose. So the difference between us is in how Twitter data is handled/pre-processed before it’s delivered to them. See the above blog post about the functionality gap. I also expect that there’s some confusion between the Twitter firehose and Twitter data. DataSift, Gnip and NTT were the only companies with firehose access. Lots of companies have access to pre-processed Twitter data.

          • Fabio De Bernardi

            Ok, the difference between firehose and pre-processed data is interesting… as in, I don’t know the difference (although perhaps I should). My point is, if you can still get all the tweets you want/need by paying whatever Twitter wants you to pay, why not buy them, throw them into your ‘grinder’ and continue to provide the data you were providing to your clients? If this is possible (or maybe the solution is in the firehose vs pre-processed difference?) then it becomes purely a financial matter. Not that that’s irrelevant, but I’m sure it can be considered secondary.

          • Zuzanna

            I know what you’re asking – yes, it’s possible, it won’t be the same data (it will be pre-processed/devoid of nutrients), but we’re currently working on what the best solution for our customers is. Needless to say, they are our primary concern.

  • Sankar Nagarajan

    Would be happy to explore possibilities to help companies migrating to Gnip, for instance custom Gnip adaptor development (API integration), data enrichments, custom filtering, sentiment analysis and topic processing, some of which we already have @_textient

  • Steve Butterworth

    This is a real pain for your customers. DataSift processes and augments the firehose in a way no one else does, and it’s really hard to do. As well as using DataSift at Flumes, we also use Twitter data direct for some customers and augment it ourselves, so if this is useful to any customers scratching their heads as to how to move on, do let us know @flumesmedia

  • Converseon

    Hi Tim, we’ve been there too, so we understand (and sympathize with) the challenge. We were an original firehose partner in the early days before being transitioned to the aggregation partners (Gnip), and these changes are definitely a challenge, especially for customers that don’t have robust programmatic capabilities. I also agree that the real (and growing) needs are the advanced filtering/analytics/classification technologies to find the signals in the noise. The question is whether these should exist at the broader aggregation level, like DataSift, or closer to the customer, so that the feeds are configured/filtered more specifically to the brand/vertical to integrate into their organization (like we at Converseon/Revealed Context do). I think the answer is likely both. I anticipate/hope we see the emergence of more marketplaces so that organizations can get the best of both worlds: broader analysis at the aggregation level, plus the ability for brands to further filter/analyze all the data across sources with analytics tuned to specific brands/verticals and with custom NLP/classifiers, to consistently get to “one truth” in the organization while still using the preferred analytics engines embedded in their organizations. That would certainly be a competitive advantage.

    • Tim Barker

      Hi @Converseon, I agree, the answer is both. We want to enable unified access to the universe of data, but we can’t get religious about which algorithms are best to interpret it. What we’ve tried to do so far is provide the capability to operationalize a custom classifier that you can build/train (we call this VEDO). Over time, we want to enable more options for this so that companies can leverage an integrated stack as well as plug in best-of-breed NLP/classifiers. That’s not easy to do, but if companies have a preferred approach for analysis, we can make the whole market bigger and more innovative if we can support this. Give us time and we’ll get there!
