Biting into Social Data for the Business Analyst: a Primer

5th June 2014

In the age of Big Data, with so many powerful apps available for social listening, engagement and analytics, it’s easy to forget that in many digital agencies and research firms, CSV files, pivot tables and (gasp!) Excel charts are still ubiquitous. The truth is, for certain types of monthly or quarterly reports, they just work; they’re reliable and easy to use.

The challenge is that these tools don’t accommodate big social data use cases: millions of tweets, Facebook updates, discussion threads and blog posts. Like a Chicago-style hot dog, social data is quick and messy.

Don’t let this intimidate you. With the right filters in place, you can get to a manageable quantity of social data and, in most cases, keep using the tools you already have. Once you get started, you can determine whether your datasets are small enough to carry on using periodic reports generated from CSV, or whether you should integrate an application with the DataSift platform that better suits your needs.

Let’s take a closer look at how to get this done.

Three Tips for Making Social Data Manageable
Here are a few ways you can take advantage of all of DataSift’s rich filtering and augmentation capabilities while working with raw social data in the DataSift platform:

    1. Use Live Preview or Historics Preview to get a sense of your volumes.
      If you’re listening to real-time data, Live Preview gives you a glimpse of how many interactions per second your filter will return before you start recording. You can take a similar approach to historical research: Historics Preview reports volumes across a 1% sample of our archive. Either way, you’ll have a sense of how much data you’ll bring back, and whether you’re ready to handle it.

    2. If volumes are too high, reduce!
      A variety of filterable fields across our data sources can be used to reduce throughput volumes to manageable levels. We call these filterable fields “targets” in our filtering language, Curated Stream Definition Language (CSDL). For example, you can use interaction.sample to receive just a small percentage of the conversations matching your other criteria, such as keywords, usernames or geo coordinates.

      Also, use tools like named entity recognition (Salience Entities) to separate genuine mentions of common-word brands like Tide, Dove, Gap, Target or Apple from noise. Another way to reduce volumes is the contains_near operator, which returns only conversations in which specific words appear in proximity to each other.

    3. Start with formats you know.
      After recording a live stream to DataSift Storage, you can export your recording as CSV right inside our user interface. Be aware that Microsoft Excel 2013 caps a worksheet at 1,048,576 rows (about 1 million), and you’ll typically need at least 4GB of RAM on your machine to work with a file near that maximum size.
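
To make the arithmetic behind these tips concrete, here is a minimal Python sketch. It is not part of the DataSift platform or its client libraries. The function names, the one-row-per-interaction assumption and the file layout are all hypothetical, chosen only for illustration. It projects row counts from a preview rate, scales a 1% sample up to full volume, works out what sampling percentage would keep a recording inside Excel's worksheet limit, and splits an oversized CSV export into Excel-sized parts.

```python
import csv
import math
from pathlib import Path

# Excel 2007 and later allow at most 1,048,576 rows per worksheet.
EXCEL_MAX_ROWS = 1_048_576


def estimate_rows(interactions_per_second: float, hours: float) -> int:
    """Project the row count of a recording at a steady preview rate.

    Assumes one CSV row per interaction (an assumption for this sketch).
    """
    return math.ceil(interactions_per_second * hours * 3600)


def scale_from_sample(sample_count: int, sample_percent: float = 1.0) -> int:
    """Scale a count seen in a partial sample (e.g. 1%) up to full volume."""
    return math.ceil(sample_count * 100 / sample_percent)


def sample_percent_to_fit(expected_rows: int,
                          row_budget: int = EXCEL_MAX_ROWS - 1) -> float:
    """Largest sampling percentage (at most 100) that stays within budget.

    The default budget leaves one worksheet row free for the header.
    """
    if expected_rows <= row_budget:
        return 100.0
    return 100.0 * row_budget / expected_rows


def split_csv(source: Path, out_dir: Path,
              max_data_rows: int = EXCEL_MAX_ROWS - 1) -> list[Path]:
    """Split a CSV export into parts that each fit in one worksheet.

    The header row is repeated at the top of every part, so each part
    holds the header plus at most max_data_rows data rows.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    with source.open(newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:
            return parts  # empty file: nothing to split
        out = None
        writer = None
        rows_in_part = 0
        try:
            for row in reader:
                # Open a new part for the first row or when the current one is full.
                if writer is None or rows_in_part == max_data_rows:
                    if out is not None:
                        out.close()
                    part_path = out_dir / f"{source.stem}_part{len(parts) + 1}.csv"
                    out = part_path.open("w", newline="", encoding="utf-8")
                    writer = csv.writer(out)
                    writer.writerow(header)
                    parts.append(part_path)
                    rows_in_part = 0
                writer.writerow(row)
                rows_in_part += 1
        finally:
            if out is not None:
                out.close()
    return parts


if __name__ == "__main__":
    # Hypothetical figures: a preview showing 20 interactions per second,
    # recorded for 24 hours.
    rows = estimate_rows(20, 24)
    print(f"Projected rows: {rows:,}")
    print(f"Sample to about {sample_percent_to_fit(rows):.1f}% to fit one worksheet")
```

If the projected volume is over budget, the percentage from sample_percent_to_fit gives you a starting point for an interaction.sample threshold in your filter. If you would rather keep every interaction, split_csv lets you open the export in Excel one part at a time.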

Also not unlike a Chicago dog piled high with hot peppers and relish, social data can seem intimidating at first. But once you take a bite, you’ll understand what all the fuss is about.

Written by Jay Krall

Jay Krall is a Technical Product Manager for DataSift based in Reading, UK. He's originally from Chicago and likes hot dogs.