Measuring Your Share of Voice Using DataSift

20th October 2014

In the latest of our use case posts, we look at two brands – Home Depot and Lowes – and compare their share of voice. Does one brand have more social presence than the other? Does one brand outperform the other in terms of volume, or in hashtag and mention usage? How do they compare geographically? Here’s a simple, end-to-end scenario for collecting, consuming and analyzing DataSift data to discover your share of voice. Let’s find out who fares better: Home Depot or Lowes.


These are the solution prerequisites:

  • Share of voice (use case)
  • DataSift platform (source for data)
  • Historics preview (filter evaluation)
  • Amazon S3 (data destination)
  • Alteryx (file handling and ingestion)
  • Tableau (visualizations)

The end-to-end process is as follows:

1. Brand identification
2. CSDL filter definition
3. Historics for preview and data acquisition
4. Configure data destination (Amazon S3)
5. Ingest and shape the data (Alteryx)
6. Visualizations in Tableau

Brand identification
The use case is important; it is where you start and where you end with this exercise. It defines the CSDL filter criteria and also what data points are of interest and what to measure within the output. The CSDL filter was constructed to capture conversations around the brands at the interaction level including hashtags, mentions, and brand links.

Historics preview
Once the CSDL has been defined, run a Historics Preview to understand the data volume and potential noise. The process uses a 1% sample over the period specified (up to 30 days) and returns useful information like interaction counts, interaction types, languages, and even a word cloud so you can identify any potentially noisy terms. To create a Historics Preview, open your stream and click the “Historics Preview” button, then select the time period you are interested in (we recommend the Basic default preview). There is a small DPU charge for this exercise, but it is worthwhile when you consider the data costs should your filter contain a misstep. Because the volume shown in the interactions chart comes from the 1% sample, multiply it by 100 to estimate what a 100% sample would return.

The Historics time range is July 31, 2014 – August 31, 2014. The 1% sample returned 2,215 interactions, so a full sample would return approximately 221,500 interactions.
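The extrapolation above is simple enough to do by hand, but a small helper makes the arithmetic explicit. This is only a sketch: the function name and its `sample_percent` parameter are made up for illustration and are not part of any DataSift API.

```python
def estimate_full_volume(sample_count: int, sample_percent: int = 1) -> int:
    """Scale a sampled interaction count up to an estimated 100% volume.

    sample_percent is the sampling rate of the preview as a whole
    percentage (the Historics Preview described above uses 1).
    """
    if not 0 < sample_percent <= 100:
        raise ValueError("sample_percent must be between 1 and 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    return sample_count * 100 // sample_percent


# The preview in this post returned 2,215 interactions at a 1% sample.
print(estimate_full_volume(2215))  # 221500
```

Treat the result as a rough planning figure; a 1% sample gives an estimate, not an exact count.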

Configure data destination (Amazon S3)
Before the data can be collected (via a DataSift Historics task), a data destination has to be defined, i.e. the place where DataSift will deliver the data. We’re using the cloud-based file storage system Amazon S3 here, which is the ideal option for architectures where inbound firewall ports cannot be easily opened. You can create an Amazon AWS account here. Use this link to configure your Amazon S3 account for DataSift.

When configuring the S3 destination, be sure to use the format “JSON_New_Line”. This makes file consolidation simpler. A nice feature of the S3 destination in DataSift is the ability to prefix your files. For this exercise I will prefix all of my files with “SOV”.

Historic data pull
With the filter created and the data destination configured, the DataSift Historics task can be run. Open the stream and select Historics Query. Select the timeframe you want (<=30 days) and the sources of interest. For this exercise a 100% sample was used.

The Amazon S3 Push connector will create a new file with each delivery so the resulting data set is a collection of files. Download the directory of files to a local machine.
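If you would like a single file rather than a directory of deliveries, the newline-delimited format makes that easy, since each line is a complete JSON object. The sketch below is one way to do it with the Python standard library. It assumes the downloaded file names start with the “SOV” prefix chosen earlier; the function name and the output file name are hypothetical.

```python
from pathlib import Path


def consolidate(src_dir, out_path, prefix="SOV"):
    """Concatenate newline-delimited JSON files into one file.

    Returns the number of interactions (non-blank lines) written.
    Assumes every delivered file name starts with `prefix`.
    """
    out_path = Path(out_path)
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).glob(f"{prefix}*")):
            # Skip sub-directories and the output file itself.
            if not path.is_file() or path.resolve() == out_path.resolve():
                continue
            with path.open(encoding="utf-8") as src:
                for line in src:
                    line = line.strip()
                    if line:  # ignore blank lines between records
                        out.write(line + "\n")
                        count += 1
    return count
```

Calling `consolidate("downloads", "combined.ndjson")` would merge every `SOV*` file in a `downloads` folder. This step is optional here, because the Alteryx module described next reads the whole directory directly.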

Data ingestion and shaping
Alteryx is a powerful multi-purpose data blending and analytic tool. For this exercise it will be used for:

  • Ingesting and parsing JSON data
  • Data shaping
  • Output file generation

The Alteryx module first points to the directory where the JSON files are stored. Using a wildcard operator the input tool looks for all files in this directory so be sure only JSON files are located here. During the ingestion process, a unique record number is attached to each interaction and the JSON is then parsed into key-value pairs.
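For readers without Alteryx, the ingestion step can be approximated in a few lines of Python. This is not the Alteryx module itself, just a sketch of the same idea: read every file matching a wildcard, number each interaction, and flatten the JSON into key-value pairs. The `*.json` pattern and the dotted key naming are assumptions made for this example.

```python
import json
from pathlib import Path


def flatten(value, prefix=""):
    """Yield (dotted_key, leaf_value) pairs for nested dicts and lists."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}.{index}" if prefix else str(index))
    else:
        yield prefix, value


def ingest(directory, pattern="*.json"):
    """Yield (record_id, key, value) for every interaction in the directory.

    Each non-blank line is expected to hold one JSON object
    (the JSON_New_Line delivery format).
    """
    record_id = 0
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if not line.strip():
                    continue
                record_id += 1  # unique number per interaction
                for key, value in flatten(json.loads(line)):
                    yield record_id, key, value
```

Array elements end up with their position in the key (for example `interaction.hashtags.0`), which keeps the original order available for later steps.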

The next step is to normalize the array fields that will be used for this exercise – hashtags, mentions, links, and brand tags. Then, only select fields are passed through to complete the data-shaping portion of the module.
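To make “normalizing the array fields” concrete, here is a rough Python equivalent that turns each array into one row per element. The field paths in `ARRAY_FIELDS` are assumptions about where these values sit in a delivered interaction; check them against your own data and adjust as needed.

```python
# Assumed locations of the array fields inside one interaction (hypothetical).
ARRAY_FIELDS = {
    "hashtag": ("interaction", "hashtags"),
    "mention": ("interaction", "mentions"),
    "link": ("links", "url"),
    "brand": ("interaction", "tags"),
}


def dig(record, path):
    """Follow a tuple of keys into nested dicts; return None if any is missing."""
    for key in path:
        if not isinstance(record, dict):
            return None
        record = record.get(key)
    return record


def normalize_arrays(record_id, record, fields=ARRAY_FIELDS):
    """Return one (record_id, field, position, value) row per array element."""
    rows = []
    for name, path in fields.items():
        values = dig(record, path)
        if values is None:
            continue  # this interaction does not carry the field
        if not isinstance(values, list):
            values = [values]  # tolerate a single scalar value
        for position, value in enumerate(values):
            rows.append((record_id, name, position, value))
    return rows
```

Keeping the record number on every row is what lets the hashtag, mention, link and brand rows be joined back to their interaction later.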

Data is run through a crosstab tool to turn rows into columns. In addition the date is reformatted and the data type changed so it will be readable in Tableau.
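The crosstab and date steps can be sketched in Python as well. The example assumes the key-value rows look like `(record_id, key, value)` and that the timestamp arrives as an RFC 2822 string such as "Sun, 31 Aug 2014 23:59:59 +0000"; both are assumptions for illustration, so verify the date format in your own files.

```python
from collections import defaultdict
from email.utils import parsedate_to_datetime


def crosstab(rows):
    """Pivot (record_id, key, value) rows into one dict of columns per record."""
    table = defaultdict(dict)
    for record_id, key, value in rows:
        table[record_id][key] = value
    return dict(table)


def reformat_date(text):
    """Convert an RFC 2822 date string to 'YYYY-MM-DD HH:MM:SS'."""
    return parsedate_to_datetime(text).strftime("%Y-%m-%d %H:%M:%S")


rows = [
    (1, "interaction.id", "a"),
    (1, "interaction.created_at", "Sun, 31 Aug 2014 23:59:59 +0000"),
]
record = crosstab(rows)[1]
print(reformat_date(record["interaction.created_at"]))  # 2014-08-31 23:59:59
```

The reformatted string keeps the time as delivered and drops the UTC offset, so convert to a common time zone first if your data mixes offsets.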

Next, the arrays are unlisted into rows, preserving the order of operations. The final output is delivered as Tableau data extracts (TDE files). The entire process for more than 200,000 records takes about five minutes to complete.

The following Tableau data extracts will be created as part of this process:

  • brand.tde
  • hashtags.tde

The Alteryx module can be found here

Below are example dashboards that can be generated quickly to visualize the share of voice data acquired in this exercise. Just open Tableau and connect to the data extracts created in the previous step and start creating.
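If you want a quick sanity check before building dashboards, the headline share of voice number is just each brand's count divided by the total. The snippet below is a sketch of that calculation; the input list uses made-up toy values, not results from this exercise, and the function name is hypothetical.

```python
from collections import Counter


def share_of_voice(brand_labels):
    """Return each brand's percentage of all brand-tagged interactions."""
    counts = Counter(brand_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {brand: round(100 * n / total, 1) for brand, n in counts.items()}


# Toy input only: one brand label per tagged interaction.
toy_labels = ["Home Depot", "Home Depot", "Home Depot", "Lowes"]
print(share_of_voice(toy_labels))  # {'Home Depot': 75.0, 'Lowes': 25.0}
```

The same counting approach works for hashtags or mentions: pass in one label per row from the relevant extract.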





All associated scripts and assets can be downloaded from the GitHub repository here
