Behind the Scenes: How We Identified WordPress’ Top 10 Content Categories

28th August 2014

Yesterday we showed the top 10 WordPress categories worldwide. Wonder how we did it?

Each month, DataSift receives approximately 15-25 million full-text WordPress blog posts into our platform. Recently, one of our customers asked what topics are discussed on blogs. Rather than simply say “everything”, we decided to take a deeper look into the content and substance of WordPress blogs. Using DataSift’s Salience Topics and Salience Entities augmentations, we can quickly categorize the sorts of posts coming into the platform. After grouping DataSift’s standard 40 Topics into 10 higher-order categories, we wanted to see which Entities (people, companies, places, things, or ideas) showed up most often for a given Topic.

The Dataset
DataSift’s WordPress data source covers all blogs hosted on WordPress.com, as well as any WordPress-based site running the Jetpack plugin. The WordPress data source provides several different types of interaction (including comments, posts, and likes). To keep things manageable, we’ll take a 10% sample and make sure we’re only looking at English-language blog posts where both an entity and a topic have been successfully identified.

interaction.type == "wordpress"
and wordpress.type == "post"
and language.tag == "en"
and language.confidence > 80
and salience.content.entities.name exists
and (
  salience.title.topics exists
  or salience.content.topics exists
)
and interaction.sample < 10

Processing and Counting
Once the data had been delivered (in this case, as line-delimited JSON files in an Amazon S3 bucket), counting the topics was fairly straightforward. We accepted that we weren’t looking for an absolute count and were willing to count posts with multiple topics more than once. While this introduces the possibility of double-counting errors (specifically as topics bubble up into higher-level categories), our spot-checking revealed that very few posts had more than one topic, and as this is only for a high-level overview, we’re OK with that.
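For anyone who does want to avoid the double-counting, here is a minimal sketch of one way to do it: map each post’s topics to their higher-level category and count the post once per category. This is an alternative to what we did, not our actual script. The category names “Technology” and “Other” and the mapping itself are hypothetical; only the two topic names appear in this post.

```ruby
# Hypothetical topic-to-category mapping. Only the two topic names below
# come from the post; "Technology" and "Other" are made-up labels.
TOPIC_TO_CATEGORY = {
  "Mobile Technology" => "Technology",
  "Software&Internet" => "Technology"
}.freeze

# posts_topics: an array with one entry per post, each entry being the
# list of topic names found for that post.
# Returns a Hash of category name => number of posts in that category,
# counting each post at most once per category.
def count_posts_by_category(posts_topics, mapping = TOPIC_TO_CATEGORY)
  counts = Hash.new(0)

  posts_topics.each do |topics|
    # Deduplicate after mapping, so two topics that roll up into the
    # same category only add one to that category's count.
    categories = topics.map { |topic| mapping.fetch(topic, "Other") }.uniq
    categories.each { |category| counts[category] += 1 }
  end

  counts
end

example_posts = [
  ["Mobile Technology", "Software&Internet"], # one post, two topics
  ["Mobile Technology"],
  ["Gardening"]                               # unmapped topic
]
example_counts = count_posts_by_category(example_posts)
```

With the example data, the first post contributes once to “Technology” rather than twice, which is the difference between this and a naive per-topic tally.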

Iterating through each line, we looked for the topic and then created an object containing all the keywords we saw and how frequently we saw them. We used a Ruby script to iterate through our JSON data, pull out the elements, and add to the count of each. Our Ruby Hash object looked like this:

{
    "Mobile Technology": {
        "Google": 45,
        "Apple": 30
    },
    "Software&Internet": {
        "Google": 423,
        "Yahoo": 313
    }
}
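A minimal sketch of a counting step that produces a Hash of this shape is shown here. It is an illustration rather than the original script, and it assumes a record layout that this post does not spell out: that each JSON line carries `salience.title.topics` and `salience.content.topics` as arrays of objects with a `name` key, and `salience.content.entities` likewise. It also assumes each distinct entity is counted once per post.

```ruby
require "json"

# Builds { topic name => { entity name => count } } from line-delimited JSON.
# The field layout under "salience" is an assumption (see the text above).
def count_entities_by_topic(lines)
  # Nested Hash: unknown topics get a fresh counter that defaults to 0.
  counts = Hash.new { |hash, topic| hash[topic] = Hash.new(0) }

  lines.each do |line|
    line = line.strip
    next if line.empty?

    begin
      record = JSON.parse(line)
    rescue JSON::ParserError
      next # skip malformed lines instead of aborting the whole run
    end
    next unless record.is_a?(Hash)

    salience = record["salience"] || {}

    # Topics may come from the title, the content, or both.
    topics = %w[title content]
      .flat_map { |part| salience.dig(part, "topics") || [] }
      .map { |topic| topic["name"] }
      .compact.uniq

    entities = (salience.dig("content", "entities") || [])
      .map { |entity| entity["name"] }
      .compact.uniq

    # A post with several topics is counted under each of them.
    topics.each do |topic|
      entities.each { |entity| counts[topic][entity] += 1 }
    end
  end

  counts
end
```

Because the method accepts anything that yields lines, it can be fed a file lazily with something like `count_entities_by_topic(File.foreach("interactions.json"))`, where the file name is a placeholder.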

Further Research
The advantage of having the full text is that we can use our high-level category and entity identification as a starting point. From here, we could dive into a particular topic, or see why certain keywords are showing up under a particular topic. As with DataSift’s other augmentations, the goal here is to reduce your license cost and give you a broad brush to paint with, while leaving the fine-grained filtering to more nuanced CSDL rules.
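As a starting point for that kind of digging, a small helper can rank the entities seen under a single topic. This is a sketch under the assumption that the counts are held in a nested Hash of topic name to entity counts; the method name `top_entities` is ours for illustration, and the numbers are the example figures from the Hash shown earlier in this post.

```ruby
# Returns the `limit` most frequent entities for a topic as
# [name, count] pairs, highest count first (ties broken alphabetically).
def top_entities(counts, topic, limit = 10)
  counts.fetch(topic, {})
        .sort_by { |name, count| [-count, name] }
        .first(limit)
end

# Example counts, copied from the Hash shown earlier in the post.
example_entity_counts = {
  "Mobile Technology" => { "Google" => 45, "Apple" => 30 },
  "Software&Internet" => { "Google" => 423, "Yahoo" => 313 }
}

top_software = top_entities(example_entity_counts, "Software&Internet", 1)
# top_software is [["Google", 423]]
```

An unknown topic simply returns an empty list, which keeps exploratory use from raising errors.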
