Getting a Deeper Understanding of your Content

Reading Time: 10 minutes

One of the biggest struggles for companies that produce and collect data from millions of customers is “Big Data”: a volume of information so large that processing it effectively becomes a challenge in itself. The big data problem mostly comes in the form of unstructured data (e.g. web pages, blogs, forums, legal documents, social media, call center transcripts). Some estimates say that 80% of organisational data is unstructured. This data can be transformed into a meaningful structure that allows it to be used for tasks such as gaining competitive business intelligence, analysing customer feedback, sentiment analysis and trend spotting.


The area we are focused on in this post is automatic categorization of text data. A popular method of text categorization is topic modeling, a technique from machine learning. For a broader look at topic modeling, see this post by the Journal of Digital Humanities.


The aspect we are focused on is the article content; however, there are other key features that could be used to filter and analyse documents, including the article title, description, keywords, author and publication time. In addition, we also have anonymous data about how readers engage with the content, such as how many people read an article and how long they spent reading it.


Before applying any machine learning algorithm, the text must first be preprocessed. This removes noise and reduces the dimensionality of the data, which improves the topic model's speed and accuracy and helps to avoid overfitting. Some typical preprocessing steps include:


  • Stopword removal – Removing words with little meaning such as am, and, are, as etc.
  • Case conversion – Setting all words to their lowercase form.
  • Bag-of-words (BOW) – Transforming the content into a bag-of-words representation, where each document is represented as a set of words.
  • Upper/lower bound filtering – Removing words that appear too frequently (e.g. in more than 95% of documents) and words that appear very infrequently (e.g. in fewer than 3 documents).
  • Stemming and lemmatization – Transforming words into a common base form. Stemming is a crude heuristic process that chops off the ends of words, whereas a lemmatizer tries to do things properly using a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form (get, getting and gets all become get); see source for details.
  • TF-IDF – Term Frequency – Inverse Document Frequency (TF-IDF) is a method frequently used to weight the terms in a corpus, giving each word a weight that improves the results. The term frequency counts how many times a word appears in a document, indicating how important the word is to that document. The document frequency counts how many documents in the corpus a word appears in, giving a sense of whether the word is common or unique. The "inverse" in IDF puts more weight on infrequent terms than on frequent ones. The product of these two values gives each word a weight that scales it relative to every other word in the corpus.


Each of these steps describes a typical technique commonly used in text preprocessing. Together they bring the text into a transformed state suitable for machine learning or text mining algorithms. In many cases the final result is a document-term matrix (DTM): a matrix where each row represents a document and each column represents a term, so each cell holds the relative importance of a given term in a given document. Once the text data has been transformed into a format the topic modelling algorithm can use, the next step is to apply the algorithm to the transformed data.
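The preprocessing steps above can be sketched in a few lines with scikit-learn, whose TfidfVectorizer handles lowercasing, stopword removal, bag-of-words counting, frequency-bound filtering and TF-IDF weighting in one pass. The documents below are toy placeholders, not data from this project:

```python
# A minimal preprocessing sketch: from raw text to a TF-IDF
# document-term matrix. The three documents are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The rain and cold temperature made the weather miserable.",
    "Heat the oil, add garlic and a pinch of salt.",
    "This season's dresses pair well with bold accessories.",
]

# TfidfVectorizer lowercases, removes stopwords, builds bag-of-words
# counts, applies the frequency bounds and computes TF-IDF in one step.
vectorizer = TfidfVectorizer(
    lowercase=True,
    stop_words="english",
    max_df=0.95,   # drop terms appearing in more than 95% of documents
    min_df=1,      # on a real corpus this might be e.g. min_df=3
)
dtm = vectorizer.fit_transform(docs)  # sparse document-term matrix

print(dtm.shape)  # (n documents, m terms)
```

With a real corpus, raising `min_df` implements the lower-bound filtering described above; stemming or lemmatization would be applied to the tokens before vectorization.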


The objective of the algorithm is to look at how terms occur in sentences and documents and pick up on recurring patterns in order to group words together into “topics”. Once the topic model has been created, it can be used to probabilistically assign each document to one or more topics.


There are a number of topic modeling algorithms, including Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF). For a deeper dive into how LSA and LDA work, see scottbot blog's post. Our focus in this blog is the application of NMF to topic modeling. NMF was originally described in a paper by Lee and Seung, and has since gained popularity in the topic modeling area. NMF uses linear algebra (factorization) to decompose a document-term matrix into topics. Figure 1 shows the input matrix A, which contains n documents and m terms. NMF decomposes this into a matrix W, where n is again the number of documents and k is the number of topics, and a matrix H, where k is the number of topics and m is the number of terms.


Figure 1: NMF overview, k – number of topics, n – documents, m – terms


A simple example of this can be found in Greene's GitHub tutorial slides, along with practical Jupyter notebook examples. A simple toy example will help to illustrate the concepts further. Take matrix A in Figure 2, which consists of 6 documents and 11 terms (n x m). If we set the number of topics k to 3, A might decompose into the W and H shown in Figure 3. We can see in Figure 3 that Topic 1 seems to be about weather, as its highest-ranking terms are temperature, rain, cold and heat. Likewise, Topic 2 seems to be about cooking, as its highest terms are heat, oil, salt and garlic, whereas Topic 3 seems to be about fashion. Each cell in W and H contains a weight given to it by the model; these weights rank the documents and terms relative to each topic. Note that all the weights are non-negative values, which is where the “Non-negative” in Non-negative Matrix Factorization comes from.
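A decomposition in the spirit of Figures 2 and 3 can be reproduced with scikit-learn's NMF. The matrix values and term list below are illustrative stand-ins, not the actual numbers from the figures:

```python
# A toy NMF decomposition: 6 documents x 11 terms factorized into
# k = 3 topics. A plays the role of a TF-IDF document-term matrix;
# its values are invented for illustration.
import numpy as np
from sklearn.decomposition import NMF

terms = ["temperature", "rain", "cold", "heat", "oil", "salt",
         "garlic", "dress", "shoe", "style", "season"]

A = np.array([
    [0.8, 0.7, 0.6, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1],
    [0.7, 0.8, 0.5, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.8, 0.7, 0.9, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 0.7, 0.8, 0.6, 0.0, 0.0, 0.0, 0.1],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9, 0.7, 0.8, 0.6],
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8, 0.6, 0.9, 0.7],
])

model = NMF(n_components=3, init="nndsvd", random_state=42)
W = model.fit_transform(A)   # (6 docs x 3 topics): document-topic weights
H = model.components_        # (3 topics x 11 terms): topic-term weights

# Print the top-ranked terms for each topic
for k, row in enumerate(H):
    top = [terms[i] for i in row.argsort()[::-1][:4]]
    print(f"Topic {k}: {top}")
```

Ranking the rows of H recovers the highest-weighted terms per topic, which is exactly how the topic labels in the figures are read off.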


Figure 2: NMF matrix A, n – documents, m – terms


Figure 3: Matrix W – ranked documents on topic, H – ranked terms on topic


The domain we have applied our topic model to is a fashion/entertainment-focused online magazine. The goal is to gain insight into what type of topics readers are drawn to. Since the topic modeling algorithm does not automatically produce a label, each topic was labelled based on its top three terms.


The model was set to identify 35 topics (k=35). These topics have been ranked by average user engagement time, i.e. the average time readers spent on articles believed to belong to a given topic. Some topics are very broad, such as cooking, cosmetics, weather and health, and don't focus on one particular thing. Many of these broad topics occupy the top half of Figure 4 as they have higher user engagement times; these include diet, healthcare, skincare, haircare, makeup, cooking, parenting and weddings. At the other end of the spectrum are very narrow topics that cover particular entities, such as Cheryl Cole & Liam Payne, Ronan & Storm Keating and Stephanie Davis & Jeremy McConnell. Some entities become more important (in terms of topic word co-occurrences) than the events they participate in. For instance, Celebrity Big Brother (CBB) also appears to some extent as a topic, but because Stephanie Davis appeared on CBB and took 2nd place, the statistical model puts less weight on CBB and more on the celebrity.
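Ranking topics by average engagement can be sketched as follows, assuming each article is assigned to its strongest topic via an argmax over the rows of the document-topic matrix W. Both the matrix and the engagement times here are hypothetical:

```python
# Hypothetical sketch: rank topics by average reader engagement time.
# `doc_topic` plays the role of NMF's W matrix; the engagement times
# (in seconds) are invented for illustration.
import numpy as np

doc_topic = np.array([   # 5 documents x 3 topics
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.1, 0.9],
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
])
engagement_secs = np.array([120, 45, 90, 150, 60])

# Assign each document to its highest-weight topic
assignments = doc_topic.argmax(axis=1)

# Average engagement time per topic
avg = {k: engagement_secs[assignments == k].mean()
       for k in range(doc_topic.shape[1])}

# Topics ranked from most to least engaging
ranked = sorted(avg, key=avg.get, reverse=True)
print(ranked)
```

A soft alternative would weight each article's engagement by its full topic distribution rather than a hard argmax assignment; the post does not specify which variant was used.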


Figure 4: Average engagement time for 35 topics (Top 3 term labels)


Looking at content in such a way gives us an interesting perspective. It allows us to measure and understand whether the content a publisher writes about gets the right response from their audience. For instance, Diet & Food seems to be a topic readers care more about than Clothing or TV shows, at least in terms of user engagement. Likewise, we can compare the popularity of celebrities: for instance, the love life of Cheryl Cole & Liam Payne attracts more attention than that of Katie Price & Kieran Hayler.


Seasonal topics also appear, such as Christmas, Celebrity Jungle (I'm a Celebrity…) and X Factor, all of which occur for a limited time once a year. These topics might get a high user engagement time within a given period (e.g. people are more likely to read Christmas articles before Christmas, or read X Factor highlights after hearing X Factor gossip); however, because we are measuring user engagement across 2 years, their average user engagement is lower overall.


Observing topics over 2 years doesn't give us much detail about a topic's performance from a temporal perspective. Now that we have segmented the content and combined it with user activity, we can observe user-topic behaviour over time. Let's take a closer look at two TV show topics that appear to be popular in the same season: Celebrity Jungle and X Factor.


Figure 5: Publication count aggregated by week within their topics


Figure 6: Page view sum aggregated by week for each topic


The first figure shows the number of articles published per week for the two topics X Factor and Celebrity Jungle, whereas the second figure shows the number of pageviews for each topic. The data in each graph has been aggregated into one-week bins, with volume on the y-axis and time on the x-axis. The show start and end dates are marked with vertical lines.


These two graphs reveal some interesting information:


  • The show windows overlap each other: X Factor lasts 3.5 months (27 August 2016 – 11 December 2016), whereas Celebrity Jungle lasts only 1 month (13 November 2016 – 4 December 2016). X Factor has more articles due to its larger time window.
  • The absolute peak in pageviews per week belongs to X Factor and occurs mid-season (the week of 2016-10-31). In fact, this X Factor peak is higher than the season finale peaks of both shows in December.
  • Both graphs show an increase in activity when each show begins to air on TV; however, considering the start dates of X Factor (27 August 2016) and Celebrity Jungle (13 November 2016), Celebrity Jungle takes off faster than X Factor: the increase in pageviews (Figure 6) happens instantly for Celebrity Jungle, whereas for X Factor's first 2 weeks the pageview count is low.
  • Writing about X Factor on average gets a higher return on your investment than Celebrity Jungle. To understand why, let's look at article publications versus the pageviews they receive:
    • X Factor (27 August 2016 – 11 December 2016): 131 articles published, 540026 pageviews, giving 540026 / 131 = 4122.34 pageviews per article.
    • Celebrity Jungle (13 November 2016 – 4 December 2016): 54 articles published, 175305 pageviews, giving 175305 / 54 = 3246.39 pageviews per article.
  • Across the span of each show, X Factor performs at a better return rate of 4122.34 pageviews per article, as opposed to Celebrity Jungle's 3246.39.
  • Publishing before a show starts doesn't seem to give a comparable return on investment (by pageviews) to publishing once the show has started. This is evident just by looking at the pageview sums before and during each show: for instance, for Celebrity Jungle the page visits are low before October, then when the show starts at the beginning of November the numbers spike up. We performed a t-test for both topics, comparing pageviews before each show went live against the window when it aired. Both tests show these groups differ with a p-value < 0.001, indicating the results are highly significant.
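The return-per-article arithmetic and the significance test above can be sketched as follows. The article and pageview totals come from the post; the weekly pageview counts fed to the t-test are invented placeholders, and the post does not specify the exact t-test variant, so a Welch (unequal-variance) test is assumed here:

```python
# Pageviews-per-article ratios and a t-test comparing weekly pageviews
# before vs. during a show's run. Weekly figures are hypothetical.
from scipy import stats

# Return per article over each show's run (totals from the post)
x_factor_ratio = 540026 / 131       # X Factor
celeb_jungle_ratio = 175305 / 54    # Celebrity Jungle

# Hypothetical weekly pageview counts before and during one show's run
before = [1200, 900, 1500, 1100, 1300, 1000, 1250, 950]
during = [9500, 12000, 15500, 11000, 14000, 13500]

# Welch's t-test: assumed variant, robust to unequal variances
t_stat, p_value = stats.ttest_ind(before, during, equal_var=False)

print(round(x_factor_ratio, 2))      # 4122.34
print(round(celeb_jungle_ratio, 2))  # 3246.39
print(p_value < 0.001)               # True for this toy data
```

With the toy numbers the before/during difference is large enough that the p-value falls well below 0.001, mirroring the result reported above.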


Throughout this blog we have demonstrated how published content can be broken down into topics using a topic modeling approach. Applying such a technique can help publishers understand how readers interact with their content. Understanding how the audience interacts gives publishers the upper hand, allowing them to optimize their time and content production and minimize wasted resources. This technique is a tool that can also lead to better audience acquisition and retention.
