The Power of Named Entities
While information extraction techniques such as topic modelling help in discovering what lies hidden within your content, they offer little control over what you discover.
For example, in our previous article about topic modelling we saw how topics such as diet and healthcare could be found in our data, together with topics on specific celebrities such as Cheryl Cole. However, we had little control over what topics the topic modelling algorithm would find in the data. If we know that we are looking for a particular celebrity, such as Cheryl Cole, then we can just find all the articles that contain the exact string “Cheryl Cole”.
Instead, imagine that the requirement is to find the most mentioned people in a set of articles without knowing who they are beforehand. A more appropriate tool for this type of task is Named Entity Recognition (NER).
What are named entities?
Simply put, named entities (NEs) are real-world objects that are given a name. NER is a form of information extraction that identifies these named entities in text.
The main named entity categories are Person, Organisation and Location.
Figure 1 shows a toy example of the type of output a named entity recogniser might produce. The output includes names of people, such as Tessa Virtue, an Olympic athlete, and Theresa May, a British politician. Locations such as Pyeongchang and the United Kingdom can also be seen. Named entities are often a good approximation of the whole document. In the above example, we can identify the first two documents as being about sports and the other one as being about politics just by looking at the named entities.
How does NER work?
Think of NER as a classifier that builds on Part-of-Speech (POS) tagging, where each word in the sentence is first given a POS tag as its label. Most taggers come pre-trained on widely available labelled corpora such as the Brown corpus.
Consider the sentence:
“In 2009, Silicon Valley titans Apple and Google shared two prominent board members, including Google’s then-CEO Eric Schmidt.”
In Figure 2 we can see that each word in the above sentence is assigned its corresponding POS tag. On top of this, NER classifies the text: nouns are assigned to a corresponding entity type (which isn’t necessarily limited to Person, Organisation and Location).
The resulting entities from the above sentence are:
In 2009, Silicon Valley [Location] titans Apple [Organization] and Google [Organization] shared two prominent board members, including Google’s [Organization] then-CEO Eric Schmidt [Person].
One thing to note is that NER as described here does not support nested named entities. For instance, “Bank of Ireland” would be captured as a single Organisation entity, and “Ireland” would not also be captured as a Location.
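One common way to represent this kind of output is to tag every token as beginning (B-), inside (I-) or outside (O) an entity, and then merge the tagged tokens into spans. The sketch below is only an illustration under that assumption: the tags are written by hand for the example sentence rather than produced by a model, and the function name group_entities is hypothetical. Because each token carries exactly one tag, the flat, non-nested behaviour described above follows naturally.

```python
# Hand-labelled tokens for the example sentence (illustrative, not model output).
TAGGED = [
    ("In", "O"), ("2009", "O"), (",", "O"),
    ("Silicon", "B-Location"), ("Valley", "I-Location"),
    ("titans", "O"),
    ("Apple", "B-Organization"), ("and", "O"),
    ("Google", "B-Organization"),
    ("shared", "O"), ("two", "O"), ("prominent", "O"),
    ("board", "O"), ("members", "O"), (",", "O"), ("including", "O"),
    ("Google", "B-Organization"), ("'s", "O"), ("then-CEO", "O"),
    ("Eric", "B-Person"), ("Schmidt", "I-Person"), (".", "O"),
]


def group_entities(tagged):
    """Merge consecutive B-/I- tagged tokens into (text, type) entity spans."""
    entities = []
    tokens, etype = [], None

    def flush():
        # Close the entity currently being built, if there is one.
        if tokens:
            entities.append((" ".join(tokens), etype))

    for token, tag in tagged:
        if tag == "O":
            flush()
            tokens, etype = [], None
        elif tag.startswith("B-") or tag[2:] != etype:
            # A B- tag (or an I- tag of a different type) starts a new entity.
            flush()
            tokens, etype = [token], tag[2:]
        else:
            # An I- tag of the same type continues the current entity.
            tokens.append(token)
    flush()
    return entities


ENTITIES = group_entities(TAGGED)

# One tag per token means "Bank of Ireland" yields a single flat entity.
BANK = group_entities([
    ("Bank", "B-Organization"), ("of", "I-Organization"), ("Ireland", "I-Organization"),
])
```

Here ENTITIES holds the five entities listed above, in sentence order, and BANK holds the single Organization entity “Bank of Ireland”.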
What can NER be used for?
There are many ways in which businesses use NER. Search engines such as Google or Bing use named entities to find the most relevant content for their users. Back in 2010, internal studies by Microsoft found that 20-30% of queries submitted to search engines consisted of only a named entity, while 71% of queries contained at least one named entity. This shows that named entities have been a core component of information representation.
NER can be applied to submitted unstructured text to categorize the different content components in a collection of documents. For example, a recruiter can use named entities to build a structured profile of an organization or employee from an uploaded document.
Companies can also use NER to analyse customer support queries submitted via social media, as customers’ queries might contain key information such as locations, product names and organisation names. NER can then be used to categorize customer support tickets and route them to the most appropriate person.
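As a rough sketch of the routing idea, suppose an NER system has already returned the entities in a ticket as (text, type) pairs. The queue names, the priority order, the entity types and the function route_ticket below are all hypothetical, chosen only to illustrate the approach.

```python
# Hypothetical mapping from entity type to support queue.
ROUTES = {
    "Product": "product-support",
    "Location": "regional-office",
    "Organization": "partner-accounts",
}

# Hypothetical priority: a product mention outranks a location, and so on.
PRIORITY = ["Product", "Location", "Organization"]


def route_ticket(entities, default="general-queue"):
    """Pick a queue from the entity types found in a ticket.

    `entities` is a list of (text, type) pairs, assumed to come from
    an NER system that has already processed the ticket text.
    """
    found_types = {etype for _, etype in entities}
    for etype in PRIORITY:
        if etype in found_types:
            return ROUTES[etype]
    return default


queue = route_ticket([("Dublin", "Location"), ("WidgetPhone", "Product")])
fallback = route_ticket([])
```

A ticket that mentions both a product and a location goes to the product queue, and a ticket with no recognised entities falls back to the default queue.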
A final example is a publisher who wants to categorize documents based on a particular named entity, such as a celebrity or a location. Named entities can also be used to cross-reference articles that share named entities. This relationship can also be used to improve the recommendation of similar articles.
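Cross-referencing can be sketched with an inverted index from each entity to the articles that mention it. The article ids and entity sets below are made up for illustration, and shared_entity_links is a hypothetical helper rather than part of any library.

```python
from collections import defaultdict
from itertools import combinations


def shared_entity_links(article_entities):
    """Map each pair of article ids to the set of named entities they share."""
    # Inverted index: entity -> ids of the articles that mention it.
    index = defaultdict(set)
    for article_id, entities in article_entities.items():
        for entity in entities:
            index[entity].add(article_id)

    # Every pair of articles under the same entity is linked by that entity.
    links = defaultdict(set)
    for entity, ids in index.items():
        for a, b in combinations(sorted(ids), 2):
            links[(a, b)].add(entity)
    return dict(links)


# Hypothetical articles with entities already extracted.
ARTICLES = {
    "a1": {"Katie Taylor", "Dublin"},
    "a2": {"Katie Taylor"},
    "a3": {"Tiger Woods"},
}
LINKS = shared_entity_links(ARTICLES)
```

In this toy data only a1 and a2 are linked, through “Katie Taylor”. A recommender could rank candidate articles by the size of the shared set.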
What can you do with NER?
To demonstrate NER in action, we processed over 500k documents from RTÉ, the national public service broadcaster of the Republic of Ireland. The articles in question are from the period August 2017 – March 2018. The documents were processed using the popular open-source natural language processing toolkit MITIE. The analysis showed that the documents contained about 24k unique Person named entities (e.g. Donald Trump, Steven Spielberg and Conor McGregor). We decided to focus on the sports section to find the top sportspeople over that time period. The sports documents were further segmented by category, such as Soccer, Golf and Horse Racing, to give us an insight into which named entities stand out in each category. This reduced the number of named entities from about 24k to about 10k across 19 sports categories.
Top Sports Person Named Entities in News from Sept. 2017 to March 2018
The interactive chart in Figure 3 shows the Person named entities found in the sports documents, grouped by sport. The table shows the top 10 named entities by document occurrence, arranged in descending order. A named entity needs to appear at least twice in a document for that document to be included in its document count. The motivation for this filter is to exclude one-sentence, throwaway mentions of named entities where they are not the focus of the story. The “Reads” column in Figure 3 is the number of reads the sportsperson got over the time period analyzed. The pie chart shows each sportsperson’s proportion of the total, which is the sum of the “Reads” column.
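The document-count filter can be sketched as follows, assuming each document has already been reduced to the list of Person mentions an NER system found in it. The sample data and the function name document_counts are illustrative only and are not the code used for the analysis.

```python
from collections import Counter


def document_counts(docs_mentions, min_mentions=2):
    """Count, per entity, the documents that mention it at least `min_mentions` times."""
    counts = Counter()
    for mentions in docs_mentions:
        per_doc = Counter(mentions)
        # A document only counts for an entity that clears the threshold.
        counts.update(name for name, n in per_doc.items() if n >= min_mentions)
    return counts


# Made-up mention lists, one per document.
DOCS = [
    ["Martin O'Neill", "Martin O'Neill", "Katie Taylor"],
    ["Martin O'Neill", "Martin O'Neill", "Martin O'Neill"],
    ["Katie Taylor", "Katie Taylor"],
    ["Martin O'Neill"],
]
TOP_10 = document_counts(DOCS).most_common(10)
```

In this toy data Martin O'Neill is counted in two documents and Katie Taylor in one. The single throwaway mentions in the first and last documents are ignored.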
By further inspecting the chart we can find interesting, if unsurprising, information. The soccer section is dominated by Premier League managers, but the top named entity in the soccer section is the manager of the Republic of Ireland national team, Martin O’Neill. The dataset is highly connected to Ireland and is read by a predominantly Irish audience, so the bias towards Irish athletes is expected. Another example of this bias is that, although Ireland did not perform well (compared to other countries) in the 2018 Winter Olympics, the named entities that appear in the most documents about the Olympics, and that also get the majority of the reads, are Irish skiers.
There is not always a linear relationship between the number of documents a named entity appears in and the number of reads it gets. For instance, the golf icon Tiger Woods appears in 46 articles with 111k reads, whereas Paul Dunne, an Irish golfer, gets more reads (131k) while appearing in only half as many articles.
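The gap becomes clearer when expressed as reads per article. The short calculation below uses the rounded figures quoted above and assumes that “half as many” means exactly 23 articles for Paul Dunne, so the results are approximate.

```python
def reads_per_article(total_reads, article_count):
    """Average number of reads per article for one named entity."""
    return total_reads / article_count


# Rounded figures from the text; 23 assumes "half as many" is exactly 46 / 2.
woods = reads_per_article(111_000, 46)
dunne = reads_per_article(131_000, 23)

print(f"Tiger Woods: {woods:.0f} reads per article")
print(f"Paul Dunne:  {dunne:.0f} reads per article")
```

Under these assumptions Tiger Woods averages roughly 2.4k reads per article and Paul Dunne roughly 5.7k, more than twice as many.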
Finally, it is also important to note that not all articles in which a sportsperson is mentioned contribute equally to that sportsperson’s total in Figure 3. This is illustrated in Figure 4, which examines the composition of the total reads Irish boxer Katie Taylor received. It clearly shows that a small number of articles receive a large number of reads, such as the article “Katie Taylor is the champion of the world”, which has received over 100k visits, followed by a large number of articles that have received proportionally fewer reads. This is not obvious from the overall figure.
Number of reads per Katie Taylor article
Analyzing which types of entities prove popular among readers gives the content provider a powerful insight into user behaviour which, if used correctly, can be transformed into an effective response tool. To experiment with a Named Entity Recognition system on a text of your own choice, see here.