As part of our blog series, ‘Text Analysis 101: a basic understanding for business users’, we will aim to explain how Text Analysis and Natural Language Processing work from a non-technical point of view.
For the first installment, we are going to discover how text is understood by machines, what methods are used in text analysis and why Entity and Concept extraction techniques are so important in the process.
Text Analysis refers to the process of retrieving high-quality information from text. It involves Information Retrieval (IR) processes and lexical analysis techniques to study word frequency, distributions and patterns, and utilizes information extraction and association analysis to attempt to understand text. The main goal of Text Analysis as a practice is to turn text into data for further analysis, whether that is from a business intelligence, research, data analytics or investigative perspective. Certain aspects of text can be identified with modern techniques, and these allow machines to understand a document, article or piece of text.
Technological advancements, greater computing power and investment in research have meant that Natural Language Processing techniques have evolved, performance has improved and adoption across the business world has grown dramatically. According to Alta Plana’s latest report, “Text Analytics 2014: User Perspectives on Solutions and Providers”, the Text Analytics market now has an estimated value exceeding $2bn.
Traditionally, NLP techniques focused on words. These techniques relied on statistical algorithms to analyze and attempt to understand text. However, there has been a push in recent times to equip machines with the capabilities to not just analyze, but to “understand” text. There are numerous approaches to the problem, some more popular and more accurate than others.
Document Representation Models – Bag of Words and Bag of Concepts
Traditionally, analysis systems were focused on words and they failed to identify concepts when attempting to understand text. The diagram below outlines how, as we move up the pyramid and consider concepts in our analysis, we move closer to machines extracting meaning from text.
Bag of Words
The bag-of-words model is a representation that has been traditionally used in NLP and IR. In this model, all grammar, sentence structure and word order are disregarded, and a piece of text, a document or a sentence is represented as a “bag of words”. The collection of words can then be analyzed using a document-term matrix for occurrences of certain words, in order to better understand the document based on its most representative terms. While analyzing words is somewhat successful, a greater focus on concepts within text has proven to increase a machine’s overall understanding of text.
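To make the idea concrete, here is a minimal, illustrative sketch of how a bag-of-words representation and a document-term matrix might be built. The `bag_of_words` function and the toy documents are our own examples, not part of any particular product; real systems would also handle punctuation, stemming and stop words.

```python
from collections import Counter

def bag_of_words(documents):
    """Represent each document as word counts, ignoring grammar and word order."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    # Each row of the document-term matrix counts one document's words.
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

docs = ["the cat sat on the mat", "the dog sat"]
vocab, matrix = bag_of_words(docs)
print(vocab)   # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(matrix)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

Notice that once the text is in this form, the original sentence structure is gone: the matrix only records how often each word appears in each document.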
Bag of Concepts
Looking beyond just the words on the surface of a document can provide context to improve a computer’s understanding of text. As demonstrated in the pyramid above, analyzing the words alone can be seen as a base-level analysis, while considering concepts as part of the analysis goes a step further to improve overall understanding.
A concept-based approach can provide greater insight because it does not rely on the words alone. By combining both the BoW and BoC approaches to understanding text, performance and accuracy can be greatly improved. This is especially true when we are dealing with a somewhat lesser-known sample of text.
To move towards more of a concept-based model of Text Analysis we need to be able to identify entities and concepts within a text. In order to understand how this is done, it’s important to discuss what entities and concepts are and how we identify and utilize them from an analysis point of view.
An entity is something that exists in itself, a thing with distinct or independent existence.
A concept can be defined as an abstract or generic idea generalized from particular instances.
But how can machines recognize entities and concepts in text?
Named Entity Recognition (NER)
Also known as Entity Extraction, NER aims to automatically locate and classify elements of text into predefined categories such as the names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc. The NER approach uses linguistic grammar-based techniques, statistical modelling techniques, or both to identify and extract entities from text.
Consider the following piece of text as an example:
“Michael loved the Apple iPhone. He always admired Steve Jobs, but he couldn’t justify spending over $500 on a new phone.”
Using NER certain mentions of Entities can be identified in a sentence or entire piece of text, as is highlighted below:
“Michael [Person] loved the Apple [Organization] iPhone. He always admired Steve Jobs [Person], but he couldn’t justify spending over $500 [Money] on a new phone.”
It isn’t always possible, however, to identify entities in a piece of text using NER exclusively. Written language isn’t always exact and trying to understand a piece of text without considering the context can lead to inaccuracies. Is a mention of Apple referring to the company, the fruit or even the artist Billy Apple? That is where disambiguation and concepts can add more clarity and accuracy to the analysis process.
Named Entity Disambiguation (NED)
Named entity disambiguation can be used to identify and extract concepts from text. Its approach to the problem differs from NER in that it doesn’t rely on grammar or statistics. Also known as entity linking, NED utilizes a knowledge base as a reference to identify entities. This could be a public knowledge base, like Wikipedia, or a training text, which is often domain specific.
The process is outlined simply below:
Step 1. Spotting: looking for surface forms like “apple” (the sequence of the letters a-p-p-l-e)
Step 2. Candidate generation: identifying potential candidates, e.g. Apple Inc., Apple (the fruit), Billy Apple, etc.
Step 3. Disambiguation: referencing a knowledge base and considering the context to identify a concept.
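The three steps above can be sketched in miniature. The knowledge base and its context words below are invented for illustration; a real NED system would draw candidates and context signals from a resource like Wikipedia rather than a hand-written dictionary.

```python
# A toy knowledge base mapping candidate entities to words that typically
# appear around them; real systems use resources like Wikipedia instead.
KNOWLEDGE_BASE = {
    "Apple Inc.": {"iphone", "mac", "company", "jobs", "technology"},
    "Apple (fruit)": {"eat", "tree", "juice", "pie", "orchard"},
    "Billy Apple": {"artist", "art", "gallery", "exhibition"},
}

def disambiguate(surface_form, context):
    """Steps 2-3: generate candidates, then score each by context overlap."""
    context_words = set(context.lower().split())
    # Step 2: candidate generation - entries whose name contains the surface form.
    candidates = [c for c in KNOWLEDGE_BASE if surface_form.lower() in c.lower()]
    # Step 3: disambiguation - pick the candidate sharing the most context words.
    return max(candidates, key=lambda c: len(KNOWLEDGE_BASE[c] & context_words))

print(disambiguate("apple", "Michael loved the Apple iPhone and his new Mac"))
# Apple Inc.
```

Because the surrounding words “iphone” and “mac” overlap with the Apple Inc. entry, the company wins; in a sentence about pie and orchards, the fruit would win instead.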
Entities vs Concepts
It is often best to identify and extract both named entities and concepts in order to fully understand a piece of text. Entities may be common, well known and easy to identify, but there may also be concepts within your text that would be overlooked without the disambiguation process.
Identifying concepts does have some advantages over only considering entities as part of the entire analysis process. By referring to a knowledge base, like Wikipedia, further information about a concept can be identified and utilized. For example, in an article that mentions Steve Jobs, iPhone, Mac and Palo Alto but not “apple”, based on the information sourced in your knowledge base, you could still identify “apple” as a concept.
Concepts can also be used to pull additional information and insights from a knowledge base, providing an automated and straightforward way to enhance and augment any document. For instance, for every concept of type “place”, a map of that place could be added to the document, knowing the place’s exact latitude and longitude.
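As a small sketch of that enrichment idea: once a concept of type “place” has been identified, its coordinates can be looked up and a map link attached to the document. The `PLACES` table and the OpenStreetMap link format below are our own illustrative choices (with approximate coordinates), not part of any specific product.

```python
# Hypothetical concept metadata, as might be pulled from a knowledge base.
# Coordinates are approximate and for illustration only.
PLACES = {"Palo Alto": {"lat": 37.4419, "lon": -122.1430}}

def enrich(concepts):
    """Attach a map link to every extracted concept that is a known place."""
    enriched = []
    for name in concepts:
        entry = {"concept": name}
        if name in PLACES:
            coords = PLACES[name]
            entry["map"] = (f"https://www.openstreetmap.org/"
                            f"?mlat={coords['lat']}&mlon={coords['lon']}")
        enriched.append(entry)
    return enriched

print(enrich(["Palo Alto", "Steve Jobs"]))
```

Only the place concept gains a map link; other concepts pass through unchanged, so the enrichment step is safe to run on every extracted concept.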
Being able to identify Entities and Concepts means key aspects can be identified and extracted from documents, articles, emails, etc., which allows machines to provide greater analysis and enhancement capabilities and a deeper understanding of text.
Our next blog in the series will focus on how text is classified and summarized automatically.