Natural Language Processing, Artificial Intelligence and Machine Learning are changing how content is discovered, analyzed and shared online. More recently, there has been a push to harness the power of Text Analytics to help understand and distribute content at scale. This is particularly evident in the popularity of recommendation engines and intelligent content analysis technologies like Outbrain and Taboola, which now have a presence on most content-focused sites.
Intelligent software and technological advancements allow machines to understand content as a human would. When we read a piece of text, we make certain observations about it. We understand what it’s about, we notice mentions of companies, people, places, we understand concepts present in it, we’re able to categorize it and if needed we could easily summarize it. All because we understand it and we can process it.
Text Analysis and NLP techniques allow machines to do just that: understand and process text. The main difference is that machines can work a lot faster than us humans.
But just how far can machines go in understanding a piece of content?
We won’t dwell too much on how the process works in this post. Instead, we’ll focus on what level of understanding a machine can extract from text by using the following news article as an example:
If you want to read more about how the process works and the different approaches to Text Analysis, you can download our Text Analysis 101 ebook here.
When we read a news article, we might want to know what concepts it deals with, whether it mentions people, places, dates etc. We might need to determine whether it’s positive or negative, what the author’s intent is and whether it’s written subjectively or objectively. We do this almost subconsciously when we read text. NLP and AI allow machines to somewhat mimic this process by extracting certain things from text like Keywords, Entities, Concepts and Sentiment.
Content often contains mentions of people, locations, products, organizations etc. which we collectively call Named Entities. They can also contain values such as links, telephone numbers, email addresses, currency amounts, percentages and so on. Using statistical analysis and machine learning methods, these entities can be recognized and extracted from text as shown below.
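As a rough illustration of the value-type entities mentioned above (links, email addresses, currency amounts, percentages), here is a minimal sketch using regular expressions. This is a deliberately simpler technique than the statistical and machine learning methods described in this post, and the patterns, the `extract_values` helper and the sample sentence are all assumptions made up for this example.

```python
import re

# Illustrative patterns only; a real extractor would handle many more formats.
VALUE_PATTERNS = {
    "url": re.compile(r"https?://[^\s,]+"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "percentage": re.compile(r"\d+(?:\.\d+)?%"),
    "currency": re.compile(r"[$€£]\d+(?:,\d{3})*(?:\.\d+)?"),
}


def extract_values(text):
    """Return a dict mapping each value type to the list of matches in text."""
    return {name: pattern.findall(text) for name, pattern in VALUE_PATTERNS.items()}


# Hypothetical sample sentence, not taken from the article.
sample = (
    "Sales rose 12.5% to $1,200,000. Contact press@example.com "
    "or see https://example.com/report for details."
)
found = extract_values(sample)
for value_type, matches in found.items():
    print(value_type, matches)
```

Recognizing people, places and organizations is a harder problem than this, because those mentions do not follow a fixed surface pattern.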
Sometimes you wish to find entities and concepts based on information that exists in an external knowledge base, such as Wikipedia. By looking beyond statistical analysis methods and using a Linked Data-aware process, machines can extract concepts from text. This allows for a greater understanding of the topics present in text.
Extracting concepts is a more intelligent and more accurate approach which gives a deeper understanding of text. These methods of analysis also allow machines to disambiguate terms and make decisions on how they interpret text, such as whether a mention of “apple” refers to the company or the fruit, as in the example results below.
All of this information that we can glean from text showcases how machines understand it. However, it doesn’t need to stop there. With all of this information, it’s possible to go a step further and start categorizing or classifying text.
Based on the meta-category of an article, we can easily understand what a piece of content is about. As shown in the results below from our sample article, classifying text makes it far easier to understand, at a high level, what a piece of content is about.
Classifying text makes it far easier to manage and sort large numbers of articles or documents without the need for human analysis, which is often time-consuming and inefficient.
Machines can even go as far as interpreting an author’s intent from analyzing text. Using modern NLP techniques, machines can determine whether a piece of text is written subjectively or objectively, and whether it’s positive, negative or neutral.
It’s also possible to process the insights to create something more, like a summarization of content for example.
Sometimes articles and documents are just too long to consume. That’s where intelligent services like the automatic summarization of text can help. After analyzing a piece of content, it’s possible for machines to extract the key sentences or points conveyed and to display them as a consumable summary, like the one below.
These are some straightforward examples of the information machines can extract from text as part of the content analysis process. Our next post will deal with what exactly can be done with the information gleaned from text. We’ll look at real-life use cases and examples of how automated Text Analysis is shaping how we deal with content online.
The automatic classification of documents is an example of how Machine Learning (ML) and Natural Language Processing (NLP) can be leveraged to enable machines to better understand human language. By classifying text, we are aiming to assign one or more classes or categories to a document or piece of text, making it easier to manage and sort the documents. Manually categorizing and grouping text sources can be extremely laborious and time-consuming, especially for publishers, news sites, blogs or anyone who deals with a lot of content.
Broadly speaking, there are two classes of ML techniques: supervised and unsupervised. In supervised methods, a model is created based on previous observations, i.e. a training set. In the case of document classification, categories are predefined and a training dataset of documents is manually tagged as part of a category. Following the creation of a training dataset, a classifier is trained on the manually tagged dataset. The idea is that the classifier will then be able to predict any given document’s category from then on.
Unsupervised ML techniques differ because they do not require a training dataset and, in the case of documents, the categories are not known in advance. Unsupervised techniques such as Clustering and Topic Modelling are used to automatically discover groups of similar documents within a collection of documents. In this blog, we are going to concentrate on supervised methods of classification.
What a Classifier does
Classifiers make “predictions”; that is their job. In layman’s terms, when a classifier is fed a new document to classify, it makes a prediction that the document belongs to a particular class or category and often returns or assigns a category “label” for the document. Depending on the classification algorithm or strategy used, the classifier might also provide a confidence measure to indicate how confident it is that the classification label is correct. To explain how a classifier works, it is probably best to illustrate with a simple example.
How a Classifier works
As we mentioned, classification is about prediction. Take a simple example of predicting whether or not a football game will go ahead to illustrate how this works. First we want to create a dataset. To do this, we would track the outside temperature and whether or not it rained on any given game night over the course of a year to build up a dataset of weather conditions. We could then “tag” this dataset with information about whether or not the game went ahead to create a training dataset for future predictions.
In this case, we have two “features”, temperature and rain, to help us predict whether the game will be played or not, as illustrated in the table below. On any new match night, we could then reference our table and use it to help us predict whether or not a game would go ahead. In this simple case, if the temperature is below zero and it is raining (or snowing!) then there is a good chance that the game will be cancelled.
| Temp (Degrees C) | Rain | Play? |
In the table above, the “Temp” and “Rain” columns are the “features”, the “Play?” column is referred to as the “class” or “label”, and the rows are called “instances”. These instances can be thought of as data points, which could be represented as a vector, as shown below:
<feature1, feature2,…, featureN>
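To make the football example concrete, here is a minimal sketch in Python. The training instances are hypothetical values invented for illustration (they are not the rows of the table above), and the nearest-neighbour lookup with its mismatch penalty is just one simple way to “reference our table”. It is an assumption of this sketch, not a method prescribed by this post.

```python
# Hypothetical training instances: <temperature in °C, rain?> plus a label.
# These values are invented for illustration only.
training_set = [
    ((18.0, False), "play"),
    ((12.0, True), "play"),
    ((3.0, False), "play"),
    ((-1.0, False), "play"),
    ((-2.0, True), "cancel"),
    ((-5.0, True), "cancel"),
]


def distance(a, b):
    """Distance between two feature vectors of the form <temperature, rain>."""
    temperature_gap = abs(a[0] - b[0])
    rain_gap = 10.0 if a[1] != b[1] else 0.0  # arbitrary penalty for a rain mismatch
    return temperature_gap + rain_gap


def predict(features):
    """Return the label of the closest training instance (1-nearest-neighbour)."""
    _, label = min(training_set, key=lambda instance: distance(instance[0], features))
    return label


print(predict((-3.0, True)))   # cold and raining
print(predict((15.0, False)))  # mild and dry
```

Each instance is a feature vector paired with its class label, which is exactly the shape a supervised classifier is trained on.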
A simple Illustration of Document Classification
If we apply a similar methodology to documents we can use the words within a document as the “features” to help us predict the classification of the document. Again, using a simple example:
In this example, we have three very short documents in our training set as shown below:
| Reference Document Class 1 | Reference Document Class 2 | Reference Document Class 3 |
| Some tigers live in the zoo | Green is a color | Go to New York city |
We would start by taking all of the words across the three documents in our training set and creating a table or vector from these words.
Then for each of the training documents, we would create a vector by assigning a 1 if the word exists in the training document and a 0 if it doesn’t, tagging the document with the appropriate class as follows.
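The two steps above can be sketched in a few lines of Python. The three training documents are the ones from this example; the lower-casing, whitespace tokenization and the helper names (`tokenize`, `to_vector`) are assumptions of this sketch.

```python
# The three training documents from the example, keyed by their class.
training_docs = {
    "class 1": "Some tigers live in the zoo",
    "class 2": "Green is a color",
    "class 3": "Go to New York city",
}


def tokenize(text):
    """Lower-case the text and split it on whitespace."""
    return text.lower().split()


# Build the vocabulary: every distinct word, in first-seen order.
vocabulary = []
for text in training_docs.values():
    for word in tokenize(text):
        if word not in vocabulary:
            vocabulary.append(word)


def to_vector(text):
    """Mark 1 for each vocabulary word present in the text, 0 otherwise."""
    words = set(tokenize(text))
    return [1 if term in words else 0 for term in vocabulary]


class_vectors = {label: to_vector(text) for label, text in training_docs.items()}

print(vocabulary)
for label, vector in class_vectors.items():
    print(label, vector)
```

Words that are not in the vocabulary, such as “orange” in a new document, are simply ignored by `to_vector`.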
When a new untagged document arrives for classification and it contains the words “Orange is a color” we would create a word vector for it by marking the words which exist in our classification vector.
If we then compare this vector for the document of unknown class to the vectors representing our three document classes, we would see that it most closely resembles the vector for class 2 documents.
Comparison of the unknown document class with class 1 (6 matching terms)
< 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 > class 1
< 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0 > Unknown class
Comparison of the unknown document class with class 2 (14 matching terms – winner!!)
< 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0 > class 2
< 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0 > Unknown class
Comparison of the unknown document class with class 3 (7 matching terms)
< 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1 > class 3
< 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0 > Unknown class
It is then possible to label the new document as a class 2 document with an adequate degree of confidence. This is a very simple but common example of a statistical Natural Language Processing method.
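The comparison can be reproduced with a short Python sketch. The vectors are copied from the example; a “match” here means a position where both vectors hold the same value, whether that is 0 or 1, which is how the counts of 6, 14 and 7 arise. The function name `matching_positions` is an assumption of this sketch.

```python
# Class vectors and the unknown document's vector, copied from the example.
class_vectors = {
    "class 1": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    "class 2": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0],
    "class 3": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
}
unknown = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]


def matching_positions(a, b):
    """Count the positions at which two equal-length vectors agree."""
    return sum(1 for x, y in zip(a, b) if x == y)


scores = {label: matching_positions(vector, unknown)
          for label, vector in class_vectors.items()}
best_label = max(scores, key=scores.get)

print(scores)
print(best_label)
```

Counting shared zeros as matches keeps the example simple, but it also rewards documents for the words they both lack, which is one reason real classifiers use more refined measures.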
A more detailed look at real world document classification
A real world classifier has three components to it and we will look at each of these components individually to explain in a little bit more detail how a classifier works.
1. The dataset
As we demonstrated above, a statistical method of classification requires a collection of documents which have been manually tagged with their appropriate category. The quality of this dataset is by far the most important component of a statistical NLP classifier.
The dataset needs to be large enough to have an adequate number of documents in each class. For example, if you wished to classify documents into 500 possible categories, you may require 100 documents per category, so a total of at least 50,000 documents would be required.
The dataset also needs to be of a high enough quality in terms of how distinct the documents in the different categories are from each other to allow clear delineation between the categories.
2. Preprocessing
In our simple examples, we have given equal importance to each and every word when creating document vectors. We could do some preprocessing and decide to give different weighting to words based on their importance to the document in question. A common methodology used to do this is TF-IDF (term frequency – inverse document frequency). The TF-IDF weighting for a word increases with the number of times the word appears in the document but decreases based on how frequently the word appears across the entire document set. This has the effect of giving a lower overall weighting to words which occur more frequently in the document set, such as “a”, “it”, etc.
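As a hedged illustration of TF-IDF, the sketch below uses one common variant: term frequency is the word’s share of the document, and inverse document frequency is the natural log of the number of documents divided by the number of documents containing the word. Other variants exist, and the mini-corpus here is hypothetical, invented so that words like “a” and “is” appear in every document.

```python
import math
from collections import Counter

# Hypothetical mini-corpus, invented for illustration.
documents = [
    "a tiger is a big cat",
    "green is a color",
    "a zoo is a place",
]
tokenized = [doc.split() for doc in documents]

# Document frequency: the number of documents each word appears in.
document_frequency = Counter(word for tokens in tokenized for word in set(tokens))


def tf_idf(tokens):
    """Weight each word in one document by tf * idf (one common variant)."""
    counts = Counter(tokens)
    total_words = len(tokens)
    total_docs = len(tokenized)
    return {
        word: (count / total_words) * math.log(total_docs / document_frequency[word])
        for word, count in counts.items()
    }


weights = [tf_idf(tokens) for tokens in tokenized]
for doc, doc_weights in zip(documents, weights):
    print(doc, doc_weights)
```

In this corpus “a” and “is” occur in every document, so their weight drops to zero, while a word such as “tiger” that occurs in only one document keeps a positive weight.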
3. Classification Algorithm and Strategy
In our example above, the algorithm we used to classify our document was very simple. We classified the document by comparing the number of matching terms in the document vectors to see which class it most closely resembled. In reality, we may be placing documents into more than one category type and we may also be assigning multiple labels to a document within a given category type. We may also have a hierarchical structure in our taxonomy, and therefore require a classifier that takes that into account.
For example, using IPTC International Subject News Codes to assign labels, we may give a document two labels simultaneously, such as “sports event – World Cup” and “sport – soccer”, with “sport” and “sports event” being the root categories and “soccer” and “World Cup” being the child categories.
There are numerous algorithms used in classification, such as Support Vector Machines (SVMs), Naive Bayes and Decision Trees, the details of which are beyond the scope of this blog.
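One way to picture a document carrying several hierarchical labels is the small data structure below. The `Label` class and `roots` helper are hypothetical names used only for this sketch, and the label strings are the ones from the example above rather than actual IPTC code values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    """A two-level label: a root category and one of its child categories."""
    root: str
    child: str

    def __str__(self):
        return f"{self.root} - {self.child}"


# A single document tagged with two labels at once, as in the example.
document_labels = [
    Label(root="sports event", child="World Cup"),
    Label(root="sport", child="soccer"),
]


def roots(labels):
    """Return the distinct root categories of a set of labels, sorted."""
    return sorted({label.root for label in labels})


print([str(label) for label in document_labels])
print(roots(document_labels))
```

Keeping the root and child separate lets a classifier, or whatever consumes its output, work at either level of the taxonomy.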
We hope that you now have a better understanding of the basics of document classification and how it works. As a recap, in supervised methods, a model is created based on a training set. A classifier is then trained on this manually tagged training dataset and is expected to predict any given document’s category from then on. The biggest factor affecting the quality of these predictions is the quality of the training data set. Keep an eye on our blog for more in the “Text Analysis 101” series.
Social media and online publishing have opened up channels of mass communication to everyone. Now individuals, as well as organizations, can use publishing techniques to persuade, build influence and spread ideas by sharing and distributing their content online. “Content is king” online today. However, “if you build it they will come” doesn’t always apply.
Content Discovery and Distribution
We recently spoke with Michael Schwartz, the founding partner of WebWire, which distributes business, organizational and personal news releases and press releases over the Internet. Michael told us about the four things he considers to be most important when distributing news releases.
- Create well-written, informative content
- Reach interested readers
- Reach appropriate influencers
- Reach appropriate members of the professional media
Michael also spoke about a common trend: “In the good old days (before the Internet), a media directory would generally guide the marketing professional to the appropriate target, and a room full of people with printed publications, scissors and tape would create effectiveness reports….those days are over.”
Traditionally, the discovery, tagging and distribution of an article or piece of content would often have been carried out by a PR professional or marketer. In recent times, as machines and software get smarter and the sheer volume of content out there continues to grow, parts of the process can be automated using technology. In order to match a human’s work, however, machines need to be able to understand and categorize content effectively. That’s where Text Analytics and Natural Language Processing techniques come into play.
How Text Analysis enables “modern day” distribution
Creating good content that is well written, informative and engaging is central to attracting and retaining your audience’s attention. Just as “beauty is in the eye of the beholder”, content needs to be well written, informative, relevant and engaging from the viewpoint of the audience. While Text Analytics isn’t pivotal in the content creation process, elements of Text Analytics can be incorporated into the process. For example, discovering trends and topics and identifying what is attracting engagement online can help you create relevant, informative pieces. Essentially, writing good content comes down to knowledge and expertise in a certain field or area, and this is particularly difficult to automate.
Once you have created an interesting, relevant and informative piece of content do you just sit back and hope it gets discovered organically?
To reach interested readers, you first need to identify who your audience is and meet them where they congregate. Today we don’t search for content; we expect it to be pushed to us through News Apps, Twitter, Facebook and so on. Analyzing content and being able to automatically extract concepts and topics from content and articles shared and distributed online allows us to identify where our target audience resides.
Text Analysis also allows us to prep our content for maximum discovery and exposure. One effective example of how this can be achieved is through the use of hashtags when sharing content on your social media sites. Being able to understand text and extract topics and concepts automatically means we can also ensure they are distributed appropriately for maximum exposure on social channels.
While you may distribute your content in the right place, getting it to stand out in an extremely crowded space online can be quite difficult. Traditionally, relationships with the right journalists or individuals helped in this regard. Today, we have new targets to help with amplification in the form of influencers.
Influencers online have generally built up a reputation as trusted and knowledgeable sources of information in a particular subject area. People see them as thought leaders and rely on them as a source of content or informal advice in some ways. An influencer picking up on and sharing your content doesn’t always happen organically. Identifying appropriate influencers to target with your content was traditionally about relationships. Today, technology has made the identification of and access to these influencers somewhat easier. By effectively analyzing content, we can match content to appropriate influencers, based on the interests, keywords, entities, topics and concepts they write about.
That isn’t to say there isn’t a place for utilizing journalists to increase exposure.
Similarly, reaching appropriate members of the professional media can be achieved by matching articles to individual journalists’ areas of interest and writing style. It is vitally important that the matching process is accurate, as sending someone content they have no interest in is technically spam. Being able to analyze opinions can also help us be a lot more targeted. Through Sentiment Analysis of content, we can understand a writer’s opinion and target them with appropriate releases. For example, a journalist who writes about technology but is of the opinion that Android trumps iOS isn’t going to want to publish a piece on how great the new iPhone 6 is.
The internet has fundamentally changed how we communicate with each other. Social Media sites, blogs, forums and mainstream media sites proliferate and rise and fall in popularity. Keeping track of the most appropriate outlets for any given piece of content is an increasingly important, difficult and time-consuming task. The ability to automate many parts of the process allows for large volumes of content to be matched accurately with the most appropriate audiences. Text Analysis can reliably aid traditional distribution tactics and processes, but it remains to be seen whether technology will trump human relationships with the right people, built over time.