Text Analysis, the Unsung Hero of News Aggregation
Online news aggregation services are sites that let you view content from online newspapers, media outlets, blogs and other sources in one place. They let you filter the news you receive by category, topic, keyword, date and outlet: for example, technology, food or entertainment stories from the last day, week or year. They save us a lot of time and hassle, as they allow us to receive the news we want without having to hop from site to site to read updates from our favourite authors or follow the topics we’re most interested in. In other words, they provide a consolidated space with the latest news and updates from many different news sources.
News Aggregators have been around for quite a while. The most progressive apps are the ones that learn and keep track of our likes and interests, in order to uncover and suggest relevant and new content which we may not have been aware of.
How does News Aggregation work?
Generally a content provider will publish their content via a feed link, which News Aggregators will subscribe to. From then on the aggregator will be informed when new content is available. These feeds from the various news sources are commonly referred to as RSS and/or Atom feeds.
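As a rough illustration of the feed step, the sketch below parses a small RSS 2.0 document with Python's standard library and picks out the items an aggregator has not seen before. The sample feed, the example.com URLs and the function names parse_rss_items and new_items are made up for this example; a real aggregator would fetch the feed over HTTP and would also need to handle Atom and malformed feeds.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 feed standing in for a publisher's real feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>https://example.com/first</link>
      <pubDate>Mon, 01 Jan 2024 09:00:00 GMT</pubDate>
    </item>
    <item>
      <title>Second story</title>
      <link>https://example.com/second</link>
      <pubDate>Mon, 01 Jan 2024 10:00:00 GMT</pubDate>
    </item>
  </channel>
</rss>"""


def parse_rss_items(xml_text):
    """Return a list of dicts (title, link, published) for each <item>."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "published": item.findtext("pubDate", default=""),
        })
    return items


def new_items(items, seen_links):
    """Keep only the items whose link has not been seen before."""
    return [item for item in items if item["link"] not in seen_links]


if __name__ == "__main__":
    seen = {"https://example.com/first"}
    for item in new_items(parse_rss_items(SAMPLE_RSS), seen):
        print(item["title"], item["link"])
```

Tracking seen links is only one simple way to spot new content; it is an assumption of this sketch rather than a description of how any particular aggregator works.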
To analyze and attempt to understand this content, machines often rely on elements of Natural Language Processing and Text Analysis.
Why the need for Text Analysis?
If you consider that a news aggregator might be dealing with more than 50,000 articles per day, you can quickly see why being able to analyze content automatically is an essential part of the process. Even if we allowed two minutes for each article to be read and classified by a human (which is ridiculously fast), it would take almost 70 days of nonstop work to get through the 50K articles. Clearly, then, this is a task for machines, and in particular for Machine Learning and Natural Language Processing in the form of Text Analysis.
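The 70-day figure can be checked with a few lines of arithmetic. The function name days_of_manual_work is just a label chosen for this sketch.

```python
def days_of_manual_work(articles, minutes_per_article):
    """Days of nonstop (24-hour) work needed to process the articles."""
    total_minutes = articles * minutes_per_article
    minutes_per_day = 60 * 24
    return total_minutes / minutes_per_day


if __name__ == "__main__":
    days = days_of_manual_work(articles=50_000, minutes_per_article=2)
    print(f"{days:.1f} days")  # 100,000 minutes / 1,440 minutes per day
```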
What does the Text Analysis Process for News Aggregation look like?
One of the first tasks to be completed by a text analysis engine when presented with an article or URL is to strip away the clutter and extract the main text and media. This process is generally referred to as Article Extraction. The story may then be processed further to summarize the article in some way. This can be useful when presenting stories for consumption, as readers will generally spend less than 3 seconds deciding whether or not to click on an article and give it further attention.
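To make the idea of stripping away clutter concrete, here is a deliberately naive sketch that keeps the text of paragraph tags and ignores anything inside navigation, header, footer, sidebar, script and style elements. The class name ParagraphExtractor, the function extract_main_text and the list of tags to skip are assumptions made for this example; production Article Extraction relies on much more robust heuristics or trained models.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects text inside <p> tags, ignoring common clutter containers."""

    SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside a clutter container
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace left over from the markup.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self._buffer.append(data)


def extract_main_text(html):
    """Return the article paragraphs joined by newlines."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)
```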
Classification based on Named Entity Extraction
Once the text of an article has been extracted, it is passed to another part of the analysis engine for classification, i.e. to determine whether the story is about Technology, Arts, Entertainment, Business, Finance, etc. Classification is partly achieved by extracting named entities such as people, places and organisations, along with keywords, dates, Twitter handles and so on, from the article. Proper categorization is critical, as an article is only valuable if its audience can find it. Classification tags an article with metadata from up to 500 categories, which conform to the IPTC NewsCodes taxonomy.
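A toy version of keyword-driven classification might look like the sketch below. The three categories and their keyword sets are invented for illustration and are not taken from the IPTC taxonomy; a real engine would use extracted entities and a trained classifier rather than a hand-written lookup.

```python
import re
from collections import Counter

# Hypothetical keyword sets; a real taxonomy would be far larger.
CATEGORY_KEYWORDS = {
    "technology": {"software", "smartphone", "startup", "app", "internet"},
    "finance": {"bank", "bond", "stocks", "inflation", "debt"},
    "entertainment": {"film", "album", "actor", "festival", "television"},
}


def classify(text):
    """Return the category whose keywords occur most often, or None."""
    words = Counter(re.findall(r"[a-z']+", text.lower()))
    scores = {
        category: sum(words[keyword] for keyword in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(classify("The central bank raised rates as inflation and public debt grew."))
```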
Further Classification and Organisation based on Concept Extraction
More sophisticated analysis engines can also extract concepts from an article. Focusing on concepts when analyzing text, rather than relying purely on keywords, results in better tagging and allows aggregation services to cluster similar stories together. For example, an article on the current state of the Japanese economy, when passed to AYLIEN’s Concept Extraction API endpoint, yielded the concepts “deflation”, “public debt”, “sales tax”, “economist”, “gross domestic product”, “abenomics”, “moody’s analytics” and “bond”, among others.
A key feature of a concept extraction system is the ability to provide what is called “word sense disambiguation”, i.e. the ability to realise that in a tech article that mentions Steve Jobs, the word “Apple” is more likely to refer to the company than the fruit! Extracting concepts can also be coupled with Topic Modelling and Clustering, which allow the reader to follow stories as they progress through time and allow the system to uncover and present similar stories while removing duplicate or near-duplicate articles.
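One common way to flag near-duplicate articles is to compare overlapping word sequences (shingles) using Jaccard similarity. The sketch below illustrates that general technique, not the method any specific engine uses; the shingle size of 3 and the 0.6 threshold are arbitrary choices for the example.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = text.lower().split()
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def is_near_duplicate(text_a, text_b, threshold=0.6):
    """True when the two texts share most of their word shingles."""
    return jaccard(shingles(text_a), shingles(text_b)) >= threshold


if __name__ == "__main__":
    a = "the bank of japan kept interest rates unchanged on tuesday"
    b = "the bank of japan kept interest rates unchanged on wednesday"
    print(is_near_duplicate(a, b))
```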
Going a level deeper in understanding text
Context is key to the way News Aggregators provide relevant content. We chatted with Drew Curtis, the creator of Fark.com, who had this to say about the current state of News Aggregators: “They’re looking at article content only, I’m arguing that the next level, is taking content that people maybe don’t care as much about and adding context to make them care.” Drew also gave us a nice example, Net Neutrality: “it’s been around for years, but only recently did anyone figure out how to make the average person care about it.”
More sophisticated analysis engines can extract intent and high-level concepts from an article, which, when combined, allow you to add context and to better understand the story, not just an article. Traditional news aggregators only go so deep in attempting to understand text, pushing content to readers based on topics or keywords. The next wave of news aggregators needs to “understand” and distribute content, not just “tag” and distribute content.
Sentiment Analysis of content can also help add context. It allows machines to detect the tone of a text: whether it’s positive or negative, subjective or objective. Keeping track of the types of articles (sentiment, topics, categories) that a reader consumes, shares or upvotes allows a system to learn about a reader’s preferences and present articles that are more and more in tune with the reader’s tastes.
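As a minimal sketch of how such preference tracking could work, the class below counts the tags (topics, categories, sentiment labels) on articles a reader interacts with and ranks new articles by how well their tags match. The ReaderProfile class, its method names and the example tags are all hypothetical; real recommenders weigh many more signals.

```python
from collections import Counter


class ReaderProfile:
    """Counts the tags on articles a reader has read, shared or upvoted."""

    def __init__(self):
        self.tag_counts = Counter()

    def record_interaction(self, tags, weight=1):
        """Add an article's tags to the profile (weight could favour shares)."""
        for tag in tags:
            self.tag_counts[tag] += weight

    def score(self, tags):
        """Higher scores mean the tags match past interactions more often."""
        return sum(self.tag_counts[tag] for tag in tags)

    def rank(self, articles):
        """Sort articles (dicts with a 'tags' list) from best to worst match."""
        return sorted(articles, key=lambda a: self.score(a["tags"]), reverse=True)


if __name__ == "__main__":
    profile = ReaderProfile()
    profile.record_interaction(["technology", "positive"])
    profile.record_interaction(["technology", "negative"])
    profile.record_interaction(["finance", "positive"])
    candidates = [
        {"title": "Bond markets wobble", "tags": ["finance", "negative"]},
        {"title": "New phone unveiled", "tags": ["technology", "positive"]},
    ]
    for article in profile.rank(candidates):
        print(article["title"])
```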
Spreading the Word
Sharing useful or interesting stories gives us some “social currency” and so we are always keen to pass on articles that we think our friends and colleagues might enjoy or find useful. A good text analysis engine will also aid in this process by, for example, providing hashtag suggestions which allow for more effective sharing of content across social media sites.
Text Analysis provides the tools that make it possible for Content Aggregation systems to make sense of the myriad news articles that are published every day and to present the reader with articles that are honed to their individual tastes. But only when we start focusing on machines “understanding content” before it’s recommended will news aggregators become truly powerful.