Text and Image Analysis: From pixels to characters and back
Image taken from Linchi Kwok’s blog
A picture is worth how many words?
Without a doubt, one of the key things that separates us humans from the rest of the animals is the way we communicate: the volume, complexity and comprehensiveness of communication among humans far exceed those of any other animal.
Over time we have developed and refined our communication methods by creating new conventions (symbols, languages) as well as new channels (telegraph, telephone, newspapers) for communicating with one another, and with the rapid development of technology in the modern age, this trend has accelerated.
Today it’s easier than ever to communicate through online platforms, news sites, social channels and communities, and we are communicating with each other more than ever before by publishing, sharing and consuming information in many forms – most notably text and images.
The volume of both text and images being published and consumed online is growing exponentially:
Number of new photos added to Facebook every month:
- 2009: 2.5bn
- 2010: 3+bn
- 2012: 9bn
Source: http://royal.pingdom.com/2013/01/16/internet-2012-in-numbers/
Number of Tweets posted per day:
- 2009: 17m
- 2010: 68m
- 2011: 300m
- 2012: 450m
So in order to make sense of this vast amount of content, and to get a more holistic view of the world, we need to develop scalable and adaptable technologies that are capable of analyzing both text and images.
Similarities and Differences Between Text and Images
Both are major communication mediums that were once very closely related. In fact, early written languages could be seen as a form of imagery.
In the modern day too, both are found in abundance, nowhere more so than on the internet: separately on the likes of Flickr, blogs or forums, but more often together, on platforms like Instagram, Twitter, Facebook and news sites, each with a varying text-to-image ratio.
There is no doubt that text is the most efficient communication medium in use today. It is better for expressing thoughts, ideas, opinions and abstract concepts, and it gives you far more control over the point you want to get across: the level of ambiguity, tone, context and so on.
Images, on the other hand, have a varying degree of expressiveness. An image may indeed be worth a thousand words, but only when an image exists that conveys your message. How can one describe an abstract concept like Entropy, Human Rights or Wabi-sabi with an image? Let's look at a brief description of each of these concepts from Wikipedia:
“Entropy is a thermodynamic quantity representing the unavailability of a system’s thermal energy for conversion into mechanical work, often interpreted as the degree of disorder or randomness in the system.”
“Human rights are moral principles or norms that describe certain standards of human behaviour, and are regularly protected as legal rights in national and international law.”
“Wabi-sabi (侘寂) represents a comprehensive Japanese world view or aesthetic centered on the acceptance of transience and imperfection.”
It almost hurts to try and visualize any of these concepts as an image, yet the textual descriptions do a pretty good job of giving you at least a general idea of them.
While text does a much better job of describing concepts and expressing thoughts or feelings, images are sometimes easier and more efficient to produce and in some cases, they are a more appropriate and effective way to express yourself.
Question: How many of your friends have written an essay about their newborn’s first smile? How many have posted a picture or video to Facebook?
The truth is, you can capture many of the simple facts and events in your surroundings with a single image, and with minimal effort compared to text. Simply point your phone's camera at something, and with a single tap you've captured and shared the moment with the entire world.
Text and Image Analysis
While text and images differ in many ways and can exist independently, they are in fact complementary and non-competing communication mediums, and to get a holistic view of the world, we would need to analyze both. Understanding images is as important as understanding text, as together they provide a more accurate picture of reality.
From an AI perspective, this means we need hybrid systems that are not only capable of understanding both mediums, but are also able to discover links between the two and leverage those links to enhance the overall performance and accuracy of such analysis systems.
Those of you who read our blog regularly know that we are a Text Analysis company; at a higher level, however, we are an AI company, and we therefore have a strong interest in how complementary AI solutions such as Image Analysis can give us a better understanding of the real world. In particular, we have been interested in how Text Analysis and Image Analysis could be married to improve the insight gathered from content produced on the Internet.
To put some of our ideas into practice, we started by collecting over 150,000 news articles from about 50 major news outlets. We wanted to see if there's a strong link or correlation between the text of an article and the images used in it. For each of these articles, we extracted the article's text as well as its main images. Next, we analyzed the text of each article using our Text Analysis API to find the high-level category of the article (e.g. Technology, Sports, Food) as well as specific concepts and topics mentioned in it (e.g. people, places, organizations).
The images accompanying the text were then analyzed using Imagga's Tagging API, which, for any given image, provides a set of tags describing the objects seen in it. The two analyses were performed independently: when analyzing the text we had no information about the images, and vice versa.
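The joining step described above can be sketched as follows. This is a minimal illustration, not our actual pipeline: the `analyse_text` and `analyse_image` callables stand in for the Text Analysis API and Imagga's Tagging API, and the record shapes are assumptions.

```python
def pair_analyses(articles, analyse_text, analyse_image):
    """Run both analyses independently, then join the results by article id."""
    results = {}
    for article in articles:
        # Each analysis sees only its own medium: the text call never sees
        # the image, and the image call never sees the text.
        text_result = analyse_text(article["text"])       # e.g. {"category": "sports"}
        image_tags = analyse_image(article["image_url"])  # e.g. {"ball": 0.62, ...}
        results[article["id"]] = {
            "category": text_result["category"],
            "tags": image_tags,
        }
    return results
```

Keeping the two analyses independent is what lets us later measure how strongly they agree.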
What we discovered wasn't exactly ground-breaking, but it did confirm a few theories we had: there is a real connection between images and text, and both can be used to improve the insights gathered from an analysis point of view.
Categorization of Articles
As mentioned above, we cross-referenced the tags assigned to each article's main image with the categories identified in its text, to try and uncover similarities and links between the two.
For instance, for the ABC News article titled “Kate Hudson Shows Off Her Amazing Abs” we got “people – celebrity” as the main category of the text, and the following image as the main image of the article:
attractive (28%), model (25%), portrait (24%), pretty (23%), hair, sexy, person, adult, face, blond, caucasian, body, people, fashion, lady, cute, women, smile, glamour, happy, sensual, studio, smiling, human, blonde, lingerie, clothing, erotic, expression, lifestyle, looking, slim, style, fun, gorgeous, healthy, light brown, orange, skin, grey, black, red, pink.
What we’re doing below is finding the most confidently tagged images for each major category, and creating a mosaic out of those images:
What we see here is a strong link between the high-level category of the text of an article, and the main image used in it.
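The mosaic-building step described above boils down to ranking images by tag confidence within each category. A small sketch, where the `(category, image_id, confidence)` record shape is an assumption for illustration:

```python
from collections import defaultdict

def top_images_per_category(records, n=9):
    """Return, per category, the ids of the n highest-confidence images."""
    by_category = defaultdict(list)
    for category, image_id, confidence in records:
        by_category[category].append((confidence, image_id))
    # Sort each category's images by confidence, descending, and keep the top n.
    return {
        category: [img for _, img in sorted(scored, reverse=True)[:n]]
        for category, scored in by_category.items()
    }
```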
We also went a step further and extracted the concepts and entities mentioned in these articles (notable people, places, organizations, general concepts and so on) and looked for a link between these and the tags assigned to each article's main image.
Here we observed that for the most part, there’s a strong association between people, organizations and brands mentioned in an article and the images that accompany them.
However, this is not the case for some types of entities, such as places, or for more abstract concepts such as Human Rights. In other words, when you're talking about a person you're more likely to use an image of a person, but when you're talking about a city you might use any kind of imagery, as the contrast between the mosaics for Apple Inc. and Obama versus New York City and Human Rights shows:
So what is this all about and what are some of the ways we can use text and image analysis together?
Document Classification
When classifying a document such as an article, we can improve classification accuracy by analyzing the text and the images together. Moreover, images are in some cases more universal than text, which can be written in any number of languages; in cases where we can't analyze the text properly, a hybrid approach allows us to fall back on the images for categorization.
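One simple way to combine the two signals is a weighted sum of per-category scores from the text classifier and the image tagger. The scores and the weighting scheme below are illustrative assumptions, not the actual method used:

```python
def hybrid_classify(text_scores, image_scores, text_weight=0.7):
    """Pick the category with the highest weighted combination of
    text-classifier and image-tagger scores."""
    categories = set(text_scores) | set(image_scores)
    combined = {
        c: text_weight * text_scores.get(c, 0.0)
           + (1.0 - text_weight) * image_scores.get(c, 0.0)
        for c in categories
    }
    return max(combined, key=combined.get)
```

When the text can't be analyzed properly (say, an unsupported language), lowering `text_weight` toward 0 falls back entirely on the image signal.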
Named Entity Disambiguation
As mentioned above, text can be ambiguous: when we say "apple", are we referring to the company or the fruit? That's the problem Named Entity Disambiguation tries to solve, but it relies on textual clues to do so: if Steve Jobs is mentioned in the same article, for instance, we are probably referring to the company.
But what if there's not enough textual context (let's say, in a tweet) to provide those clues? We need to look elsewhere, and thankfully a lot of digital content, such as articles, tweets and comments, is often accompanied by an image, which, if analyzed, can also provide contextual clues. As an example, compare the set of images we had for Apple Inc. with what we have below: images tagged as containing a fruit according to Imagga:
So in the Apple Inc. or apple scenario, if our tweet contains an image that is more similar to the images above than the ones about Apple Inc., we can confidently mark the mention of “apple” as a fruit, and not a company.
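This comparison can be sketched as a similarity check between the tweet image's tags and reference tag sets collected for each sense of "apple". Jaccard overlap is just one simple similarity choice, and the tag sets below are illustrative, not taken from the actual mosaics:

```python
def jaccard(a, b):
    """Jaccard similarity between two tag sets: |intersection| / |union|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def disambiguate(image_tags, sense_tags):
    """Return the sense whose reference tags best overlap the image's tags."""
    return max(sense_tags, key=lambda s: jaccard(image_tags, sense_tags[s]))

# Hypothetical reference tag sets for the two senses of "apple".
SENSES = {
    "Apple Inc.": {"computer", "laptop", "phone", "screen", "technology"},
    "apple (fruit)": {"fruit", "food", "fresh", "healthy", "red", "sweet"},
}
```

A tweet image tagged `{"fruit", "red", "food"}` would overlap far more with the fruit set than the company set, so the mention would be resolved to the fruit.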
We're looking forward to seeing what more we can do by combining text and image analysis, and how a hybrid approach can uncover greater insight from content online.