Introduction

There is a wealth of information hidden in the contents and the markup of a web page that can be extremely useful when trying to understand what a page is all about while trawling the web. One classic example would be tags: those short phrases or keywords that bloggers and publishers use to describe what a webpage, article or blog post is about. Tags can be rendered as visual elements on the page, or hidden away using `meta` attributes.

example of meta attributes

It is obvious that by extracting these tags we can learn a whole lot about any blog post or article that we are analyzing. They describe a piece of content the way their author or editor would, and they may contain various pieces of information such as the high-level topical category, the entities (people, places, organizations, etc.) mentioned, or the concepts that the article is about. This makes them an excellent source of information to leverage when classifying web pages.

The problem with extracting these tags is that the way webpages are structured, and the way tags are expressed, differs greatly across pages and sites. The different Content Management Systems used by blogs and news websites each have their own way of presenting metadata such as tags, making it difficult to access and parse this information.

examples of visual tags from various blogging platforms
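
To make the problem concrete, here is a minimal sketch of what hand-rolled tag scraping can look like. It is not how the Tag Extraction feature works internally. It only checks three conventions that are commonly seen in the wild (a `keywords` or `news_keywords` meta element, `article:tag` meta properties, and links marked `rel="tag"`), and the names `TagCollector` and `collect_tags` are made up for this illustration. Any site that uses a different convention would need yet another rule.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collect tag-like strings from a few common HTML conventions."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self._in_tag_link = False

    def handle_starttag(self, name, attrs):
        attrs = dict(attrs)
        content = (attrs.get("content") or "").strip()
        if name == "meta":
            meta_name = (attrs.get("name") or "").lower()
            meta_property = (attrs.get("property") or "").lower()
            if meta_name in ("keywords", "news_keywords"):
                # <meta name="keywords" content="a, b, c">
                self.tags += [t.strip() for t in content.split(",") if t.strip()]
            elif meta_property == "article:tag" and content:
                # <meta property="article:tag" content="a">
                self.tags.append(content)
        elif name == "a":
            # <a rel="tag" href="...">visible tag</a>
            rel_values = (attrs.get("rel") or "").lower().split()
            self._in_tag_link = "tag" in rel_values

    def handle_endtag(self, name):
        if name == "a":
            self._in_tag_link = False

    def handle_data(self, data):
        # Only keep text that sits inside a rel="tag" link.
        if self._in_tag_link and data.strip():
            self.tags.append(data.strip())


def collect_tags(html):
    """Return the tags found in an HTML string, in document order."""
    parser = TagCollector()
    parser.feed(html)
    parser.close()
    return parser.tags
```

Even this small sketch needs one rule per convention, which is the maintenance burden a uniform extraction interface is meant to remove.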

Today we are announcing the launch of a much-requested addition to our Article Extraction API that provides a uniform and standard interface for extracting tags from any blog post or article on the web.

Tag Extraction

We’ve supercharged our article extraction feature in the Text Analysis API to make it even easier to extract useful information from a webpage. Through our Article Extraction endpoint, users already have the ability to extract metadata such as author name, publish date, main image, article title and the main body of text from a page. But in a lot of cases, a web page will contain other useful information about itself, often in the form of tags.

The Tag Extraction feature will identify and extract any relevant tags present on the page no matter the structure of the page or where the tags are present.

So how can these tags be used?

These extracted tags can be used in a number of ways:

To tag or classify a web page

The extracted tags can often give a very useful insight into what a page is about. They are often added manually by the author, an editor or the web designer, which means they can provide very accurate descriptions of the page.

Take, for example, the tags below, extracted from a Wired article on artificial intelligence.

{
  "author": "Cade Metz",
  "image": "https://www.wired.com/wp-content/uploads/2017/01/GettyImages-135579222_HP-1200x630-e1485911637921.jpg",
  "tags": [
    "neural networks",
    "artificial-intelligence",
    "neural-networks",
    "singularity",
    "Singularity",
    "Artificial Intelligence"
  ],
  "article": "For almost three weeks, Dong Kim sat at a casino in Pittsburgh and played poker against a machine. But Kim wasn’t just a...",
  "videos": [ ... ],
  "title": "Inside the Poker AI That Out-Bluffed the Best Humans",
  "publishDate": "2017-02-01T07:00:43+00:00",
  "feeds": [ ... ]
}
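
The tags in this result include near-duplicates that differ only in case or hyphenation, since the same tag can appear in more than one place on a page. If you want one entry per concept before tagging a page, a small clean-up step along these lines may help. This is a sketch of post-processing on the client side, not part of the API, and `dedupe_tags` is a hypothetical helper name.

```python
import re


def dedupe_tags(tags):
    """Drop tags that differ only in case, hyphens, underscores or spacing.

    The first spelling encountered is the one that is kept.
    """
    seen = set()
    unique = []
    for tag in tags:
        # "neural-networks" and "Neural  Networks" both map to "neural networks".
        key = re.sub(r"[\s_-]+", " ", tag).strip().casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(tag)
    return unique


wired_tags = [
    "neural networks",
    "artificial-intelligence",
    "neural-networks",
    "singularity",
    "Singularity",
    "Artificial Intelligence",
]
print(dedupe_tags(wired_tags))
# ['neural networks', 'artificial-intelligence', 'singularity']
```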

Classify a page according to a taxonomy

On their own, the extracted tags are useful for understanding a webpage, but they don’t map directly onto a particular taxonomy.

They can, however, be used as the input for classifying a piece of content or a page into the predetermined categories of a taxonomy. For example, passing the tags to our Classification by Taxonomy feature lets you categorize content efficiently.

Example:
First, we extract the tags from an Irish Times article on Conor McGregor.

{
  "author": "Emmet Malone",
  "image": "http://www.irishtimes.com/image-creator/?id=1.2620063&origw=1440",
  "tags": [
    "UFC",
    "Other Sports",
    "Nate Diaz",
    "Other",
    "Dana White",
    "Sport",
    "Conor Mcgregor"
  ],
  "article": "When this imbroglio finally blows over, we can explore what Conor McGregor has against Connecticut. For now, let’s conce...",
  "videos": [ ... ],
  "title": "Conor McGregor lays cards on table in poker game with UFC",
  "publishDate": "2016-04-21T22:48:00+00:00",
  "feeds": [ ... ]
}

You’ll see the tags present in the results above.

Next, we use our Classification by Taxonomy feature to automatically categorize the content. You’ll see from the results below that it is correctly categorized as Sports and Martial Arts.

{
  "text": "UFC, Other Sports, Nate Diaz, Other, Dana White, Sport, Conor Mcgregor",
  "taxonomy": "iab-qag",
  "language": "en",
  "categories": [
    {
      "confident": true,
      "score": 0.22010621132863611,
      "label": "Sports",
      "links": [ ... ],
      "id": "IAB17"
    },
    {
      "confident": true,
      "score": 0.11470804569304427,
      "label": "Martial Arts",
      "links": [ ... ],
      "id": "IAB17-20"
    }
  ]
}
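
The glue between the two steps is small. Judging by the `text` field in the classification result, the tags were simply joined with commas before being classified. The sketch below shows that join, plus a helper that keeps only the categories flagged as confident. It works on already-parsed responses (plain dictionaries) and makes no API calls; `tags_to_text` and `confident_labels` are hypothetical helper names, and the field names are taken from the results shown in this post.

```python
def tags_to_text(tags):
    """Join extracted tags into one string to use as classification input."""
    return ", ".join(tags)


def confident_labels(classification):
    """Return the labels of all categories marked as confident."""
    return [
        category["label"]
        for category in classification.get("categories", [])
        if category.get("confident")
    ]


extraction = {
    "tags": ["UFC", "Other Sports", "Nate Diaz", "Other",
             "Dana White", "Sport", "Conor Mcgregor"],
}
classification = {
    "categories": [
        {"confident": True, "score": 0.22010621132863611,
         "label": "Sports", "id": "IAB17"},
        {"confident": True, "score": 0.11470804569304427,
         "label": "Martial Arts", "id": "IAB17-20"},
    ],
}

print(tags_to_text(extraction["tags"]))
# UFC, Other Sports, Nate Diaz, Other, Dana White, Sport, Conor Mcgregor
print(confident_labels(classification))
# ['Sports', 'Martial Arts']
```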

Classifying text-light web pages

In most NLP-driven page classification problems, you rely heavily on the main body of text on a page to give context and understanding. However, some web pages contain little or no text, which makes them harder to classify or categorize. Common examples include pages that contain only a video or a collection of photos, and product-style pages like the one below.

an example of a product page that is very light on text

As an example, let’s take the page above, a product page from the Best Buy website. You’ll notice that, apart from a few headings, there’s very little text on the page to analyze. On top of that, there are lots of other elements on the page, like ads and buttons, which make it even harder to scrape. Now imagine how different every product page is: it’s almost impossible to build an efficient script or bot that will classify these pages successfully.

Using the Tag Extraction feature, however, you can leverage other elements of the page, as shown in the results below. As you can see from the JSON results, the listed tags are precise and do an excellent job of describing the page in question.

{
  "author": "",
  "image": "http://www.bestbuy.cahttps://multimedia.bbycastatic.ca/multimedia/products/100x100/104/10423/10423406.jpg",
  "tags": [
    "Smart Lights",
    "Switches & Plugs",
    "Smart Home",
    "Home Living",
    "Best Buy Canada",
    "Smart Lighting"
  ],
  "article": "The Hue Phoenix rises to the occasion when you want to create ambience and mood lighting. Using the Philips Hue app, you...",
  "videos": [ ... ],
  "title": "Philips Hue Phoenix Table Lamp - Opal White",
  "publishDate": "",
  "feeds": [ ... ]
}
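
One way to put this into practice is to fall back to the tags whenever the extracted article text is too short to classify reliably. The sketch below assumes an extraction result parsed into a dictionary with the `article` and `tags` fields shown above. The function name `classification_input` and the 500-character threshold are illustrative assumptions, not values recommended by the API, so tune the cut-off on your own pages.

```python
def classification_input(extraction, min_article_chars=500):
    """Pick the text to classify: the article body, or the tags if it is short.

    min_article_chars is an arbitrary, illustrative cut-off.
    """
    article = (extraction.get("article") or "").strip()
    tags = extraction.get("tags") or []
    if len(article) >= min_article_chars or not tags:
        # Enough body text, or nothing better to fall back on.
        return article
    return ", ".join(tags)


product_page = {
    "article": "The Hue Phoenix rises to the occasion when you want to "
               "create ambience and mood lighting.",
    "tags": ["Smart Lights", "Switches & Plugs", "Smart Home",
             "Home Living", "Best Buy Canada", "Smart Lighting"],
}
print(classification_input(product_page))
```

For the text-light product page, this returns the comma-joined tags rather than the short description.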

Conclusion

Whatever your reason for understanding web pages at scale, this new feature gives you the opportunity to dive even deeper into the web content you analyze, and to classify a wider variety of web pages.

Want to try it for yourself? Sign up to get free access and 1,000 calls per day with our Text Analysis API.