Categorizing Non-Article Webpages

As you know, from time to time at AYLIEN we like to share useful code snippets, apps and use cases that either we’ve put together or our users have built. For this post we wanted to showcase a neat little script one of our engineers hacked together, which you can use when the webpage you’re analyzing is light on content.

Our standard Article Extraction feature extracts the main body of text, the primary image, the headline, the author and so on. It’s primarily designed for article-type or blog-type pages where there is a significant chunk of text on the page.

But what happens when you want to categorize a home page or even a whole domain, or when there just isn’t enough text on a page to analyze?

You have to look elsewhere.

While these types of pages (product pages, home pages, feature pages) may not have one large article-style piece of text, they often carry strong indicators and text in other areas, such as:

  • Headers
  • Meta descriptions
  • Image alt attributes
  • Keyword Tags

These other text sources on webpages can be effectively leveraged when trying to classify “text light” webpages.

Here’s an example demonstrating and explaining how it’s done using our Node.js SDK:

var _ = require('underscore'),
    cheerio = require('cheerio'),
    request = require('request'),
    AYLIENTextAPI = require("aylien_textapi");

var textapi = new AYLIENTextAPI({
    application_id: "YOUR_APP_ID",
    application_key: "YOUR_APP_KEY"
});

var url = 'http://www.bbc.com/';
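request(url, function(err, resp, body) {
  if (err) {
    console.error('Could not fetch ' + url + ':', err);
    return;
  }
  // Build the text to classify from the page's secondary text sources.
  var text = extract(body);
  textapi.classifyByTaxonomy({'text': text, 'taxonomy': 'iab-qag'}, function(err, result) {
    if (err) {
      console.error('Classification failed:', err);
      return;
    }
    console.log(result.categories);
  });
});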

function getText(tagName, $) {
  var texts = _.chain($(tagName)).map(function(e) {
    return $(e).text().trim();
  }).filter(function(t) {
    return t.length > 0;
  }).value();

  return texts.join(' ');
}

function extract(body) {
  var $ = cheerio.load(body);
  // attr() returns undefined when a meta tag is missing, so default to ''.
  var keywords = $('meta[name="keywords"]').attr('content') || '';
  var description = $('meta[name="description"]').attr('content') || '';
  var imgAlts = _($('img[alt]')).map(function(e) {
    return $(e).attr('alt').trim();
  }).join(' ');

  var h1 = getText('h1', $);
  var h2 = getText('h2', $);
  var links = getText('a', $);
  // Combine the meta tags, headings, link text and image alt text.
  var text = [keywords, description, h1, h2, links, imgAlts].join(' ');

  return text;
}

In a nutshell, what we’re doing here is extracting the text present in the meta tags, H1 and H2 tags, link text and image alt attributes within the page’s HTML, instead of relying on the main body of text.
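
On real pages some of these sources are often missing (no meta description, no alt text), and the text that is present tends to carry stray whitespace from the HTML layout. Here is a minimal sketch of merging the pieces defensively before sending them for classification. Note that `combineSources` is a hypothetical helper written for illustration, not part of our SDK, and the sample values are made up.

```javascript
// Hypothetical helper (not part of the AYLIEN SDK): merge whatever text
// sources were found on a page into a single string for classification.
function combineSources(parts) {
  return parts
    .filter(function(p) {
      // Drop sources that were missing from the page (undefined/null).
      return typeof p === 'string';
    })
    .map(function(p) {
      // Collapse runs of whitespace left over from the HTML layout.
      return p.replace(/\s+/g, ' ').trim();
    })
    .filter(function(p) {
      return p.length > 0;
    })
    .join(' ');
}

// Made-up sample values standing in for what extract() might find.
var sample = combineSources([
  'news, sport, weather',   // meta keywords
  undefined,                // no meta description on this page
  '  Breaking\n  News  ',   // h1 text
  ''                        // no image alt text
]);
console.log(sample); // news, sport, weather Breaking News
```

The design choice here is simply to skip anything that is absent rather than let the string "undefined" or empty gaps leak into the text being classified.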

Our ad-tech-focused customers currently use this or a similar approach to classify text-light pages and domains against the IAB QAG taxonomy.
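
Once the categories come back, you will usually want to keep only the stronger matches for a page or domain. The sketch below shows one way to do that. It assumes each entry in `result.categories` carries a `label` and a numeric `score`; those field names, the `pickCategories` helper, the threshold and the sample data are all assumptions for illustration, so check them against the actual API response before relying on them.

```javascript
// Assumption: each category looks like { id, label, score }.
// Field names and sample values are illustrative only.
function pickCategories(categories, minScore) {
  return (categories || [])
    .filter(function(c) {
      return c.score >= minScore;
    })
    .sort(function(a, b) {
      return b.score - a.score; // highest score first
    })
    .map(function(c) {
      return c.label;
    });
}

// Made-up sample standing in for result.categories.
var sampleCategories = [
  { id: 'IAB12', label: 'News', score: 0.41 },
  { id: 'IAB17', label: 'Sports', score: 0.62 },
  { id: 'IAB15', label: 'Science', score: 0.05 }
];
console.log(pickCategories(sampleCategories, 0.2)); // [ 'Sports', 'News' ]
```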

You can copy and paste the snippet for further use or test it out in our Sandbox Environment.





Author


Mike Waldron

Head of Marketing & Sales @ AYLIEN. A legal convert with a master’s degree from Smurfit Business School, Mike runs Sales and Marketing at AYLIEN. He gathered his sales and marketing experience with technology companies in Sydney and Dublin before getting the startup itch and joining the team at AYLIEN. Twitter: @MikeWallly