
I presented some preliminary work on using Generative Adversarial Networks to learn distributed representations of documents at the recent NIPS workshop on Adversarial Training. In this post I provide a brief overview of the paper and walk through some of the code.

Learning document representations

Representation learning has been a hot topic in recent years, in part driven by the desire to apply the impressive results of Deep Learning models on supervised tasks to the areas of unsupervised learning and transfer learning. There is a wide variety of approaches to representation learning in general, but the basic idea is to learn some set of features from data, and then to use these features for some other task where you may only have a small number of labelled examples (such as classification). The features are typically learned by trying to predict various properties of the underlying data distribution, or by using the data to solve a separate (possibly unrelated) task for which we do have a large number of labelled examples.

The ability to do this is desirable for several reasons. In many domains there may be an abundance of unlabelled data available to us, while supervised data is difficult and/or costly to acquire. Some people also feel that we will never be able to build more generally intelligent machines using purely supervised learning (a viewpoint that is illustrated by the now infamous LeCun cake slide).

Word (and character) embeddings have become a standard component of Deep Learning models for natural language processing, but there is less consensus around how to learn representations of sentences or entire documents. One of the most established techniques for learning unsupervised document representations from the literature is Latent Dirichlet Allocation (LDA). Later neural approaches to modeling documents have been shown to outperform LDA when evaluated on a small news corpus (discussed below). The first of these was the Replicated Softmax, which is based on the Restricted Boltzmann Machine; this was later surpassed by a neural autoregressive model called DocNADE.

In addition to autoregressive models like the NADE family, there are two other popular approaches to building generative models at the moment – Variational Autoencoders and Generative Adversarial Networks (GANs). This work is an early exploration to see if GANs can be used to learn document representations in an unsupervised setting.

Modeling Documents with Generative Adversarial Networks

In the original GAN setup, a generator network learns to map samples from a (typically low-dimensional) noise distribution into the data space, and a second network called the discriminator learns to distinguish between real data samples and fake generated samples. The generator is trained to fool the discriminator, with the intended goal being a state where the generator has learned to create samples that are representative of the underlying data distribution, and the discriminator is unsure whether it is looking at real or fake samples.

There are a couple of questions to address if we want to use this sort of technique to model documents:

  • At Aylien we are primarily interested in using the learned representations for new tasks, rather than doing some sort of text generation. Therefore we need some way to map from a document to a latent space. One shortcoming with this GAN approach is that there is no explicit way to do this – you cannot go from the data space back into the low-dimensional latent space. So what is our representation?
  • As this requires both steps to be end-to-end differentiable, how do we represent collections of discrete symbols?

To answer the first question, some extensions to the standard GAN model train an additional neural network to perform this mapping (like this, this and this).

Another, simpler idea is to just use some internal part of the discriminator as the representation (as is done in the DCGAN paper). We experimented with both approaches, but so far have gotten better results with the latter. Specifically, we use a variation on the Energy-based GAN model, where our discriminator is a denoising autoencoder, and use the autoencoder bottleneck as our representation (see the paper for more details).

As for representing discrete symbols, we take the most overly-simplified approach that we can – assume that a document is just a binary bag-of-words vector (i.e., a vector in which there is a 1 if a given word in a fixed vocabulary is present in a document, and a 0 otherwise). Although this is actually still a discrete vector, we can now just treat it as if all elements were continuous in the range [0, 1] and backpropagate through the full network.
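To make this input format concrete, here is a small sketch (our own toy example with a made-up vocabulary, not code from the paper) of turning a tokenized document into such a binary vector:

import numpy as np

vocab = ["price", "game", "team", "windows", "disk"]   # toy vocabulary
word_to_id = {w: i for i, w in enumerate(vocab)}

def to_binary_bow(tokens):
    # 1 if the vocabulary word appears in the document, 0 otherwise
    x = np.zeros(len(vocab), dtype=np.float32)
    for token in tokens:
        if token in word_to_id:
            x[word_to_id[token]] = 1.0
    return x

print(to_binary_bow("the team won the game".split()))  # [0. 1. 1. 0. 0.]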

The full model looks like this:

[Figure: the adversarial document model architecture]

Here z is a noise vector, which passes through a generator network G and produces a vector that is the size of the vocabulary. We then pass either this generated vector or a sampled bag-of-words vector from the data (x) to our denoising autoencoder discriminator D. The vector is then corrupted with masking noise C, mapped into a lower-dimensional space by an encoder, mapped back to the data space by a decoder and then finally the loss is taken as the mean-squared error between the input to D and the reconstruction. We can also extract the encoded representation (h) for any input document.

An overview of the TensorFlow code

The full source for this model can be found at https://github.com/AYLIEN/adversarial-document-model; here we will just highlight some of the more important parts.

In TensorFlow, the generator is written as:


# `linear` is a fully-connected layer helper defined in the repo,
# and `slim` refers to tf.contrib.slim
def generator(z, size, output_size):
    h0 = tf.nn.relu(slim.batch_norm(linear(z, size, 'h0')))
    h1 = tf.nn.relu(slim.batch_norm(linear(h0, size, 'h1')))
    return tf.nn.sigmoid(linear(h1, output_size, 'h2'))

This function takes the noise vector z, the size of the generator’s hidden dimension and the size of the final output dimension as parameters. It then passes the noise vector through two fully-connected ReLU layers (with batch norm), before passing the output through a final sigmoid layer.

The discriminator is similarly straightforward:


def discriminator(x, mask, size):
    noisy_input = x * mask                                      # apply the masking (corruption) noise
    h0 = leaky_relu(linear(noisy_input, size, 'h0'))            # encode into the bottleneck
    h1 = linear(h0, x.get_shape()[1], 'h1')                     # decode back to vocabulary size
    diff = x - h1
    return tf.reduce_mean(tf.reduce_sum(diff * diff, 1)), h0    # reconstruction loss and representation

It takes a vector x, a noise mask and the size of the autoencoder bottleneck. The noise is applied to the input vector, which is then passed through a single leaky ReLU layer and mapped linearly back to the input space. It returns both the reconstruction loss and the bottleneck tensor.

The full model is:


with tf.variable_scope('generator'):
    self.generator = generator(z, params.g_dim, params.vocab_size)

with tf.variable_scope('discriminator'):
    self.d_loss, self.rep = discriminator(x, mask, params.z_dim)

with tf.variable_scope('discriminator', reuse=True):
    self.g_loss, _ = discriminator(self.generator, mask, params.z_dim)

margin = params.vocab_size // 20
self.d_loss += tf.maximum(0.0, margin - self.g_loss)

vars = tf.trainable_variables()
self.d_params = [v for v in vars if v.name.startswith('discriminator')]
self.g_params = [v for v in vars if v.name.startswith('generator')]

step = tf.Variable(0, trainable=False)

# Build the optimisers as training ops so they can be run directly in
# session.run during training (each updates only its own network's parameters)
self.d_opt = tf.train.AdamOptimizer(
    learning_rate=params.learning_rate,
    beta1=0.5
).minimize(self.d_loss, var_list=self.d_params)

self.g_opt = tf.train.AdamOptimizer(
    learning_rate=params.learning_rate,
    beta1=0.5
).minimize(self.g_loss, var_list=self.g_params)

We first create the generator, then two copies of the discriminator network (one taking real samples as input, and one taking generated samples). We then complete the discriminator loss by adding the cost from the generated samples (with an energy margin), and create separate Adam optimisers for the discriminator and generator networks. The magic Adam beta1 value of 0.5 comes from the DCGAN paper, and similarly seems to stabilize training in our model.

This model can then be trained as follows:


def update(model, x, opt, loss, params, session):
    z = np.random.normal(0, 1, (params.batch_size, params.z_dim))
    mask = np.ones((params.batch_size, params.vocab_size)) * np.random.choice(
        2,
        params.vocab_size,
        p=[params.noise, 1.0 - params.noise]
    )
    loss, _ = session.run([loss, opt], feed_dict={
        model.x: x,
        model.z: z,
        model.mask: mask
    })
    return loss
 
# … TF training/session boilerplate …
 
for step in range(params.num_steps + 1):
    _, x = next(training_data)
 
    # update discriminator
    d_losses.append(update(
        model,
        x,
        model.d_opt,
        model.d_loss,
        params,
        session
    ))
 
    # update generator
    g_losses.append(update(
        model,
        x,
        model.g_opt,
        model.g_loss,
        params,
        session
    ))

Here we get the next batch of training data, then update the discriminator and generator separately. At each update, we generate a new noise vector to pass to the generator, and a new noise mask for the denoising autoencoder (the same noise mask is used for each input in the batch).

Experiments

To compare with previous published work in this area (LDA, Replicated Softmax, DocNADE) we ran some experiments with this adversarial model on the 20 Newsgroups dataset. It must be stressed that this is a relatively toy dataset by current standards, consisting of a collection of around 19,000 postings to 20 different newsgroups.

One open question with generative models (and GANs in particular) is what metric do you actually use to evaluate how well they are doing? If the model yields a proper joint probability over the input, a popular choice is to evaluate the likelihood of a held-out test set. Unfortunately this is not an option for GAN models.

Instead, as we are only really interested in the usefulness of the learned representation, we follow previous work and compare how likely similar documents are to have representations that are close together in vector space. Specifically, we create vectors for every document in the dataset. We then use the held-out test set vectors as “queries”, and for each query we find the closest N documents in the training set (by cosine similarity). We then measure what percentage of these retrieved training documents have the same newsgroup label as the query document, and plot a curve of the retrieval performance for different values of N. The results are shown below.
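A simplified sketch of this evaluation (our own illustration rather than the exact evaluation code) might look like the following, assuming train_vecs/test_vecs are NumPy arrays of learned document vectors and train_labels/test_labels the corresponding newsgroup labels:

import numpy as np

def precision_at_n(test_vecs, test_labels, train_vecs, train_labels, n):
    # L2-normalise so that a dot product equals cosine similarity
    train = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    test = test_vecs / np.linalg.norm(test_vecs, axis=1, keepdims=True)
    sims = test.dot(train.T)                    # (num_test, num_train) similarities
    top_n = np.argsort(-sims, axis=1)[:, :n]    # indices of the n nearest training documents
    matches = train_labels[top_n] == test_labels[:, None]
    return matches.mean()                       # fraction of retrieved documents sharing the query's label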


Precision-recall curves for the document retrieval task on the 20 Newsgroups dataset. ADM is the adversarial document model, ADM (AE) is the adversarial document model with a standard Autoencoder as the discriminator (and so is similar to the Energy-Based GAN), and DAE is a Denoising Autoencoder.

Here we can see a few notable points:

  • The model does learn useful representations, but is still not reaching the performance of DocNADE on this task. At lower recall values though it is better than the LDA results on the same task (not shown above, see the Replicated Softmax paper).
  • By using a denoising autoencoder as the discriminator, we get a bit of a boost versus just using a standard autoencoder.
  • We get quite a large improvement over just training a denoising autoencoder with similar parameters on this dataset.

We also looked at whether the model produced results that were easy to interpret. We note that the autoencoder bottleneck has weights connecting it to every word in the vocabulary, so we looked to see if specific hidden units were strongly connected to groups of words that could be interpreted as newsgroup topics. Interestingly, we find some evidence of this, as shown in the table below, where we present the words most strongly associated with three of these hidden units. They generally fit into understandable topic categories, with a few exceptions. However, we note that these are cherry-picked examples, and that overall the weights for a specific hidden unit do not tend to strongly associate with single topics.

Computing    Sports      Religion
windows      hockey      christians
pc           season      windows
modem        players     atheists
scsi         baseball    waco
quadra       rangers     batf
floppy       braves      christ
xlib         leafs       heart
vga          sale        arguments
xterm        handgun     bike
shipping     bike        rangers

 

We can also see reasonable clustering of many of the topics in a t-SNE plot of the test-set vectors (one colour per topic), although some are clearly still being confused:

[Figure: t-SNE plot of the test-set document vectors, coloured by newsgroup]

Conclusion

We showed some interesting first steps in using GANs to model documents, admittedly perhaps asking more questions than we answered. In the time since the completion of this work, there have been numerous proposals to improve GAN training (such as this, this and this), so it would be interesting to see if any of the recent advances help with this task. And of course, we still need to see if this approach can be scaled up to larger datasets and vocabularies. The full source code is now available on GitHub; we look forward to seeing what people do with it.










Every day, over 100,000 flights carry passengers to and from destinations all around the world, and it’s safe to say air travel brings out a fairly mixed bag of emotions in people. Through social media, customers now have a platform to say exactly what’s on their mind while they are traveling, creating a real-time stream of customer opinion on social networks.

If you follow this blog you’ll know that we regularly use Natural Language Processing to get insights into topical subjects ranging from the US Presidential Election to the Super Bowl ad battle. In this post, we thought it would be interesting to collect and analyze Tweets about airlines to see how passengers use Twitter as a platform to voice their opinion. We wanted to compare how often some of the better known airlines are mentioned by travelers on Twitter, what the general sentiment of those mentions was, and how people’s sentiment varied when they were talking about different aspects of air travel.

Collecting Tweets

We chose five airlines and gathered 25,000 of the most recent Tweets mentioning them (from Friday, June 9). We chose the most recent Tweets in order to get a snapshot of what people were talking about at any given time.

Airlines

The airlines we chose were:

  1. American Airlines – the largest American airline
  2. Lufthansa – the largest European airline
  3. Ryanair – a low-fares giant that is always courting publicity
  4. United Airlines – an American giant that is always (inadvertently) courting publicity
  5. Aer Lingus – naturally (we’re Irish).

Analysis

We’ll cover the following analyses:

  • Volume of tweets and mentions
  • Document-Level Sentiment Analysis
  • Aspect-based Sentiment Analysis

Tools used

Sentiment Analysis

Sentiment analysis, also known as opinion mining, allows us to use computers to analyze the sentiment of a piece of text – essentially, to get an idea of whether a piece of text is positive, negative or neutral.

For example, below is a chart showing the sentiment of Tweets we gathered that mentioned our target airlines.

This chart shows us a very high-level summary of people’s opinions towards each airline. You can see that the sentiment is generally more negative than positive, particularly in the case of the two US-based carriers, United and American. We can also see that negative Tweets account for a larger share of Ryanair’s Tweets than any other airline. While this gives us a good understanding of the public’s opinion about these airlines at the time we collected the tweets, it doesn’t tell us much about what exactly people were speaking positively or negatively about.

Aspect-based Sentiment Analysis digs in deeper

So sentiment analysis can tell us what the sentiment of a piece of text is. But text produced by people usually talks about more than one thing and often has more than one sentiment. For example, someone might write that they didn’t like how a car looked but did like how quiet it was, and a document-level sentiment analysis model would just look at the entire document and add up whether the overall sentiment was mostly positive or negative.

This is where Aspect-based Sentiment Analysis comes in, as it goes one step further and analyzes the sentiment attached to each subject mentioned in a piece of text. This is especially valuable since it allows you to extract richer insights about text that might be a bit complicated.

Here’s an example of our Aspect-based Sentiment Analysis demo analyzing the following piece of text: “This car’s engine is as quiet as hell. But the seats are so uncomfortable!”

[Screenshot: aspect-level sentiment results for the car review]

It’s clear that Aspect-based Sentiment Analysis can provide more granular insight into the polarity of a piece of text but another problem you’ll come across is context. Words mean different things in different contexts – for instance quietness in a car is a good thing, but in a restaurant it usually isn’t – and computers need help understanding that. With this in mind we’ve tailored our Aspect-based Sentiment Analysis feature to recognize aspects in four industries: restaurants, cars, hotels, and airlines.

So while the example above was analyzing the car domain, below is the result of an analysis of a review of a restaurant, specifically the text “It’s as quiet as hell in this restaurant”:

[Screenshot: aspect-level sentiment results for the restaurant review]

Even though the text was quite similar to the car review, the model recognized that the words expressed a different sentiment because they were mentioned in a different context.
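If you want to reproduce this kind of analysis yourself, a minimal sketch with the aylien-apiclient Python SDK might look like the following (the method and response field names here are our assumptions about the SDK – check the Text Analysis API documentation for the exact interface):

from aylienapiclient import textapi

# Credentials from your AYLIEN dashboard (placeholders here)
client = textapi.Client("YOUR_APP_ID", "YOUR_APP_KEY")

# Assumed method and parameter names -- see the API docs for the exact interface
car_review = client.AspectBasedSentiment({
    'domain': 'cars',
    'text': "This car's engine is as quiet as hell. But the seats are so uncomfortable!"
})
restaurant_review = client.AspectBasedSentiment({
    'domain': 'restaurants',
    'text': "It's as quiet as hell in this restaurant"
})

# Each response is assumed to contain a list of aspects, each with a polarity
for aspect in car_review['aspects']:
    print(aspect)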

Aspect-based Sentiment Analysis in airlines

Now let’s see what we can find in the Tweets we collected about airlines. In the airlines domain, our endpoint recognizes 10 different aspects that people are likely to mention when talking about their experience with airlines.

[Figure: the 10 aspects recognized in the airlines domain]

Before we look at how people felt about each of these aspects, let’s take a look at which aspects they were actually talking about the most.

Noise is a big problem when you’re analyzing social media content. For instance when we analyzed our 25,000 Tweets, we found that almost two thirds had no mention of the aspects we’ve listed above. These Tweets mainly focused on things like online competitions, company marketing material or even jokes about the airlines. When we filtered these noisy Tweets out, we were left with 9,957 Tweets which mentioned one or more aspects.

The chart below shows which of the 10 aspects were mentioned the most.

On one hand it might come as a surprise to see aspects like food and comfort mentioned so infrequently – when you think about people giving out about airlines you tend to think of them complaining about food or the lack of legroom. On the other hand, it’s no real surprise to see aspects like punctuality and staff mentioned so much.

You could speculate that comfort and food are pretty standard across airlines (nobody expects a Michelin-starred airline meal), but punctuality can vary, so people can be let down by this (when your flight is late it’s an unpleasant surprise, which you would be more likely to Tweet about).

What people thought about each airline on key aspects

Now that we know what people were talking about, let’s take a look at how they felt. We’re going to look at how each airline performed on four interesting aspects:

  1. Staff – the most-mentioned aspect;
  2. Punctuality – to see which airline receives the best and worst sentiment for delays;
  3. Food – infrequently mentioned but a central part of the in-flight experience;
  4. Luggage – which airline gets the most Tweets about losing people’s luggage?

Staff

We saw in the mentions graph above that people mentioned staff the most when tweeting about an airline. You can see from the graph below that people are highly negative about airline staff in general, with a fairly equal level of negativity towards each airline except Lufthansa, which actually receives more positive sentiment than negative.


Punctuality

People’s second biggest concern was punctuality, and you can see below that the two US-based airlines score particularly badly on this aspect. Also, it’s worth noting that while Ryanair receives very negative sentiment in general, people complain about Ryanair’s punctuality less than about any of the other airlines. This isn’t too surprising considering their exemplary punctuality record is one of their major USPs as an airline and something they like to publicize.


Food

We all know airline food isn’t the best, but when we looked at the sentiment about food in the Tweets, we found that people generally weren’t that vocal about their opinions on plane food. Lufthansa receives the most positive sentiment about this aspect, with their pretty impressive culinary efforts paying off. However, it’s an entirely different story when it comes to the reaction to United’s food. None of us in the AYLIEN office have ever flown United, so from the results we got, we’re all wondering what they’re feeding their passengers.


Luggage

The last aspect that we compared across the airlines was luggage. When you take a look at the sentiment here, you can see that Lufthansa again performs quite well, but this time Aer Lingus fares pretty badly. Maybe leave your valuables at home next time you fly with Ireland’s national carrier.

Ryanair and Lufthansa compared

So far we’ve shown just four of the 10 aspects our Aspect-based Sentiment Analysis feature analyzes in the airlines domain. To show all of them together, we decided to take two very different airlines and put them side by side to see how people’s opinions on each of them compared.

We picked Ryanair and Lufthansa so you can compare a “no frills” budget airline that focuses on short-haul flights, with a more expensive, higher-end offering and see what people Tweet about each.

First, here’s the sentiment that people showed towards every aspect in Tweets that mention Lufthansa.

Below is the same analysis of Tweets that mention Ryanair.

You can see that people express generally more positive sentiment towards Lufthansa than Ryanair.  This is no real surprise since this is a comparison of a budget airline with a higher-end competitor, and you would expect people’s opinions to differ on things like food and flight experience.

But it’s interesting to note the sentiment was actually pretty similar towards the two core aspects of air travel – punctuality and value.

The most obvious outlier here is the overwhelmingly negative sentiment about entertainment on Ryanair flights, especially since there is no entertainment on Ryanair flights. This spike in negativity was due to an incident involving drunk passengers on a Ryanair flight that was covered by the media on the day we gathered our Tweets, skewing the sentiment in the Tweets we collected. These temporary fluctuations are a problem inherent in looking at snapshot-style data samples, but from a voice-of-the-customer point of view they are certainly something an airline needs to be aware of.

This is just one example of how you can use our Text Analysis API to extract meaning from content at a large scale. If you’d like to use AYLIEN to extract insights from any text you have in mind, click on the image at the end of the post to get free access to the API and start analyzing your data. With the extensive documentation, how-to blogs, detailed tutorials and great customer support, you’ll have all the help you need to get going in no time!










For the next instalment of our monthly media roundup using our News API, we thought we’d take a look at the content that was shared most on social media in the month of May. Finding out what content performs well on each social network gives us valuable insights into what media people are consuming and how this varies across different networks. To get these insights, we’re going to take a look at the most-shared content on Facebook, LinkedIn and Reddit.

Together, the stories we analyzed for this post were shared over 10 million times last month. Using the News API, we can easily extract insights about this content in a matter of minutes. With millions of new stories added every month in near real-time, News API users can analyze news content at any scale for whatever topic they want to dig into.

Most Shared Stories on Each Social Network

Before we jump into all of this content, let’s take a quick look at what the top three most-shared stories on each social network were. Take particular note of the style of articles and the subject matter of each article and how they differ across each social network.
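For those curious how lists like these can be produced, here is a rough sketch using the News API Python SDK (the configuration step and parameter values are our assumptions and may differ between SDK versions – see the News API documentation for the exact interface):

import aylien_news_api

configuration = aylien_news_api.Configuration()
configuration.api_key['X-AYLIEN-NewsAPI-Application-ID'] = 'YOUR_APP_ID'
configuration.api_key['X-AYLIEN-NewsAPI-Application-Key'] = 'YOUR_APP_KEY'
api = aylien_news_api.DefaultApi(aylien_news_api.ApiClient(configuration))

# Most-shared stories on Facebook published in May (assumed date and sort values)
response = api.list_stories(
    published_at_start='2017-05-01T00:00:00Z',
    published_at_end='2017-05-31T23:59:59Z',
    sort_by='social_shares_count.facebook',
    per_page=3
)
for story in response.stories:
    print(story.title, story.source.name)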

Most shared stories on Facebook in May

  1. “Drowning Doesn’t Look Like Drowning,” Slate, 1,337,890 shares.
  2. “This “All About That Bass” Cover Will Make Every Mom Crack Up,” Popsugar, 913,768 shares.
  3. “Why ’80s Babies Are Different Than Other Millennials,” Popsugar, 889,788 shares.


Most shared stories on LinkedIn in May

  1. “10 Ways Smart People Stay Calm,” Huffington Post UK, 8,398 shares.
  2. “Pepsi Turns Up The Heat This Summer With Release Of Limited-Edition Pepsi Fire,” PR Newswire, 7,769 shares.
  3. “In Just 3 Words, LinkedIn’s CEO Taught a Brilliant Lesson in How to Find Great People,” Inc.com, 7,389 shares.


Most shared stories on Reddit in May:

  1. “Trump revealed highly classified information to Russian foreign minister and ambassador,” The Washington Post, 146,534 upvotes.
  2. “Macron wins French presidency by decisive margin over Le Pen,” The Guardian, 115,478 upvotes.
  3. “Youtube family who pulled controversial pranks on children lose custody,” The Independent, 101,153 upvotes.


Content Categories

Even from the article titles alone, you can already see there is a difference between the type of stories that do well on each social network. Of course it’s likely you already knew this if you’re active on any of these particular social networks. To start our analysis, we decided to try and quantify this difference by gathering the most-shared stories on each network and categorizing them automatically using our News API to look for particular insights.

From this analysis, you can see a clear difference in the type of content people are more likely to share on each network.

LinkedIn

LinkedIn users predictably share a large amount of career-focused content. More surprisingly, however, stories that fall into the Society category were also very popular on LinkedIn.

Most-shared stories by category on LinkedIn in May

Reddit

Reddit is a content-sharing website that has a reputation for being a place where you can find absolutely anything, especially more random, alternative content than you would find on other social media. So it might come as a bit of a surprise to see that over half of the most-shared content on Reddit falls into just two categories, Politics and News.

Most-shared stories by category on Reddit in May


Facebook

Not surprisingly our analysis, as shown in the pie chart below, shows that almost half of the most-shared stories on Facebook are about either entertainment or food.

Most-shared stories by category on Facebook in May


Note: As a reminder we’ve only analyzed the most shared, liked and upvoted content on each platform.

Topics and Keywords

So far we’ve looked at what categories the most shared stories fall into across each social channel, but we also wanted to dig a little deeper into the topics they discussed in order to understand what content did better on each network. We can do this by extracting keywords, entities and concepts that were mentioned in each story and see which were mentioned most. When we do this, you can see a clear difference between the topics people share on each network.

LinkedIn

Below, you can see the keywords from the most shared stories on LinkedIn. These keywords are mostly business-focused, which validates what we found with the categories feature above.

Keywords extracted from the most-shared stories on LinkedIn in May

Reddit

Likewise with Reddit, you can see below that the keywords validate what the categorization feature found – that most of the content is about politics and news.

Keywords extracted from the most-shared stories on Reddit in May

Facebook

However, on Facebook the most popular content tends to include mentions of family topics like “father,” “kids,” and “baby” (with the obligatory mentions of “Donald Trump,” of course). This doesn’t correspond with what we found when we looked at what categories the stories belonged to – Arts & Entertainment and Food made up almost 50% of the most-shared content. Take a look below at which keywords appeared most frequently in the most-shared content.

Keywords extracted from the most-shared stories on Facebook in May

In order to find out why there wasn’t as clear a correlation between keywords and categories like we saw on the other platforms, we decided to dive into where this most shared content on Facebook was coming from. Using the source domain feature on the stories endpoint, we found that over 30% of the most shared content was published by one publication – Popsugar. Popsugar, for those who don’t know, is a popular lifestyle media publisher whose content is heavily weighted towards family oriented content with a strong celebrity slant. This means a lot of the content published on Popsugar could be categorized as Arts and Entertainment, while also talking about families.

Most-shared stories by source on Facebook in May

Content Length

After we categorized the stories and analyzed the topics they discuss, we thought it might be interesting to understand what type of content, long-form or short-form, performs best across each platform. We wanted to see if the length of an article is a good indicator of how content performs on a social network. Our guess was that shorter pieces of content would perform best on Facebook, while longer articles would most likely be more popular on LinkedIn. Using the word count feature on the histograms endpoint, it’s extremely easy to understand the relationship between an article’s popularity and its length.
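The histograms endpoint does this aggregation server-side; as a rough local illustration of the same idea (with made-up word counts rather than real stories), the bucketing is simply:

import numpy as np

# Toy example: word counts for a handful of article bodies (made-up numbers)
word_counts = [87, 340, 95, 1200, 60, 450, 980, 72]
bins = np.arange(0, 1400, 100)                      # 0-100, 100-200, ... word buckets
hist, _ = np.histogram(word_counts, bins=bins)

for start, count in zip(bins[:-1], hist):
    if count:
        print(f"{start}-{start + 100} words: {count} stories")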

For example, below you can see that the content people shared most on Facebook was usually between 0 and 100 words in length, with people sharing longer posts on LinkedIn and Reddit.

Word count of the most-shared stories on each platform

Conclusions

So to wrap up, we can come to some conclusions about what content people shared in May:

  1. People shared shorter, family-oriented and lighthearted content on Facebook;
  2. Longer, breaking news content involving Donald Trump dominated Reddit;
  3. On LinkedIn, people shared both short and long content that mainly focused on career development and companies.

If you’d like to try the News API out for yourself, click on the image below to start your free 14-day trial, with no credit card required.









Artificial Intelligence and Machine Learning play a bigger part in our lives today than most people can imagine. We use intelligent services and applications every day that rely heavily on Machine Learning advances. Voice activation services like Siri or Alexa, image recognition services like Snapchat or Google Image Search, and even self-driving cars all rely on the ability of machines to learn and adapt.

If you’re new to Machine Learning, it can be very easy to get bogged down in buzzwords and complex concepts of this dark art. With this in mind, we thought we’d put together a quick introduction to the basics of Machine Learning and how it works.

Note: This post is aimed at newbies – if you know a Bayesian model from a CNN, head on over to the research section of our blog, where you’ll find posts on more advanced subjects.

So what exactly is Machine Learning?

Machine Learning refers to a process that is used to train machines to imitate human intuition – to make decisions without having been told what exactly to do.

Machine Learning is a subfield of computer science, and you’ll find it defined in many ways, but the simplest is probably still Arthur Samuel’s definition from 1959: “Machine Learning gives computers the ability to learn without being explicitly programmed”. Machine Learning explores how programs, or more specifically algorithms, learn from data and make predictions based on it. These algorithms differ from traditional programs by not relying on strictly coded instructions, but by making data-driven, informed predictions or decisions based on sample training inputs. Its applications in the real world are highly varied, but the one common element is that every Machine Learning program learns from past experience in order to make predictions in the future.

Machine Learning can be used to process massive amounts of data efficiently, as part of a particular task or problem. It relies on specific representations of data, or “features”, in order to recognise something. Just as a person who sees a cat can recognize it from visual features like its shape, its tail length, and its markings, Machine Learning algorithms learn from patterns and features in data they have previously analyzed.

Different types of Machine Learning

There are many types of Machine Learning programs or algorithms. The most common ones can be split into three categories or types:

    1. Supervised Machine Learning
    2. Unsupervised Machine Learning
    3. Reinforcement Learning

1. Supervised Machine Learning

Supervised learning refers to how a Machine Learning application has been trained to recognize patterns and features in data. It is “supervised”, meaning it has been trained or taught using correctly labeled (usually by a human) training data.

The way supervised learning works isn’t too different to how we learn as humans. Think of how you teach a child: when a child sees a dog, you point at it and say “Look! A dog!”. What you’re doing here essentially is labelling that animal as a “dog”. Now, it might take a few hundred repetitions, but after a while the child will see another dog somewhere and say “dog” of their own accord. They do this by recognising the features of a dog and the association of those features with the label “dog”, and a supervised Machine Learning model works in much the same way.

It’s easily explained using an everyday example that you have certainly come across. Let’s consider how your email provider catches spam. First, the algorithm used is trained on a dataset or list of thousands of examples of emails that are labelled as “Spam” or “Not spam”. This dataset can be referred to as “training data”. The “training data” allows the algorithm to build up a detailed picture of what a Spam email looks like. After this training process, the algorithm should be able to decide what label (Spam or Not spam) should be assigned to future emails based on what it has learned from the training set. This is a common example of a Classification algorithm – a supervised algorithm trained on pre-labeled data.

Training a spam classifier
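To make the training step concrete, here is a toy sketch of a supervised spam classifier built with scikit-learn (our own illustrative example, not a production spam filter):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "WIN a FREE holiday, click now",     # spam
    "Cheap pills, limited offer",        # spam
    "Meeting moved to 3pm tomorrow",     # not spam
    "Here are the slides from today",    # not spam
]
labels = ["spam", "spam", "not spam", "not spam"]

# Turn each email into word-count features, then fit a Naive Bayes classifier
vectorizer = CountVectorizer()
features = vectorizer.fit_transform(emails)
classifier = MultinomialNB().fit(features, labels)

new_email = ["Click now to win a free offer"]
print(classifier.predict(vectorizer.transform(new_email)))  # most likely: ['spam']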

2. Unsupervised Machine Learning

Unsupervised learning takes a different approach. As you can probably gather from the name, unsupervised learning algorithms don’t rely on pre-labeled training data to learn. Instead, they attempt to recognize patterns and structure in the data. These patterns can then be used to make decisions or predictions when new data is introduced to the problem.

Think back to how supervised learning teaches a child how to recognise a dog, by showing it what a dog looks like and assigning the label “dog”. Unsupervised learning is the equivalent of leaving the child to their own devices and not telling them the correct word or label to describe the animal. After a while, they would start to recognize that a lot of animals, while similar to each other, have their own characteristics and features, meaning they can be grouped together, cats with cats and dogs with dogs. The child has not been told what the correct label is for a cat or dog, but based on the features identified they can make a decision to group similar animals together. An unsupervised model works in the same way, by identifying features, structure and patterns in data which it uses to group or cluster similar data together.

Amazon’s “customers also bought” feature is a good example of unsupervised learning in action. Millions of people buy different combinations of books on Amazon every day, and these transactions provide a huge amount of data on people’s tastes. An unsupervised learning algorithm analyzes this data to find patterns in these transactions, and returns relevant books as suggestions. As trends change or new books are published, people will buy different combinations of books, and the algorithm will adjust its recommendations accordingly, all without needing help from a human. This is an example of a clustering algorithm – an unsupervised algorithm that learns by identifying common groupings of data.

Clustering visualization
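And here is a toy sketch of clustering in the same spirit – a simple KMeans grouping of made-up purchase vectors, not Amazon’s actual recommendation system:

import numpy as np
from sklearn.cluster import KMeans

# Each row is a customer; each column is 1 if they bought that book, 0 otherwise
purchases = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
])

# Group customers with similar purchase patterns -- no labels involved
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(purchases)
print(kmeans.labels_)   # cluster assignment for each customer, e.g. [1 1 0 0]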

Supervised Versus Unsupervised Algorithms

Each of these two methods has its own strengths and weaknesses, and where one should be used over the other depends on a number of different factors:

    The availability of labelled data to use for training
    Whether the desired outcome is already known
    Whether we have a specific task in mind or we want to make a program for very general use
    Whether the task at hand is resource or time sensitive

Put simply, supervised learning is excellent at tasks where there is a degree of certainty about the potential outcomes, whereas unsupervised learning thrives in situations where the context is more unknown.

In the case of supervised learning algorithms, the range of problems they can solve can be constrained by their reliance on training data, which is often difficult or expensive to obtain. In addition, a supervised algorithm can usually only be used in the context you trained it for. Imagine a food classifier that has only been trained on pictures of hot dogs – sure it might do an excellent job at recognising hotdogs in images, but when it’s shown an image of a pizza all it knows is that that image doesn’t contain a hotdog.


The limits of supervised learning – HBO’s Silicon Valley


Unsupervised learning approaches also have many drawbacks: they are more complex, they need much more computational power, and they are not yet nearly as well understood theoretically as supervised learning. However, more recently they have been at the center of ML research and are often referred to as the next frontier in AI. Unsupervised learning gives machines the ability to learn by themselves, to extract information about the context you put them in, which, essentially, is the core challenge of Artificial Intelligence. Compared with supervised learning, unsupervised learning offers a way to teach machines something resembling common sense.

3. Reinforcement Learning

Reinforcement learning is the third approach that you’ll most commonly come across. A reinforcement learning program tries to teach itself accuracy in a task by continually giving itself feedback based on its surroundings, and continually updating its behaviour based on this feedback. Reinforcement learning allows machines to automatically decide how to behave in a particular environment in order to maximize performance, based on ‘reward’ feedback or a reinforcement signal. This approach can only be used in an environment where the program can take signals from its surroundings as positive or negative feedback.

Reinforcement Learning in action


Imagine you’re programming a self-driving car to teach itself to become better at driving. You would program it to understand that certain actions – like going off the road, for example – are bad, by providing negative feedback as a reinforcement signal. The car will then look at data where it went off the road before, and try to avoid similar outcomes. For instance, if the car sees that when it didn’t slow down at a corner it was more likely to end up driving off the road, but when it slowed down this outcome was less likely, it will learn to slow down at corners more.
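Under the hood, this kind of feedback loop is often implemented with a value-update rule such as Q-learning. The sketch below shows the standard textbook update (a generic illustration with made-up states and rewards, not code from any real self-driving system):

import numpy as np

n_states, n_actions = 5, 2         # toy environment: 5 road situations, actions = [slow down, keep speed]
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.9            # learning rate and discount factor

def q_update(state, action, reward, next_state):
    # Move Q(state, action) towards the observed reward plus the best predicted future value
    best_next = np.max(Q[next_state])
    Q[state, action] += alpha * (reward + gamma * best_next - Q[state, action])

# e.g. taking a corner at full speed and going off the road produces negative feedback
q_update(state=2, action=1, reward=-1.0, next_state=3)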

Conclusion

So this concludes our introduction to the basics of Machine Learning. We hope it provides you with some grounding as you try to get familiar with some of the more advanced concepts of Machine Learning. If you’re interested in Natural Language Processing and how Machine Learning is used in NLP specifically, keep an eye on our blog, as we’re going to cover how Machine Learning has been applied to the field. If you want to read some in-depth posts on Machine Learning, Deep Learning, and NLP, check out the research section of our blog.










Last Friday we witnessed the start of what has been one of the biggest worldwide cyber attacks in history, the WannaCry malware attack. While information security and hacking threats in general receive regular coverage in the news and media, we haven’t seen anything like the coverage around the WannaCry malware attack recently. Not since the Sony Playstation hack in 2011 have we seen as much media interest in a hacking event.

News outlets cover hacking stories quite frequently because of the threat these attacks pose to people. However, when we look at the news coverage over the course of the past 12 months in the graph below, we can see that triple the average monthly story volume on malware was produced in the first three days of the attack alone.

In this blog, we’ll use our News API to look at the media coverage of WannaCry before the news of the attack broke and afterwards, as details of the attack began to surface.

Monthly count of articles mentioning “malware” or “ransomware” over the last 12 months

By analyzing the news articles published about WannaCry and malware in general, with the help of some visualizations we’re going to look at three aspects:  

  • warning signs in the content published before the attack;  
  • how the story developed in the first days of the attack;
  • how the story spread across social media channels.

WannaCry

At 8am CET on Friday May 12th, the WannaCry attack began, and by that evening it had infected over 50,000 machines in 70 countries. By the following Monday, that had risen to 213,000 infections, paralyzing computer systems in hospitals, factories, and transport networks as well as personal devices. WannaCry is a ransomware virus – it encrypts all of the data on the computers it infects, and users only get their data decrypted after paying a ransom of $300 or $600 to the hackers. Users whose devices have been infected can only see the screen below until they have paid the ransom.

WannaCry Screen

Source: CNN Money

In the first six days after the attack, the hackers received over US$90,000 through more than 290 payments (you can track the payments made to the known Bitcoin wallets here via a useful Twitter bot created by @collinskeith), which isn’t a fantastic conversion rate considering they managed to infect over 200,000 computers. Perhaps if the hackers had done their market research they would have realized that their target audience – those still using Windows XP – are more likely to still write cheques than pay for things with Bitcoin.

The attack was enabled by tools that exploit security vulnerabilities in Windows called DoublePulsar and EternalBlue. These tools essentially allow someone to access every file on your computer by avoiding the security built into your operating system. The vulnerabilities were originally discovered by the National Security Agency (NSA) in the US, but were leaked by a hacker group called The Shadow Brokers in early April 2017.

The graph below, generated using the time series feature in our News API, shows how the coverage of ransomware and malware in articles developed over time. The Shadow Brokers’ dump in early April was reported on and certainly created a bit of noise, however it seems this was forgotten or overlooked by almost everyone until the attack itself was launched. The graph then shows the huge spike in news coverage once the WannaCry attack was launched.


Volume of articles mentioning “malware” or “ransomware” in April and May
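A sketch of pulling such a time series with the News API Python SDK might look like this (parameter names and query values are our assumptions – check the News API documentation for the exact interface):

import aylien_news_api

configuration = aylien_news_api.Configuration()
configuration.api_key['X-AYLIEN-NewsAPI-Application-ID'] = 'YOUR_APP_ID'
configuration.api_key['X-AYLIEN-NewsAPI-Application-Key'] = 'YOUR_APP_KEY'
api = aylien_news_api.DefaultApi(aylien_news_api.ApiClient(configuration))

# Daily counts of stories mentioning malware or ransomware over April and May
series = api.list_time_series(
    text='malware OR ransomware',
    published_at_start='2017-04-01T00:00:00Z',
    published_at_end='2017-05-18T00:00:00Z',
    period='+1DAY'
)
for point in series.time_series:
    print(point.published_at, point.count)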

Monitoring the Media for Warning Signs

Since WannaCry took the world by such surprise, we thought we’d dig into the news content in the weeks prior to the attack and see if we could find any signal in the noise that would have alerted us to a threat. Hindsight is 20/20, but an effective media monitoring strategy can give an in-depth insight into threats and crises as they emerge.

By simply creating a list of the hacking tools dumped online in early April and tracking mentions of these tools, we see definite warning signs. Of these 30 or so exploits, DoublePulsar and EternalBlue were the only ones mentioned again before the attack, and these ended up being the ones used to enable the WannaCry attack.

Mentions of each of the exploit tools dumped in April and May

 

We can then use the stories endpoint to collect the articles that contributed to the second spike in story volumes, around April 25th. Digging into these articles provides another clear warning: the articles collected cover reports by security analysts estimating that DoublePulsar had been installed on 183,000 machines since the dump ten days earlier (not too far off the over 200,000 machines WannaCry has infected). Although these reports were published in cybersecurity publications, news on the threat didn’t make it to mainstream media until the NHS was hacked and hospitals had to send patients home.


Story on the spread of DoublePulsar and EternalBlue in SC Magazine

Trends in the Coverage

As it emerged early on Friday morning that malware was spreading through personal computers, private companies and government organizations, media outlets broke the story to the world as they gained information. Using the trends endpoint of our News API, we decided it would be interesting to try and understand what organizations and companies were mentioned in the news alongside the WannaCry attack. Below you can see the most mentioned organisations that were extracted from news articles about the attack.

Organisations mentioned in WannaCry stories

The next thing we wanted to do was to try and understand how the story developed over time and to illustrate how the media focus shifted from “what,” to “how,” to “who” over a period of a few days.

The focus on Friday was on the immediate impact on the first targets, like the NHS and Telefonica, but as the weekend progressed the stories began to focus on the method of attack, with many mentions of Windows and Windows XP (the operating system that was particularly vulnerable). On Monday and Tuesday the media then turned their focus to who exactly was responsible, and as you can see from the visualization below, mentions of North Korea, Europol, and the NSA began to surface in the news stories collected. Take a look at the chart below to see how the coverage of the entities changed over time.

 

Mentions of organisations on WannaCry stories published from Friday to Tuesday

 

Most Shared Stories about WannaCry

The final aspect of the story as a whole we focused on was how news of the threat spread across different social channels. Using the stories endpoint, we can rank WannaCry stories by their share counts across social media to get an understanding into what people shared about WannaCry. We can see below that people were very interested in the young man who unintentionally found a way to prevent the malware from attacking the machines it installed itself on. This contrasts quite a bit with the type of sources and subject matter of the articles from before the attack began.

 

Facebook

  1. “The 22-year-old who saved the world from a malware virus has been named,” Business Insider. 33,800 shares.
  2. “‘Accidental hero’ finds kill switch to stop spread of ransomware cyber-attack,” MSN.com. 28,420 shares.
  3. “Massive ransomware attack hits 99 countries,” CNN. 13,651 shares.

 

LinkedIn

  1. “A Massive Ransomware ‘Explosion’ Is Hitting Targets All Over the World,” VICE Motherboard. 3,612 shares.
  2. “Massive ransomware attack hits 99 countries,” CNN. 2,963 shares.
  3. “Massive ransomware attack hits 74 countries,” CNN. 2,656 shares.

 

Reddit

  1. “‘Accidental hero’ finds kill switch to stop spread of ransomware cyber-attack,” MSN.com. 24,497 upvotes.
  2. “WannaCrypt ransomware: Microsoft issues emergency patch for Windows XP,” ZDNet. 4,454 upvotes.
  3. “Microsoft criticizes governments for stockpiling cyber weapons, says attack is ‘wake-up call’,” CNBC. 3,403 upvotes.

This was a quick analysis of the media reaction to the WannaCry attack using our News API. If you’d like to try it for yourself you can create your free account and start collecting and analyzing stories. Our News API is the most powerful way of searching, sourcing, and indexing news content from across the globe. We crawl and index thousands of news sources every day and analyze their content using our NLP-powered Text Analysis Engine to give you an enriched and flexible news data source.









At Aylien we are using recent advances in Artificial Intelligence to try to understand natural language. Part of what we do is building products such as our Text Analysis API and News API to help people extract meaning and insight from text. We are also a research lab, conducting research that we believe will make valuable contributions to the field of Artificial Intelligence, as well as driving further product development (take a look at the research section of our blog to see some of our interests).

We are delighted to announce a call for applications from academic researchers who would like to collaborate with our talented research team as part of the Science Foundation Ireland Industry Fellowship programme. For researchers, we feel that this is a great opportunity to work in industry with a team of talented scientists and engineers, with the resources and infrastructure to support your work.

We are especially interested in collaborating on work in these areas (but we’re open to suggestions):

  • Representation Learning
  • Domain Adaptation and Transfer Learning
  • Sentiment Analysis
  • Question Answering
  • Dialogue Systems
  • Entity and Relation Extraction
  • Topic Modeling
  • Document Classification
  • Taxonomy Inference
  • Document Summarization
  • Machine Translation

Details

A notable change from the previous requirements for the SFI Industry Fellowship programme is that you are now also eligible to apply if you hold  a PhD awarded by an Irish university (wherever you  are currently living). You can work with us under this programme if you are:

  • a faculty or postdoctoral researcher currently working in an eligible Irish Research Body;
  • a postdoctoral researcher having held a research contract in an eligible Irish Research Body; or
  • the holder of a PhD awarded by an Irish Research Body.

The project allows for an initial duration of 12 months full-time work, but this can be spread over up to 24 months at part-time, depending on what suits you. We also hope to continue collaboration afterwards. The total funding available from the SFI is €100,000.

The final application date is July 6th 2017, but as we will work with you on the application, we encourage you to get in touch as soon as possible if you are interested.

Must haves:

  • A faculty or postdoctoral position at an eligible Irish Research Body, a previous research contract in an eligible Irish Research Body, or a PhD awarded by an Irish Research Body (in the areas of science, engineering or mathematics).
  • Strong knowledge of at least one programming language, and general software engineering skills (experience with version control systems, debugging, testing, etc.).
  • The ability to grasp new concepts quickly.

Nice to haves:

  • An advanced knowledge of Artificial Intelligence and its subfields.
  • A solid background in Machine Learning.
  • Experience with the scientific Python stack: NumPy, SciPy, scikit-learn, TensorFlow, Theano, Keras, etc.
  • Experience with Deep Learning.
  • Good understanding of language and linguistics.
  • Experience with non-English NLP.

About the Research Team at AYLIEN

We are a team of research scientists and research engineers that includes both PhDs and PhD students, who work in an open and transparent way.

If you join us, you will get to work with a seasoned engineering team that will allow you to take successful algorithms from prototype into production. You will also have the opportunity to work on many technically challenging problems, publish at conferences, and contribute to the development of state-of-the-art Deep Learning and natural language understanding applications.

We work primarily on Deep Learning with TensorFlow and the rest of the scientific Python stack (NumPy, SciPy, scikit-learn, etc.).

How to apply

Please send your CV with a brief introduction to jobs@aylien.com, and be sure to include links to any code or publications that are publicly available.


Last month was full of unexpected high-profile publicity disasters, from passengers being dragged off planes to Kendall Jenner failing to solve political unrest.  For this month’s Monthly Media Roundup we decided to collect and analyze news stories related to three major events and try to understand the media reaction to each story, while also uncovering the impact this coverage had on the brands involved.

In the roundup of the month’s news, we’ll cover three major events:

  1. United Airlines’ mishandling of negative public sentiment cut their market value by $255 million.
  2. Pepsi’s ad capitalizing on social movements shows the limits of appealing to people’s consciousness for advertising.
  3. The firing of Bill O’Reilly shows how brands have become aware of the importance of online sentiment.

1: United Airlines

On Monday, April 10th, a video went viral showing a passenger being violently dragged off a United Airlines flight. On the same day, United CEO Oscar Munoz attempted to play down the controversy by defending staff and calling the passenger “disruptive and belligerent”. With investors balking at the tsunami of negative publicity that was compounded by Munoz, the following day United’s share price fell by over 1%, shaving $255 million off their market capitalization by the end of trading.
We collected relevant news articles published in April using a detailed search query with our News API. By analyzing the volume of articles we collected and the sentiment of each article, we were able to get a clear picture of how the media responded to the video and subsequent events:

Media Reaction to United Airlines Controversy

The volume of stories published shows how quickly the media jumped on the story (and also that Munoz’s statement compounded the issue), while the sentiment graph shows just how negative all that coverage was. The key point here is that the action United took in dealing with the wave of negative online sentiment – not listening to the customer – led to their stock tumbling. Investors predicted that ongoing negative sentiment on such a scale would lose potential customers, and began offloading shares in response.

Most shared stories about United in April

Facebook

  1. “United Airlines Stock Drops $1.4 Billion After Passenger-Removal Controversy” – Fortune, 57,075 shares
  2. “United Airlines says controversial flight was not overbooked; CEO apologizes again” – USA Today, 43,044 shares

LinkedIn

  1. “United Airlines Passenger Is Dragged From an Overbooked Flight” – The New York Times, 1,443 shares
  2. “When a ticket is not enough: United Airlines forcibly removes a man from an overbooked flight” – The Economist, 1,430 shares

Reddit

  1. “Simon the giant rabbit, destined to be world’s biggest, dies on United Airlines flight” – Fox News, 62,830 upvotes
  2. “Passengers film moment police drag man off a United Airlines plane” – Daily Mail, 25,142 upvotes

2: Pepsi

In contrast with United’s response, Pepsi’s quick reaction to online opinion paid off this month as they faced their own PR crisis. On April 3rd, Pepsi released an ad that was immediately panned for trying to incorporate social movements like Black Lives Matter into a soft drink commercial that prompted ridicule online.
After using our News API to collect every available article on this story and analyzing the sentiment of each article, we can get a picture of how the media reacted to the ad. This lets us see that on the day after the ad was launched, there were over three times more negative articles mentioning Pepsi than positive ones.
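As a rough sketch, that day-by-day split between negative and positive coverage can be pulled from the /time_series endpoint by filtering on title sentiment. The parameter values below are illustrative rather than our exact query, the helper function is our own, and the response handling assumes the documented time_series format:

```python
import os
import requests

HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": os.environ["NEWSAPI_APP_ID"],
    "X-AYLIEN-NewsAPI-Application-Key": os.environ["NEWSAPI_APP_KEY"],
}

def daily_volume(polarity):
    # One /time_series call per polarity; "period" buckets the counts by day.
    params = {
        "title": "Pepsi",
        "sentiment.title.polarity[]": polarity,
        "published_at.start": "2017-04-01T00:00:00Z",
        "published_at.end": "2017-04-30T23:59:59Z",
        "period": "+1DAY",
    }
    resp = requests.get("https://api.aylien.com/news/time_series",
                        headers=HEADERS, params=params)
    resp.raise_for_status()
    return resp.json()["time_series"]  # list of {published_at, count} points

negative = daily_volume("negative")
positive = daily_volume("positive")
print(negative[:3], positive[:3])
```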

Media Reaction to Pepsi’s Kendall Jenner Ad

As a company that spends $2.5 billion annually on advertising, Pepsi were predictably swift in their response to bad publicity, pulling the ad from broadcast just over 24 hours after it was first aired.

Even though this controversy involved a major celebrity, the Pepsi ad debacle was actually shared significantly less than the other PR disasters. By using our News API to rank the most shared articles across major social media platforms, we can see that the story gained far less traction than those covering the United scandal.

Most shared articles about Pepsi in April

Facebook

  1. “Twitter takes Pepsi to task over tone-deaf Kendall Jenner ad” – USA Today, 19,028 shares
  2. “Hey Pepsi, Here’s How It’s Done. Heineken Takes On Our Differences, and Nails It” – AdWeek, 16,465 shares

LinkedIn

  1. “Heineken Just Put Out The Antidote to That Pepsi Kendall Jenner Ad” – Fast Company, 1,833 shares
  2. “Pepsi Just Released An Ad That May Be One Of The Worst Ads Ever Made (And That’s Saying Something)” – Inc.com, 1,192 shares

Reddit

  1. “Pepsi ad review: A scene-by-scene dissection of possibly the worst commercial of all time” – Independent UK, 58 upvotes
  2. “Pepsi pulls Kendall Jenner advert amid outcry” – BBC, 58 upvotes

3: Fox Firing Bill O’Reilly

On April 1st, the New York Times published an article detailing numerous previously unknown sexual harassment cases brought against Bill O’Reilly. O’Reilly, who was Fox’s most popular host, drew an average of 3 million viewers to his prime-time slot. Though his ratings were unscathed (they actually rose), advertisers began pulling their ads from O’Reilly’s slot in response to the negative PR the host was receiving.
We sourced every available story about Bill O’Reilly published in April, and analyzed the sentiment of each article. Below we can see just how negative this coverage was over the course of the story.

Media Reaction to Bill O’Reilly Controversy

This was not the first time that O’Reilly had been accused of sexual harassment: a high-profile case was brought against him back in 2004. In both 2004 and April 2017, O’Reilly’s viewer ratings remained unhurt by the scandals. What is different in 2017 is that brands are far more aware of the “Voice of the Customer” – social media and online content representing the intentions of potential customers. This means negative coverage and trends like #DropOReilly have a considerable effect on brands’ marketing behaviour.



Most-mentioned Keywords in Articles about Bill O’Reilly in April

By analyzing the content from every article about Bill O’Reilly in April, we can rank the most frequently used Entities and Keywords across the collection of articles. Not surprisingly, our results show us that the coverage was dominated by the topic of sexual harassment and Fox News. But our analysis also uncovered other individuals and brands that were mentioned in news articles as being tied to the scandal. Brands like BMW and Mercedes took swift action to distance themselves from the backlash by announcing they were pulling advertising from O’Reilly’s show in an attempt to preempt any negative press.

Most shared articles about Bill O’Reilly in April

Facebook

  1. “Bill O’Reilly is officially out at Fox News” – The Washington Post, 63,341 shares
  2. “Bill O’Reilly Is Out At Fox News” – NPR, 50,895 shares

LinkedIn

  1. “Bill O’Reilly Out At Fox News” – Forbes, 861 shares
  2. “Fox Is Preparing to Cut Ties with Bill O’Reilly” – The Wall Street Journal, 608 shares

Reddit

  1. “Sources: Fox News Has Decided Bill O’Reilly Has to Go” – New York Magazine, 80,436 upvotes
  2. “Fox News drops Bill O’Reilly in wake of harassment allegations” – Fox News, 12,387 upvotes

We hope this post has given you an idea of how important media monitoring and content analysis are from a PR and branding point of view. Being able to collect and understand thousands of articles in a matter of minutes means you can quickly assess media reaction to PR crises as they unfold.

Ready to try the News API for yourself? Simply click the image below to sign up for a 14-day free trial.





News API - Sign up





2017 looks set to be a big year for us here on Ormond Quay – with AYLIEN in hyper-growth mode, we’ve added six new team members in the first four months of the year, and there are more to come. After this period of quick growth, we thought we’d take stock and introduce you to the newest recruits.

Say hello to our newest recruits!

Mahdi

Mahdi: NLP Research Engineer

Mahdi became an open-source contributor at age 16, working on Firefox Developer Tools and other projects you can find on his GitHub. At 18, he was hired as a full-stack developer to work on browser extensions and mobile apps. Mahdi just started at AYLIEN as a Natural Language Processing Research Engineer focusing on Deep Learning, while also working as a full-stack developer on our web apps. He blogs about programming (and life in general) on theread.me.

Mahdi is a serious outdoorsman who can be found hiking in the hills and practicing Primitive Living. He also loves learning languages and reading, which provides him with the raw material to fill our Slack loading messages with some supremely inspirational quotes!

Demian

Demian: NLP Research Intern

Demian comes from Braunschweig in Central Germany and completed a degree in Computational Linguistics at the University of Heidelberg. As part of his degree he studied NLP and Artificial Intelligence for information extraction, and he is already familiar with Dublin from an Erasmus year spent in Trinity College. Demian previously worked in the Forensic Department of PwC in Germany, and here at AYLIEN he is going to research document summarization and event extraction for our News API.

Besides being a proficient coder, Demian is an avid painter and reader, and can be found running in Dublin’s parks.

Sylver

Sylver: Data Management Intern

Growing up between Dublin and Seattle, Sylver swapped one rainy town with a thriving tech scene for another. She is currently studying Legal Practice and Procedures, and before starting with us here at AYLIEN she was an editor of everything from novels to academic papers. Here at AYLIEN Sylver works on maintaining and managing our datasets and models.

A previous owner of 10 snakes (at the same time), Sylver spends her spare time caring for exotic pets, and is interested in reading, alternative modelling, and fitness.

Hosein

Hosein: Web Designer

Hosein is a creative designer with three years’ experience in UI design and front-end development, having previously worked with other startups and tech companies. A newcomer to NLP, Hosein is designing the AYLIEN website and web apps, and is also developing our front-ends.

While he’s away from his laptop, Hosein is usually out taking photographs and finding out more about cameras and photography.

Erfan

Erfan: NLP Research Engineer

Erfan holds a Bachelor’s Degree in Software Engineering. He has been researching computer vision for three years, and you can read about his research on his blog. For his thesis, he used Deep Neural Nets to study joint embeddings of images and text, and at AYLIEN he is going to work on memory-augmented neural nets, focusing on question answering.

Will

Will: Content Marketing Intern

From the comparatively less exotic background of Dublin, Will is a Classics graduate who completed a Master’s in Digital Humanities at Trinity College, where he was introduced to NLP when he tried to write some code to index where authors use Latin words across English Literature. At AYLIEN, he is joining the Sales, Marketing, and Customer Success team to bolster our content creation and distribution efforts, and is even writing this exact sentence at this very moment in time.

Outside of AYLIEN, Will is an avid reader and learner of languages, and when he’s outside, he can be found running or hiking.

Come work with us!

So that sums up our new recruits – a pretty diverse group who all gravitated towards languages and programming. If you think you’d like to join us, take a look at aylien.com/jobs, email us at jobs@aylien.com, or call in for a fresh cup of coffee. We’re always interested in talking to anyone working on or studying NLP, Computational Linguistics or Machine Learning.





Introduction

In this post, AYLIEN NLP Research Intern, Mahdi, talks us through a quick experiment he performed on the back of reading an interesting paper on evolution strategies, by Tim Salimans, Jonathan Ho, Xi Chen and Ilya Sutskever.

Having recently read Evolution Strategies as a Scalable Alternative to Reinforcement Learning, Mahdi wanted to run an experiment of his own using Evolution Strategies. Flappy Bird has always been among Mahdi’s favorites when it comes to game experiments – a simple yet challenging game – so he decided to put theory into practice.

Training Process

The model is trained using Evolution Strategies, which in simple terms works like this (a minimal sketch of the update follows the list below):

  1. Create a random, initial brain for the bird (this is the neural network, with 300 neurons in our case)
  2. At every epoch, create a batch of modifications to the bird’s brain (also called “mutations”)
  3. Play the game using each modified brain and calculate the final reward
  4. Update the brain by pushing it towards the mutated brains, proportionate to their relative success in the batch (the more reward a brain has been able to collect during a game, the more it contributes to the update)
  5. Repeat steps 2-4 until a local maximum for rewards is reached.
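To make steps 2-4 concrete, here is a minimal sketch of the basic Evolution Strategies loop. It illustrates the technique rather than reproducing the flappy-es code: play_game is a stand-in for running one episode of the game and returning the accumulated reward, and the population size, noise scale and learning rate are assumptions.

```python
import numpy as np

n_params = 300    # size of the bird's "brain" (network weights)
population = 50   # number of mutated brains per epoch (assumed)
sigma = 0.1       # mutation strength (assumed)
alpha = 0.01      # learning rate (assumed)


def play_game(params):
    # Stand-in for one episode of Flappy Bird: in the real experiment this
    # would run the game with a network parameterised by `params` and return
    # the accumulated reward. A dummy objective is used here so the loop runs.
    return -np.sum((params - 1.0) ** 2)


theta = np.random.randn(n_params)          # step 1: random initial brain

for epoch in range(200):
    # step 2: a batch of mutations of the current brain
    noise = np.random.randn(population, n_params)
    # step 3: play the game with each mutated brain and record its reward
    rewards = np.array([play_game(theta + sigma * eps) for eps in noise])
    # step 4: move theta towards the mutations in proportion to their
    # (normalised) rewards
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    theta += alpha / (population * sigma) * noise.T @ advantages
```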

At the beginning of training, the bird usually either drops too low or jumps too high and hits one of the boundary walls, losing immediately with a score of zero. To avoid scores of zero in training – which would mean there is no measure of success to compare brains by – Mahdi set a small reward of 0.1 for every frame the bird stays alive. This way the bird first learns to avoid dying. He then set a reward of 10 for passing each wall, so the bird tries not only to stay alive, but to pass as many walls as possible.
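Inside the game loop, that reward shaping might look something like the snippet below (the values come from the description above; the function name and signature are our own):

```python
def frame_reward(survived_frame: bool, passed_wall: bool) -> float:
    # 0.1 for every frame the bird stays alive, plus 10 for each wall passed.
    reward = 0.1 if survived_frame else 0.0
    if passed_wall:
        reward += 10.0
    return reward
```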

The training process is quite fast as there is no need for backpropagation, and it is not very costly in terms of memory either, since there is no need to record actions as there is in policy gradients.

The model learns to play pretty well after 3000 epochs; however, it is not completely flawless and it still loses occasionally in difficult cases, such as when there is a large difference in height between two consecutive wall openings.

Here is a demonstration of the model after 3000 epochs (~5 minutes on an Intel(R) Core(TM) i7-4770HQ CPU @ 2.20GHz):


Use the controls to set speed level or to restart

Web version

For ease of access, Mahdi has created a web version of the experiment which can be accessed here.

Try it yourself

Note: You need Python 3 and pip to install and run the code.

First, download or clone the repository:

git clone https://github.com/mdibaiee/flappy-es.git


Next, install dependencies (you may want to create a virtualenv):

pip install -r requirements

The pretrained parameters are in a file named load.npy and will be loaded when you run train.py or demo.py

train.py will train the model, saving the parameters to saves/<TIMESTAMP>/save-<ITERATION>.

demo.py shows the game in a GTK window so you can see how the AI actually plays.


play.py lets you play the game yourself: press space to jump and, once you lose, press enter to play again.

Notes

It seems that training past a certain point leads to a reduction in performance. Learning rate decay might help with this. Mahdi’s interpretation is that after finding a local maximum for accumulated reward and being able to receive high rewards, the updates become quite large and pull the model too far in different directions, so the model enters a state of oscillation.
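One simple way to try that – a suggestion of ours, not something the repository implements – is an exponential decay schedule for the learning rate:

```python
initial_alpha = 0.01   # starting learning rate (illustrative)
decay = 0.995          # per-epoch decay factor (illustrative)

def alpha_at(epoch: int) -> float:
    # Shrink the learning rate as training progresses so that late updates
    # become smaller and are less likely to oscillate around the optimum.
    return initial_alpha * decay ** epoch
```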

To see this yourself, there is a long.npy file: rename it to load.npy (back up load.npy before doing so) and run demo.py – you will see the bird failing more often than not. long.npy was trained for only 100 more epochs than load.npy.


Introduction

Welcome to the second installment in our series of monthly posts where we’ll be showcasing our News API by looking back at online news stories, articles and blog posts to uncover emerging insights and trends from topical categories.

For our February review, we looked at three IAB categories: Arts & Entertainment, Science and Politics.

For March, we’ve decided to narrow our focus a little further by looking at IAB subcategories to give you an idea of just how specific and granular you can be when sourcing and analyzing content through the News API. With this in mind, we’ve gone with the following three subcategories:

  1. Cell phones (subcategory of Tech & Computing)
  2. Boxing (subcategory of Sports)
  3. Stocks (subcategory of Personal Finance)

and for each subcategory we have performed the following analysis:

  • Publication volumes over time
  • Top stories
  • Most mentioned topics
  • Most shared stories on social media

Try it yourself

We’ve included code snippets for each of the analyses above so you can follow along or modify to create your own search queries.

If you haven’t already signed up to our News API, you can do so here with a free 14-day trial.

1. Cell phones

The graph below shows publication volumes in the Cell phones subcategory throughout the month of March 2017.

Note: All visualizations are interactive. Simply hover your cursor over each to explore the various data points and information.

Volume of stories published: Cell phones

From the graph above we can see a number of spikes indicating sharp rises in publication volumes. Let’s take a look at the top three:

Top stories

The three stories that contributed to the biggest spikes in news publication volumes:

  1. Samsung release their latest flagship phone, the Galaxy S8.
  2. The UK introduces a loss-of-license punishment for new drivers caught using their cell phones while driving.
  3. HTC reveal a limited edition version of their U Ultra smart phone.

It will perhaps come as no surprise to see one of the world’s top smartphone manufacturers, Samsung, getting the most media attention with the launch of their latest flagship model. In comparison, rivals HTC failed to generate the same level of hype around their latest model. However, by releasing a teaser about a surprise product release on March 15 they still managed to generate two of the top four publication volume spikes within the cell phone category in March.

Try it yourself – here’s the query we used for volume by category
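For reference, a minimal sketch along those lines looks like this, using the /time_series endpoint. The taxonomy and date parameters follow the public docs, but the IAB ID shown is our assumption for the Cell phones subcategory, so check the taxonomy reference before running it:

```python
import os
import requests

HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": os.environ["NEWSAPI_APP_ID"],
    "X-AYLIEN-NewsAPI-Application-Key": os.environ["NEWSAPI_APP_KEY"],
}

# Daily publication volumes for an IAB subcategory during March 2017.
params = {
    "categories.taxonomy": "iab-qag",
    "categories.id[]": "IAB19-6",   # assumed ID for Cell phones
    "published_at.start": "2017-03-01T00:00:00Z",
    "published_at.end": "2017-03-31T23:59:59Z",
    "period": "+1DAY",
}

resp = requests.get("https://api.aylien.com/news/time_series",
                    headers=HEADERS, params=params)
resp.raise_for_status()

for point in resp.json()["time_series"]:
    print(point["published_at"], point["count"])
```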

Read more: We looked at Samsung’s recent exploding battery crisis to highlight how news content can be analyzed to track the voice of the customer in relation to crisis prevention and damage limitation.

Most mentioned topics

From the 7,000+ articles we sourced from the Cell phones category in March, we looked at the most mentioned topics:

Try it yourself – here’s the query we used for most mentioned topics
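A sketch of that kind of topics query, using the /trends endpoint with the keywords field (the IAB ID is again an assumption, and the results are sorted client-side for clarity):

```python
import os
import requests

HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": os.environ["NEWSAPI_APP_ID"],
    "X-AYLIEN-NewsAPI-Application-Key": os.environ["NEWSAPI_APP_KEY"],
}

# Most mentioned keywords across the Cell phones subcategory in March 2017.
params = {
    "categories.taxonomy": "iab-qag",
    "categories.id[]": "IAB19-6",   # assumed ID for Cell phones
    "published_at.start": "2017-03-01T00:00:00Z",
    "published_at.end": "2017-03-31T23:59:59Z",
    "field": "keywords",
}

resp = requests.get("https://api.aylien.com/news/trends",
                    headers=HEADERS, params=params)
resp.raise_for_status()

trends = sorted(resp.json()["trends"], key=lambda t: t["count"], reverse=True)
for trend in trends[:20]:
    print(trend["value"], trend["count"])
```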

Most shared on social media

What were the most shared stories on social media? We analyzed share counts from Facebook, LinkedIn and Reddit to see what type of content is performing best on each channel.

Facebook

  1. Man dies charging iPhone while in the bath (BBC. 26,072 shares)
  2. US bans electronic devices on flights from eight Muslim countries (The Independent. 25,886 shares)

LinkedIn

  1. Samsung tries to reclaim its reputation with the Galaxy S8 (Washington Post. 890 shares)
  2. It’s Possible to Hack a Phone With Sound Waves, Researchers Show (NY Times. 814 shares)

Reddit

  1. Samsung confirms the Note 7 is coming back as a refurbished device (The Verge. 7,193 votes)
  2. The Galaxy S8 will be Samsung’s biggest test ever (The Verge. 4,981 votes)

Try it yourself – here’s the query we used for social shares
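A sketch of the social shares query: we sort stories in the subcategory by share count on a given network via the sort_by parameter (the value shown follows the public docs; swap facebook for linkedin or reddit to cover the other channels):

```python
import os
import requests

HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": os.environ["NEWSAPI_APP_ID"],
    "X-AYLIEN-NewsAPI-Application-Key": os.environ["NEWSAPI_APP_KEY"],
}

# Top Cell phones stories of March 2017 by Facebook share count.
params = {
    "categories.taxonomy": "iab-qag",
    "categories.id[]": "IAB19-6",   # assumed ID for Cell phones
    "published_at.start": "2017-03-01T00:00:00Z",
    "published_at.end": "2017-03-31T23:59:59Z",
    "sort_by": "social_shares_count.facebook",
    "per_page": 5,
}

resp = requests.get("https://api.aylien.com/news/stories",
                    headers=HEADERS, params=params)
resp.raise_for_status()

# Print title and source name (assuming the documented story payload).
for story in resp.json()["stories"]:
    print(story["title"], "–", story["source"]["name"])
```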

2. Boxing

We sourced a total of 9,000+ articles categorized under Boxing and found that what goes on outside the ring can garner just as much (if not more) media interest than what happens in it.

Volume of stories published: Boxing

Top stories

The three stories that contributed to the biggest spikes in news publication volumes:

  1. Heavyweight bout between David Haye and Tony Bellew.
  2. Floyd Mayweather urges the UFC to allow him and Conor McGregor to fight.
  3. Middleweight bout between Gennady Golovkin and Daniel Jacobs.

The two biggest fights in world boxing during the month of March are clearly represented by publication spikes in the chart above, particularly the heavyweight clash between Haye and Bellew. However, and as we mentioned, it’s not all about what happens in the ring.

The second largest spike we see above was the result of Floyd Mayweather, who hasn’t fought since September 2015, pleading with the UFC to allow a ‘superfight’ with Conor McGregor to go ahead. Neither Mayweather nor McGregor has competed recently, nor have they any future fights scheduled, yet they still find themselves as the two most discussed individuals in this category. The bubble chart below showing the most mentioned topics from the boxing category further highlights this.

Most mentioned topics

Most shared on social media

Facebook

  1. Floyd Mayweather ‘officially out of retirement for Conor McGregor’ fight (FOX Sports. 56,951 shares)
  2. Bad refs, greedy NBF officials frustrating boxers – Apochi (Punchng. 42,367 shares)

LinkedIn

  1. David Haye has Achilles surgery after Tony Bellew defeat (BBC. 234 shares)
  2. David Haye rules out retirement as he targets Tony Bellew rematch (BBC. 130 shares)

Reddit

  1. Teenage kickboxer dies after Leeds title fight (BBC. 1,502 shares)
  2. Muhammad Ali family vows to fight Trump’s ‘Muslim ban’ after airport detention (The Independent. 1,147 shares)

3. Stocks

The graph below shows publication volumes in the Stocks subcategory throughout the month of March 2017. In total we collected just over 30,000 articles.

Volume of stories published: Stocks

Top stories

The three stories that contributed to the biggest spikes in news publication volumes:

  1. Retailer Target sees stock drop by 13.5% after consumers boycott their pro-transgender stance.
  2. The US Federal Reserve increases interest rates, adding further pressure to the housing market.
  3. Oil drops below US$53 as a report shows rising US crude stockpiles.

Most mentioned locations

Rather than focusing solely on extracted topics for this category, we thought it would be interesting to separate mentions of both locations and organizations. The chart above shows the most mentioned locations from all 30,000 articles published under the Stocks subcategory in March:

Most mentioned organizations

The chart above shows the top mentioned organizations, including well-known banks, investment firms and sources. It is interesting to see the likes of Facebook, Twitter and Snapchat in the mix also.

In March we saw Barclays declare Facebook “the stock to own for the golden age of mobile”, referring to the upcoming 3-5 year period. Earlier in the month, Snapchat closed their first day of public trading up 44% at $24.48 a share.

Most shared on social media

Facebook

  1. Trump’s Approval Rating Hits New Record Low (Slate. 39,582 shares)
  2. Target Retailer Hits $15 Billion Loss Since Pro-Transgender Announcement (Breitbart. 30,107 shares)

LinkedIn

  1. How on earth did India come up with these GDP numbers? (QZ. 2,579 shares)
  2. Home Prices in 20 U.S. Cities Rise at Fastest Pace Since 2014 (Bloomberg. 1,601 shares)

Reddit

  1. Bernie Sanders and Planned Parenthood are the most popular things in America, Fox News finds (The Week. 28,075 votes)
  2. GameStop Is Going to Close at Least 150 Stores (Fortune. 4,982 votes)

Conclusion

We hope that this post has given you an idea of the kind of in-depth and precise analyses that our News API users are performing to source and analyze specific news content that is of interest to them.

Ready to try the News API for yourself? Simply click the image below to sign up for a 14-day free trial.





News API - Sign up



