
Breakthroughs in NLP research are creating huge value for people every day, supercharging technologies from search engines to chatbots. The work that makes these breakthroughs possible is done in two silos – academia and industry. Researchers in both of these silos produce work that advances the field, and frequently collaborate to generate innovative research.

Contributing to this research is why we have such a heavy R&D focus at AYLIEN, with six full-time research scientists out of a total team of 16. The research team naturally has strong ties with academia – some are completing PhDs with the work they are carrying out here, while others already hold one. Academia is also represented on our advisory board and in the great people who have become our mentors.

To further deepen these ties with academia, we’re delighted to announce our first Industry Fellowship in association with Science Foundation Ireland, with Dr. Ian Wood of NUIG. Ian will be based in our Dublin office for one year starting in September. SFI’s goal with this fellowship is to allow industry and academia to cross-pollinate by exchanging ideas and collaborating on research. This placement will allow us to contribute to and learn from the fantastic work that the Insight Centre in NUIG is doing, and we’re really excited to open up some new research windows where our team’s and Ian’s interests overlap.

Ian is a postdoctoral researcher at the Insight Centre for Data Analytics, with an incredibly interesting background – a mixture of pure Mathematics, Psychology, and Deep Learning. His research focuses on how the emotions of entire communities change over time, which he studies by creating language models that detect the emotions people express on social media. For his PhD, he analyzed Tweets produced by pro-anorexia communities over three years, tracking their emotions, and showed that an individual’s actions are much more driven by their surrounding community than is generally accepted. Continuing this research, Ian now specializes in finding new ways to build Machine Learning and Deep Learning models to analyze emotions in online communities.

Ian’s placement is mutually beneficial on two levels. First, Ian’s experience in building language models for emotion analysis is obviously beneficial to us, and we can offer Ian a cutting edge research infrastructure and the opportunity to learn from our team in turn. But we’re also really excited at the possibility of opening up new research areas based on common interests, for example by building on existing research between Ian and our PhD student, Sebastian. Ian’s research into reducing dimensionality in data sets crosses over with Sebastian’s work into Domain Adaptation in a really interesting way, and we’re excited that this could open up a new research area for us to work on.

Outside of AYLIEN, Ian also speaks four languages, he was a professional musician (but that was in a previous life, he tells us), and he’s also sailed across the Atlantic in a small boat, so he’ll hopefully have some input into the next AYLIEN team-building exercises…

Welcome to the team, Ian!

If you want to find out more about the Fellowship, check out the LinkedIn group, and if your research interests overlap with ours in any way, drop us a line at jobs@aylien.com – we love hearing from other researchers!

In 2017, video content is becoming ever more central to how people consume media. According to research by HighQ, this year around 30% of smartphone users will watch video content on their device at least once a day. In addition to this, people will spend on average an extra two minutes browsing sites that feature video content compared with sites that do not. For this reason, video content is an important component of driving up revenues for online news publishers, since keeping your audience on your site allows you to sell more ads.

But even though we can find great market research on consumer behavior around video content, we couldn’t find an answer to the following question — what type of video content is the news industry publishing to capitalize on this? For example, how much video content is actually being published? Are some publishers dominating video content? And are some subjects being supplemented with videos more regularly than others? Knowing this would allow us to understand what areas of the online news industry are set to flourish in the coming years with the growing emphasis on video.

We decided to use the News API to look into this question. Last month, our API crawled, analyzed, and indexed 1,344,947 stories as they were published. One of the metadata points that it analyzed was how many images and videos were embedded on the page. So for this blog, we’ll analyze the 1.3 million stories our News API gathered in July to find answers to the following questions:

1. How many of the stories published last month featured video content?
2. What were the stories with video content about?
3. Which news organizations published the most video content?

## 1. How many stories published last month contained video content?

To get an idea of how far the video medium has spread into the online news industry, we need to find how much video content was used by news publishers last month. To do this, we used the News API’s Time Series endpoint to sort the stories published in July according to how many videos they contained. We then visualized the results to show how many stories contained no videos, how many contained one video, and how many contained more than one. Take a look below at what we found:

As you can see, 96% of stories published last month did not contain any video content, whereas just under 4% contained one video or more. We found this interesting — while HighQ found that almost 30% of smartphone users will watch video content online at least once per day, we can see here that barely 3.5% of news content published last month contained a video. This isn’t really optimal for an industry that relies on clicks for ad revenue.
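The breakdown above is easy to reproduce once you have per-story video counts. Here is a minimal sketch, with hypothetical story records standing in for what the Time Series endpoint actually returned (the 96/3/1 split is invented for illustration):

```python
from collections import Counter

## Hypothetical story records; each story carries a count of embedded videos
stories = [{"videos": 0}] * 96 + [{"videos": 1}] * 3 + [{"videos": 2}]

## Bucket stories by whether they contain no videos, one, or several
buckets = Counter(
    "none" if s["videos"] == 0 else "one" if s["videos"] == 1 else "multiple"
    for s in stories
)

for bucket in ("none", "one", "multiple"):
    print(f"{bucket}: {100 * buckets[bucket] / len(stories):.0f}%")
```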

But let’s focus on the news stories that contained video content. If we knew what these stories were about, we would have a good idea of which areas of online news are likely to fare well, since these areas likely account for a large proportion of ad revenue and are therefore likely to grow.

## 2. What were the stories containing video about?

Knowing that only around one out of every thirty stories contained video content last month is interesting, but it raises the question of what these stories were about. To answer this question, we used the Trends endpoint to analyze the 43,134 stories that contained one video and see what subjects each one was about.

One of the pieces of information our News API extracts is topics that are discussed in the story, and which categories the story fits into, based on two taxonomies. For this visualization, we’ll use the advertising industry’s IAB-QAG taxonomy. Take a look below at which categories contained the most video content:

You can see that the Entertainment category had the most stories accompanied by video content. This didn’t surprise us at first, as we have all seen articles about celebrities with annoying videos that play automatically. But if you remember last month’s media roundup, you’ll remember that the Sports and the Law, Government, and Politics categories produced by far the highest volumes of content (the Sports category alone published over double the content of the Entertainment category). This means that not only are there more videos about entertainment, but also that a story about entertainment is much more likely to contain a video than a story about politics.
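That last comparison is about rates, not raw counts. Here is a toy calculation with invented volumes (the real figures come from the News API) showing how a category with fewer total stories can still lead on video share:

```python
## Hypothetical story volumes and video counts per category
volumes = {"Entertainment": 40_000, "Sports": 90_000}
with_video = {"Entertainment": 6_000, "Sports": 3_000}

## Share of each category's stories that carry a video
rates = {cat: 100 * with_video[cat] / volumes[cat] for cat in volumes}

for cat, rate in rates.items():
    print(f"{cat}: {rate:.1f}% of stories include video")
```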

So now we know which subject categories video content appeared in the most. But with the News API, we can go one step further and see exactly what people were talking about in the stories that contained a video. To do this, we used the Trends endpoint again to extract the entities mentioned in the titles of these stories. Take a look at the chart below to see what people were talking about:

Here you can see exactly what the stories containing videos were about. The single biggest subject that was accompanied by a video was Love Island, a reality TV show. But you can also see that large soccer clubs are well represented on the chart. If you think back to last month’s roundup again, you’ll remember the huge reach and popularity of the top soccer clubs, even during their off-season. The chart above shows that these large soccer clubs are also being covered more with video content than other entities, with publishers obviously trying to leverage this reach to attract people to the stories they publish.

With large soccer clubs dominating both regular news content and video news content, and with ad revenues for video content being so valuable, these soccer clubs look like they have a bright future in terms of media content. Since the clubs benefit financially from media coverage through things like player image rights and viewership of games, large transfer fees like the \$263 million PSG are going to pay for Neymar don’t look so crazy.

## 3. Who were the biggest publishers of video content?

As we mentioned in the introduction, we want to find out which publishers are making the quickest transition to video-based content, as this has a knock-on effect on site viewership, and therefore ad revenues. Knowing which players are leading industry trends like this is a good indicator of which ones are going to survive in an industry that is under financial pressure while transitioning to digital.

With that in mind, we used the Trends endpoint to find out which publishers were leading the way in video content. You can see pretty clearly from the graph below that the Daily Mail dominates last month’s video content. To see the rest of the publishers more clearly, you can select the Daily Mail bubble below and click “exclude”.

The Daily Mail obviously dominate the chart here, which isn’t too surprising when you consider that they feature video as a central part of the content on their site. They produce a huge number of stories every month, and feature video even when it isn’t completely related to the story it appears with. Although the discontinuity can seem odd, even a loosely related video can increase click-through rate and revenues.

As you can see, many traditional news publishers are lagging behind in the amount of video they’re publishing, with The Guardian, Forbes, ABC, and The Daily Mail among the few recognizable print and television giants on the graph. Instead, the field is largely made up of publishers like Elite Daily, Uproxx, and Heavy, digital-native organizations that publish more online video content than most traditional publishers.

Well, that concludes our brief analysis of last month’s video content in news stories. If you’re an AYLIEN subscriber, we’d like to remind you that the two endpoints we used in this post (Trends and Time Series) do not return stories, so you can hit them as much as you like and they won’t contribute towards your monthly 10,000 stories. So dig in!

If you’re not a subscriber, you can try the News API free of charge for two weeks by clicking on the image below (free means free, there’s no card required or obligation to buy).

In Machine Learning, the traditional assumption is that the data our model is applied to is the same as the data we used for training, where “the same” means that it comes from the same distribution. This assumption breaks down as soon as we move into the real world: many of the data sources we encounter will be very different from our original training data. In practice, this causes the performance of our model to deteriorate significantly.

Domain adaptation is a prominent approach to transfer learning that can help to bridge this discrepancy between the training and test data. Domain adaptation methods typically seek to identify features that are shared between the domains or learn representations that are general enough to be useful for both domains. In this blog post, I will discuss the motivation for, and the findings of, the recent paper that I published with Barbara Plank. In it, we outline a complementary approach to domain adaptation – rather than learning a model that can adapt between the domains, we learn to select data that is useful for training our model.

### Preventing Negative Transfer

The main motivation behind selecting data for transfer learning is to prevent negative transfer. Negative transfer occurs if the information from our source training data is not only unhelpful but actually counter-productive for doing well on our target domain. The classic example for negative transfer comes from sentiment analysis: if we train a model to predict the sentiment of book reviews, we can expect the model to do well on domains that are similar to book reviews. Transferring a model trained on book reviews to reviews of electronics, however, results in negative transfer, as many of the terms our model learned to associate with a certain sentiment for books, e.g. “page-turner”, “gripping”, or — worse — “dangerous” and “electrifying”, will be meaningless or have different connotations for electronics reviews.

In the classic scenario of adapting from one source to one target domain, the only thing we can do about this is to create a model that is capable of disentangling these shifts in meaning. However, adapting between two very dissimilar domains still often fails or leads to painfully poor performance.

In the real world, we typically have access to multiple data sources. In this case, we can choose to train our model only on the data that is most helpful for our target domain. It is unclear, however, what the best way is to determine the helpfulness of source data with respect to a target domain. Existing work generally relies on measures of similarity between the source and the target domain.
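As a concrete example of such a similarity measure, here is a sketch of Jensen-Shannon divergence between the term distributions of two toy domains (lower divergence means more similar domains). The example documents are invented for illustration:

```python
import numpy as np
from collections import Counter

def term_dist(docs, vocab):
    """Relative frequency of each vocabulary term across a list of documents."""
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts[w] for w in vocab) or 1
    return np.array([counts[w] / total for w in vocab])

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence in bits: 0 = identical, 1 = disjoint."""
    p, q = p + eps, q + eps
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

books = ["a gripping page turner", "a gripping read"]
electronics = ["battery life is great", "great battery"]
vocab = sorted({w for d in books + electronics for w in d.split()})

dist = js_divergence(term_dist(books, vocab), term_dist(electronics, vocab))
print(round(dist, 3))
```

With no shared vocabulary between the two toy domains, the divergence comes out at its maximum, signalling that book-review data would be a poor training source for the electronics domain.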

### Bayesian Optimization for Data Selection

Our hypothesis is that the best way to select training data for transfer learning depends on the task and the target domain. In addition, while existing measures only consider data in relation to the target domain, we also argue that some training examples are inherently more helpful than others.

For these reasons, we propose to learn a data selection measure for transfer learning. We do this using Bayesian Optimization, a framework that has been used successfully to optimize hyperparameters in neural networks and which can be used to optimize any black-box function. We learn this function by defining several features relating to the similarity of the training data to the target domain as well as to its diversity. Over the course of several iterations, the data selection model then learns the importance of each of those features for the relevant task.
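To give a flavour of the setup, here is a toy sketch of black-box data selection. It uses random search in place of Bayesian Optimization (both only require function evaluations of the black box), and the similarity/diversity features and "usefulness" scores are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

## Toy setup: 200 source examples, each described by two features of the
## kind discussed above (similarity to the target domain, and diversity)
n = 200
sim = rng.random(n)
div = rng.random(n)

## Hidden "usefulness" of each example; the selector doesn't know that
## similarity matters more than diversity in this toy world
usefulness = 0.7 * sim + 0.3 * div + 0.05 * rng.standard_normal(n)

def downstream_score(weights, k=50):
    """Rank examples by the weighted feature combination, keep the top k,
    and return their mean usefulness (a stand-in for the accuracy of a
    model trained on the selected data)."""
    scores = weights[0] * sim + weights[1] * div
    top = np.argsort(scores)[-k:]
    return usefulness[top].mean()

## Random search over feature weights as a stand-in for Bayesian
## Optimization: both treat downstream_score as a black box
best_w, best_s = None, -np.inf
for _ in range(100):
    w = rng.random(2)
    s = downstream_score(w)
    if s > best_s:
        best_w, best_s = w, s

## Compare against selecting 50 examples uniformly at random
baseline = usefulness[rng.choice(n, 50, replace=False)].mean()
print(f"random selection: {baseline:.3f}  learned selection: {best_s:.3f}")
```

The learned weighting reliably beats random selection in this toy world; Bayesian Optimization plays the same role in the paper but chooses each candidate weighting more cleverly than random search does.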

### Evaluation & Conclusion

We evaluate our approach on three tasks (sentiment analysis, part-of-speech tagging, and dependency parsing) and compare it to random selection as well as existing methods that select either the most similar source domain or the most similar training examples.

For sentiment analysis on reviews, training on the most similar domain is a strong baseline as review categories are clearly delimited. We significantly improve upon this baseline and demonstrate that diversity complements similarity. We even achieve performance competitive with a state-of-the-art domain adaptation approach, despite not performing any adaptation.

We observe smaller but consistent improvements for part-of-speech tagging and dependency parsing. Lastly, we evaluate how well learned measures transfer across models, tasks, and domains. We find that a data selection measure can be learned with a simpler model, which then serves as a proxy for a state-of-the-art model. Transfer across domains is robust, while transfer across tasks holds — as one would expect — for related tasks such as POS tagging and parsing, but fails for dissimilar tasks, e.g. parsing and sentiment analysis.

In the paper, we demonstrate the importance of selecting relevant data for transfer learning. We show that taking into account task and domain-specific characteristics and learning an appropriate data selection measure outperforms off-the-shelf metrics. We find that diversity complements similarity in selecting appropriate training data and that learned measures can be transferred robustly across models, domains, and tasks.

This work will be presented at the 2017 Conference on Empirical Methods in Natural Language Processing. More details can be found in the paper here.

Every day, we generate huge amounts of text online, creating vast quantities of data about what is happening in the world and what people think. All of this text data is an invaluable resource that can be mined in order to generate meaningful business insights for analysts and organizations. However, analyzing all of this content isn’t easy, since converting text produced by people into structured information that a machine can analyze is a complex task. In recent years, though, Natural Language Processing and Text Mining have become a lot more accessible to data scientists, analysts, and developers alike.

There are massive amounts of resources, code libraries, services, and APIs out there that can help you embark on your first NLP project. For this how-to post, we thought we’d put together a three-step, end-to-end guide to your first introductory NLP project. We’ll start from scratch by showing you how to build a corpus of language data and how to analyze this text, and then we’ll finish by visualizing the results.

We’ve split this post into 3 steps. Each of these steps will do two things: show a core task that will get you familiar with NLP basics, and also introduce you to some common APIs and code libraries for each of the tasks. The tasks we’ve selected are:

1. Building a corpus — using Tweepy to gather sample text data from Twitter’s API.
2. Analyzing text — analyzing the sentiment of a piece of text with our own SDK.
3. Visualizing results — how to use Pandas and matplotlib to see the results of your work.

Please note: This guide is aimed at developers who are new to NLP and anyone with a basic knowledge of how to run a script in Python. If you don’t want to write code, take a look at the blog posts we’ve put together on how to use our RapidMiner extension or our Google Sheets Add-on to analyze text.

## Step 1. Build a Corpus

You can build your corpus from anywhere — maybe you have a large collection of emails you want to analyze, a collection of customer feedback in NPS surveys that you want to dive into, or maybe you want to focus on the voice of your customers online. There are lots of options open to you, but for the purpose of this post we’re going to use Twitter as our focus for building a corpus. Twitter is a very useful source of textual content: it’s easily accessible, it’s public, and it offers an insight into a huge volume of text that contains public opinion.

Accessing the Twitter Search API using Python is pretty easy. There are lots of libraries available, but our favourite option is Tweepy. In this step, we’re going to use Tweepy to ask the Twitter API for 500 of the most recent Tweets that contain our search term, and then we’ll write the Tweets to a text file, with each Tweet on its own line. This will make it easy for us to analyze each Tweet separately in the next step.

You can install Tweepy using pip:

```shell
pip install tweepy
```

Once completed, open a Python shell to double-check that it’s been installed correctly:

```python
>>> import tweepy
```


```python
## import the libraries
import tweepy, codecs

## fill in your Twitter credentials
consumer_key = 'your consumer key here'
consumer_secret = 'your consumer secret key here'
access_token = 'your access token here'
access_token_secret = 'your access token secret here'

## let Tweepy set up an instance of the REST API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

## fill in your search query and store your results in a variable
results = api.search(q="your search term here", lang="en", result_type="recent", count=500)

## use the codecs library to write the text of the Tweets to a .txt file
file = codecs.open("your text file name here.txt", "w", "utf-8")
for result in results:
    file.write(result.text)
    file.write("\n")
file.close()
```


You can see in the script that we are writing result.text to a .txt file, and not simply the result, which is what the API actually returns. APIs that return language data from social media or online journalism sites usually return lots of metadata along with your results. To package this metadata, they format their output in JSON, which is easy for machines to read.

For example, in the script above, every “result” is its own JSON object, with “text” being just one field — the one that contains the Tweet text. Other fields in the JSON file contain metadata like the location or timestamp of the Tweet, which you can extract for a more detailed analysis.

To access the rest of the metadata, we’d need to write to a JSON file, but for this project we’re just going to analyze the text of people’s Tweets. So in this case a .txt file is fine, and our script will simply discard the rest of the metadata once it finishes. If you want to take a look at the full JSON results, print everything the API returns to you instead.
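One way to keep the full payloads is to dump each result’s underlying JSON with the json module. Tweepy’s Status objects expose this on their _json attribute; the snippet below fakes two results so it runs standalone:

```python
import json

## Stand-in for the objects Tweepy returns: each Status object carries
## the full JSON payload on its ._json attribute. We fake two results
## here so the snippet is self-contained.
class FakeStatus:
    def __init__(self, payload):
        self._json = payload

results = [FakeStatus({"text": "hello", "lang": "en"}),
           FakeStatus({"text": "world", "lang": "en"})]

## dump the complete payloads, metadata and all, to a JSON file
with open("tweets.json", "w", encoding="utf-8") as f:
    json.dump([r._json for r in results], f, indent=2)

## read it back to confirm nothing was lost
with open("tweets.json", encoding="utf-8") as f:
    print([t["text"] for t in json.load(f)])
```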

This is also why we used the codecs module: it avoids formatting issues when the script reads the JSON results and writes UTF-8 text.

## Step 2. Analyze Sentiment

Once we’ve collected the text of the Tweets we want to analyze, we can use more advanced NLP tools to start extracting information from it. Sentiment analysis is a great example of this, since it tells us whether people were expressing positive, negative, or neutral sentiment in the text that we have.

For sentiment analysis, we’re going to use our own AYLIEN Text API. Just like with the Twitter Search API, you’ll need to sign up for the free plan to grab your API key (don’t worry — free means free permanently. There’s no credit card required, and we don’t harass you with promotional stuff!). This plan gives you 1,000 calls to the API per month free of charge.

Again, you can install using pip:


```shell
pip install aylien-apiclient
```


Then make sure the SDK has installed correctly from your Python shell:


```python
>>> from aylienapiclient import textapi
```


Once you’ve got your App key and Application ID, insert them into the code below to get started with your first call to the API from the Python shell (we also have extensive documentation in 7 popular languages). Our API lets you make your first call to the API with just four lines of code:


```python
>>> from aylienapiclient import textapi
>>> client = textapi.Client('Your_app_ID', 'Your_application_key')
>>> sentiment = client.Sentiment({'text': 'enter some of your own text here'})
>>> print(sentiment)
```


This will return JSON results to you with metadata, just like our results from the Twitter API.

So now we need to analyze our corpus from step 1. To do this, we need to analyze every Tweet separately. The script below uses the io module to open up a new .csv file and write the column headers “Tweet” and “Sentiment”, and then it opens and reads the .txt file containing our Tweets. Then, for each Tweet in the .txt file it sends the text to the AYLIEN API, extracts the sentiment prediction from the JSON that the AYLIEN API returns, and writes this to the .csv file beside the Tweet itself.

This will give us a .csv file with two columns — the text of a Tweet and the sentiment of the Tweet, as predicted by the AYLIEN API. We can look through this file to verify the results, and also visualize our results to see some metrics on how people felt about whatever our search query was.


```python
from aylienapiclient import textapi
import csv, io

## Initialize a new client of AYLIEN Text API
client = textapi.Client("your_app_ID", "your_app_key")

with io.open('Trump_Tweets.csv', 'w', encoding='utf8', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(["Tweet", "Sentiment"])
    with io.open("Trump.txt", 'r', encoding='utf8') as f:
        for tweet in f:
            ## Remove extra spaces or newlines around the text
            tweet = tweet.strip()

            ## Skip tweets which are empty so you don't waste your API credits
            if len(tweet) == 0:
                print('skipped')
                continue

            print(tweet)

            ## Make call to AYLIEN Text API
            sentiment = client.Sentiment({'text': tweet})

            ## Write the sentiment result into csv file
            csv_writer.writerow([sentiment['text'], sentiment['polarity']])
```


You might notice on the final line of the script that we actually write the Tweet as it is returned by the AYLIEN API, rather than the Tweet from the .txt file. The two are identical pieces of text, but we’ve chosen to write the text from the API just to make sure we’re recording the exact text that the API analyzed. This makes it easier to spot if we’ve made an error somewhere.

## Step 3. Visualize your Results

So far we’ve used an API to gather text from Twitter, and used our Text Analysis API to analyze whether people were speaking positively or negatively in their Tweet. At this point, you have a couple of options for what to do with the results. You can feed this structured information about sentiment into whatever solution you’re building, which could be anything from a simple social listening app to an automated report on the public reaction to a campaign. You could also use the data to build informative visualizations, which is what we’ll do in this final step.

For this step, we’re going to use matplotlib to visualize our data and Pandas to read the .csv file, two Python libraries that are easy to get up and running. You’ll be able to create a visualization from the command line or save it as a .png file.

Install both using pip:


```shell
pip install matplotlib
pip install pandas
```


The script below uses Pandas to read our .csv file and its “Sentiment” column. It then uses Counter to count how many times each sentiment appears, and matplotlib plots Counter’s results as a color-coded pie chart (you’ll need to enter your search query into the “yourtext” variable for presentation reasons).


```python
## import the libraries
import matplotlib.pyplot as plt
import pandas as pd
from collections import Counter

## use Pandas to read the "Sentiment" column from the csv file
## we created in Step 2
df = pd.read_csv('your_csv_file_from_step_2.csv', encoding='utf8')
sent = df["Sentiment"]

## use Counter to count how many times each sentiment appears
## and save each as a variable
counter = Counter(sent)
positive = counter['positive']
negative = counter['negative']
neutral = counter['neutral']

## declare the variables for the pie chart, using the Counter variables for "sizes"
labels = 'Positive', 'Negative', 'Neutral'
sizes = [positive, negative, neutral]
colors = ['green', 'red', 'grey']
yourtext = "Your Search Query from Step 2"

## use matplotlib to plot the chart
plt.pie(sizes, labels=labels, colors=colors, shadow=True, startangle=90)
plt.title("Sentiment of Tweets about " + yourtext)
plt.show()
```


If you want to save your chart to a .png file instead of just showing it, replace plt.show() on the last line with plt.savefig('your_chart_name.png'). Below is the visualization we ended up with (we searched “Trump” in step 1).

If you run into any issues with these scripts, big or small, please leave a comment below and we’ll look into it. We always try to anticipate any problems our own users might run into, so be sure to let us know!

That concludes our introductory Text Mining project with Python. We hope it gets you up and running with the libraries and APIs, and that it gives you some ideas about subjects that would interest you. With the world producing content on such a large scale, the only obstacle holding you back from an interesting project is your own imagination!

Happy coding!

Chatbots are a hot topic in tech at the moment. They’re at the center of a shift in how we communicate, so much so that they are central to the strategy and direction of major tech companies like Microsoft and Facebook. According to Satya Nadella, CEO of Microsoft, “Chatbots are the new apps”.

## So why exactly have chatbots become so popular?

Their rise in popularity is partly connected to the resurgence of AI and its applications in industry, but it’s also down to our insatiable appetite for on-demand service and our shift to messaging apps over email and phone. A recent study found that 44% of US consumers would prefer to use chatbots over humans for customer relations, and 61% of those surveyed said they interact with a chatbot at least once a month. This is because they suit today’s consumers’ needs – they can respond to customer queries instantly, day or night.

Large brands and tech companies have recognised this shift in customer needs and now rely on messaging apps and intelligent assistants to provide a better experience for their customers.

So while the adoption of intelligent assistants and chatbots is growing at a colossal rate, contrary to popular belief and media hype, they’re actually nothing new. We’ve had them for over fifty years in the Natural Language Processing community, and they’re a great example of the core mission of NLP – programming computers to understand how humans communicate.

In this blog, we’re going to show three different chatbots and let you interact with each bot so you can see how they have advanced. We’ll give some slightly technical explanations of how each chatbot works so you can see how NLP works under the hood.

## The Chatbots

1. ELIZA – a chatbot from 1966 that was the first well-known chatbot in the NLP community
2. ALICE – a chatbot from the late 1990s that inspired the movie Her
3. Neuralconvo – a Deep Learning chatbot from 2016 that learned to speak from movie scripts

We should mention here that these three bots are all “chit-chat” bots, as opposed to “task-oriented” bots. Whereas task-oriented bots are built for a specific use like checking if an item is in stock or ordering a pizza, a chit-chat bot has no function other than imitating a real person for you to chat with. By seeing how chit-chat bots have advanced, you’re going to see how the NLP community has used different methods to replicate human communication.

### ELIZA – A psychotherapy bot

The first version of ELIZA was finished in 1966 by Joseph Weizenbaum, a brilliant, eccentric MIT professor considered one of the fathers of AI (and the subject of a great documentary). ELIZA emulates a psychotherapist, one that Weizenbaum’s colleagues trusted enough to divulge highly personal information to, even after they knew it was a computer program. Weizenbaum was so shocked that his colleagues thought ELIZA could help them that he spent the rest of his life advocating for social responsibility in AI.

But ELIZA only emulates a psychotherapist because it uses clever ways to return your text as a question, just like a real psychotherapist would. This tactic means ELIZA can respond to a question it doesn’t understand by rephrasing the input as a question, so the user is kept in conversation.

Just like any algorithm, a chatbot works from rules that tell it how to take an input and produce an output. In the case of chatbots, the input is the text you supply and the output is the text it returns to you as a response. Looking at the responses you get from ELIZA, you’ll see two rough categories of rules:

• On a syntactic level, it swaps personal pronouns (“my” to “your,” and vice versa).
• To imitate semantic understanding (i.e. that it understands the meaning of what you are typing), it is programmed to recognize certain keywords and return phrases that have been marked as suitable responses to that input. For instance, if you input “I want to ___” it will return “What would it mean to you if you ___?”
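To make these two rule types concrete, here is a minimal sketch in Python – the keyword patterns and canned responses are invented for illustration, and are far simpler than Weizenbaum’s actual script:

```python
import re

# Rule 1: swap personal pronouns so the input can be mirrored back.
PRONOUN_SWAPS = {"my": "your", "your": "my", "i": "you", "me": "you", "am": "are"}

def swap_pronouns(text):
    words = text.lower().rstrip(".!?").split()
    return " ".join(PRONOUN_SWAPS.get(w, w) for w in words)

# Rule 2: keyword patterns mapped to response templates (invented examples).
PATTERNS = [
    (re.compile(r"i want to (.*)", re.IGNORECASE),
     "What would it mean to you if you {0}?"),
    (re.compile(r"i am (.*)", re.IGNORECASE),
     "How long have you been {0}?"),
]

def respond(text):
    for pattern, template in PATTERNS:
        match = pattern.match(text)
        if match:
            return template.format(swap_pronouns(match.group(1)))
    # Fallback: keep the user talking, like a therapist would.
    return "Please tell me more."

print(respond("I want to change my job"))
# → What would it mean to you if you change your job?
```

Notice that the program has no idea what a “job” is; the pronoun swap and the template do all of the work.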

Try to find some of ELIZA’s limits for yourself by asking it questions and working out why it returns each of its responses. Remember: it’s from the 1960s, when color televisions were the height of consumer technology.

This is a pure Natural Language Processing approach to building a chatbot: the bot understands human language through the rules mentioned above, which are basically grammar rules programmed into a computer. This achieves impressive results, but if you wanted to make ELIZA more human-like by pure NLP methods, you would have to add more and more grammatical rules, and because grammar is complicated and contradictory, you would quickly end up with a sort of “rule spaghetti” that no one could maintain. This approach is in contrast with Machine Learning approaches to chatbots (and natural language in general), where an algorithm tries to guess the correct response based on observations it has made of other conversations. You can see this in action in the final chatbot, Neuralconvo. But first, ALICE.

## ALICE – The inspiration for the movie Her

Fast forward from the 1960s to the late 1990s and you meet ALICE, the first well-known chatbot that people could interact with online, and one that developed something of a cult reputation. Director Spike Jonze said that chatting with ALICE in the early 2000s first put the idea for his 2013 film Her in his mind, a movie where a man falls in love with the AI that powers his operating system.

But just like ELIZA, this is a computer program made up of rules that take an input and produce an output. Under the hood, ALICE is an advance on ELIZA in three respects:

• it is written in a programming language called Artificial Intelligence Markup Language (AIML), similar to XML, which allows it to choose responses on a more abstract level
• it contains tens of thousands of possible responses
• it stores previous conversations with users and adds them to its database.

ALICE is an open-source bot, one that anyone can download and modify or contribute to. Originally written by Dr. Richard Wallace, the bot has since received contributions from over 500 volunteers, who have created hundreds of thousands of lines of AIML for ALICE to reproduce in conversation.

So ALICE’s improvements on ELIZA allow for more responses that are better tailored to the text you supply. This allows ALICE to impersonate a person in general, rather than a therapist specifically. The trade-off is that its shortcomings are harder to hide – without a therapist’s open-ended statements and questions, you notice straight away when a response doesn’t match your input. Explore this for yourself below.

So even though ALICE is a more advanced chatbot than ELIZA, the output responses are still written by people, and algorithms choose which output best suits the input. Essentially, people type out the responses and write the algorithms that choose which of these responses will be returned in the hope of mimicking an actual conversation.
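The core of that approach – hand-written templates plus a matching algorithm that picks the best one – can be sketched in a few lines of Python. The categories below are invented stand-ins, far simpler than real AIML, but the idea of preferring the most specific matching pattern is the same:

```python
# Hand-written "categories": a wildcard pattern and a canned response.
# The matcher prefers more specific patterns (fewer wildcards).
CATEGORIES = [
    ("HELLO *", "Hi there! How are you today?"),
    ("WHAT IS YOUR NAME", "My name is ALICE."),
    ("WHAT IS *", "I am not sure, but it sounds interesting."),
    ("*", "I see. Go on."),  # catch-all fallback
]

def matches(pattern, words):
    ptoks = pattern.split()
    def rec(i, j):
        if i == len(ptoks):
            return j == len(words)
        if ptoks[i] == "*":  # wildcard: absorb one or more words
            return any(rec(i + 1, k) for k in range(j + 1, len(words) + 1))
        return j < len(words) and ptoks[i] == words[j] and rec(i + 1, j + 1)
    return rec(0, 0)

def respond(text):
    words = text.upper().rstrip(".!?").split()
    # Rank candidates by wildcard count: fewer wildcards = more specific.
    candidates = [(p.split().count("*"), r) for p, r in CATEGORIES if matches(p, words)]
    return min(candidates)[1]

print(respond("What is your name?"))  # → My name is ALICE.
```

Every response a user ever sees was typed out by a person in advance; only the choice of response is automatic.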

Improving the performance and intelligence of chatbots is a popular research area and much of the recent interest in advancing chatbots has been around Deep Learning. Applying Deep Learning to chatbots seems likely to massively improve a chatbot’s ability to interact more like a human. Whereas ELIZA and ALICE reproduce text that was originally written by a person, a Deep Learning bot creates its own text from scratch, based on human speech it has analyzed.

## Neuralconvo – A Deep Learning bot

One such bot is Neuralconvo, a modern chatbot created in 2016 by Julien Chaumond and Clément Delangue, co-founders of Huggingface, which was trained using Deep Learning. Deep Learning is a method of training computers to learn patterns in data by using deep neural networks. It is enabling huge breakthroughs in computer science, particularly in AI, and more recently NLP. When applied to chatbots, Deep Learning allows programs to select a response or even to generate entirely new text.

Neuralconvo can come up with its own text because it has “learned” by reading thousands of movie scripts and recognizing patterns in the text. So when Neuralconvo reads a sentence, it recognizes patterns in your text, refers back to its training to look for similar patterns, and then generates a new sentence that it predicts would follow yours if the conversation appeared in a movie script. It’s basically trying to be cool based on movies it’s seen.

The fundamental difference between ELIZA and Neuralconvo is this: whereas ELIZA was programmed to respond to specific keywords in your input with specific responses, Neuralconvo makes guesses based on probabilities it has observed in movie scripts. So there are no rules telling Neuralconvo to respond to a question in a certain way, for example, but in principle the range of its possible answers is unlimited.

Considering Neuralconvo is trained on movie scripts, you’ll see that its responses are suitably dramatic.

The exact model that is working under the hood here is based on the Sequence to Sequence architecture, which was first applied to generate dialogue by Quoc Viet Le and Oriol Vinyals. This architecture consists of two parts: the first one encodes your sentence into a vector, which is basically a code that represents the text. After the entire input text has been encoded this way, the second part then decodes that vector and produces the answer word-by-word by predicting each word that is most likely to come next.
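A toy version of that encoder-decoder idea, with tiny random (untrained) weights standing in for the learned LSTM parameters of a real Sequence to Sequence model, might look like this – every dimension, matrix, and vocabulary entry here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "hi", "how", "are", "you", "fine"]
V, H = len(vocab), 8  # vocabulary size, hidden size

# Illustrative random parameters (a trained model learns these).
E = rng.normal(0, 0.1, (V, H))      # word embeddings
W_enc = rng.normal(0, 0.1, (H, H))  # encoder recurrence
W_dec = rng.normal(0, 0.1, (H, H))  # decoder recurrence
W_out = rng.normal(0, 0.1, (H, V))  # output projection

def encode(token_ids):
    """Encoder: fold the whole input sentence into a single vector h."""
    h = np.zeros(H)
    for t in token_ids:
        h = np.tanh(E[t] + W_enc @ h)
    return h

def decode(h, max_len=5):
    """Decoder: predict the most likely next word, one step at a time."""
    out, prev = [], vocab.index("<s>")
    for _ in range(max_len):
        h = np.tanh(E[prev] + W_dec @ h)
        logits = h @ W_out
        probs = np.exp(logits) / np.exp(logits).sum()  # softmax over vocab
        prev = int(probs.argmax())                     # greedy choice
        out.append(vocab[prev])
    return out

h = encode([vocab.index(w) for w in ["how", "are", "you"]])
print(decode(h))  # some sequence of vocabulary words (weights are untrained)
```

With random weights the “reply” is gibberish, but the two-part structure – compress the input to a vector, then emit words one at a time conditioned on it – is exactly the architecture described above.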

Neuralconvo isn’t going to fool you into thinking that it is a person anytime soon, since it is just a demo of a bot trained on movie scripts. But imagine how effective a bot like this could be when trained using context-specific data, like your own SMS or WhatsApp messages. That’s what’s on the horizon for chatbots, but remember – they will still be algorithms taking your text as input, referring to rules, and returning different text as an output.

Well that sums up our lightning tour of chatbots from the 1960s to today. If you’re interested in blogs about technical topics like training AI to play Flappy Bird or why you should open-source your code, take a look at the Research section of our blog, where our research scientists and engineers post about what interests them.

Extracting insights from millions of articles at once can create a lot of value, since it lets us understand what information thousands of journalists are producing about what’s happening in the world. But extracting accurate insights depends on filtering out noise and finding relevant content. To give our users access to relevant content, our News API analyzes thousands of news articles in near real-time and categorizes them according to what the content is about.

Having content at web-scale arranged into categories provides accurate information about what the media are publishing as the stories emerge. This allows us to do two things, depending on what we want to use the API for: we can either look at a broad picture of what is being covered in the press, or we can carry out a detailed analysis of the coverage about a specific industry, organization, or event.

For this month’s roundup, we decided to do both. First we’re going to look at which news categories the media covered the most and what the content in those categories was about, and then we’ll pick one category for a more detailed look. We’ll start with a high-level look at sports content, because it’s what the world’s media wrote the most about, and then we’ll dive into stories about finance, to see what insights the News API can produce for us in a business field.

## The 100 categories with the highest volume of stories

The range of the subject matter contained in content published every day is staggering, which makes understanding all of this content at scale particularly difficult. However, the ability to classify new content based on well known, industry-standard taxonomies means it can be easily categorized and understood.

Our News API categorizes every article it analyzes according to two taxonomies: Interactive Advertising Bureau’s QAG taxonomy and IPTC’s Newscodes. We chose to use the IAB-QAG taxonomy, which contains just under 400 categories and subcategories, and decided to look into the top 100 categories and subcategories that the media published the most about in June. This left us with just over 1.75 million of the stories that our News API has gathered and analyzed.

Take a look at the most popular ones in the visualization below.

Note: you can interact with all of the visualizations on this blog – click on each data point for more information, and exclude the larger data points if you want to see more detail on the smaller ones.

As you can see, sport accounted for more of the stories published in June than any other category. It might not surprise people to see that the media publish a lot about sport, but the details you can pick out here are pretty interesting – like the fact that there were more stories about soccer than about food, religion, or fashion last month.

The chart below puts the volume of stories about sports into perspective – news outlets published almost 13 times more stories about sports than they did about music.

## What people wrote about sports

Knowing that people wrote so much about sport is great, but we still don’t know what people were talking about in all of this content. To find this out, we decided to dive into the stories about sports and see what the content was about – take a look at the chart below showing the most-mentioned sports sub-categories last month.

In this blog we’re only looking into stories that were in the top 100 sub-categories overall, so if your favourite sport isn’t listed below, that means it wasn’t popular enough and you’ll need to query our API for yourself to look into it (sorry, shovel racers).

You can see how soccer dominates the content about sport, even though it’s off-season for every major soccer league. To put this volume in perspective, there were more stories published about soccer than about baseball and basketball combined. Bear in mind, last month saw the MLB Draft and the NBA finals, so it wasn’t exactly a quiet month for either of these sports.

We then analyzed the stories about soccer with the News API’s entities feature to see which people, countries, and organisations were being talked about.

If you check the soccer schedules for June, you’ll see the Confederations Cup, a competition between international teams, was the only major tournament taking place. However, you can see above that the soccer coverage was still dominated by stories about the clubs with the largest fan bases. The most-mentioned clubs above also top the table in a Forbes analysis of clubs with the greatest social media reach among fans.

## Finance

So we’ve just taken a look at what people and organizations dominated the coverage in the news categories that the media published the most in. But even though the sports category is the single most popular one, online content is so wide-ranging that sports barely accounted for 10% of the 1.75 million stories our News API crawled last month.

We thought it would be interesting to show you how to use the API to look into business fields and spot a high-level trend in the news content last month. Using the same analysis that we used on sports stories above, we decided to look at stories about finance. Below is a graph of the most-mentioned entities in stories published in June that fell into the finance category.

You can see that the US and American institutions dominate the coverage of the financial news. This is hardly surprising, considering America’s role as the main financial powerhouse in the world. But what sticks out a little here is that the Yen is the only currency entity mentioned, even though Japan isn’t mentioned as much as other countries.

To find out what kind of coverage the Yen was garnering last month, we analyzed the sentiment of the stories with “Yen” in the title to see how many contained positive, negative, or neutral sentiment.
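Tallying those polarities from a set of returned stories takes only a few lines – the story dictionaries below are made-up stand-ins, shaped loosely like an API response rather than matching the News API’s actual format:

```python
from collections import Counter

# Hypothetical stories with a sentiment label per title (illustrative data).
stories = [
    {"title": "Yen falls after weak GDP data", "sentiment": "negative"},
    {"title": "Yen steadies in Asian trading", "sentiment": "neutral"},
    {"title": "Investors flee the Yen", "sentiment": "negative"},
    {"title": "Yen rallies on safe-haven demand", "sentiment": "positive"},
]

def sentiment_breakdown(stories):
    """Count how many stories carry each sentiment polarity."""
    return Counter(story["sentiment"] for story in stories)

print(sentiment_breakdown(stories))
# → Counter({'negative': 2, 'neutral': 1, 'positive': 1})
```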

We can see that there is much more negative coverage than positive, so we can presume that Japan’s currency had some bad news last month. But that leaves us with a big question: why was there so much negative press about the Yen last month?

To find out, we used the keywords feature. Analyzing the keywords in stories returns more detailed information than the entities endpoint we used on the soccer content above, so it is best used when you’re diving into a specific topic rather than getting an overview of some news content, where you would pick up a lot of noise. It is more detailed because whereas the entities feature returns accurate information about the places, people, and organisations mentioned in stories, the keywords feature also includes the most important nouns and verbs in those stories. This means we get a more detailed picture of the things that happened.

Take a look below at the most-mentioned keywords from stories that were talking about the Yen last month.

You can see that the keywords feature returns a different kind of result than entities – words like “year,” and “week,” and “investor,” for example. If we looked at the keywords from all of the news content published in June, it would be hard to get insights because the keywords would be so general. But since we’re diving into a defined topic, we can extract some detailed insights about what actually happened.

Looking at the chart above you can probably guess for yourself what the main stories about the Yen last month involved. From the most-mentioned terms – keywords like “data,” “growth,” “GDP,” and “economy” – we can see that Japan released some negative data about economic growth, which explains the high volume of negative stories about the Yen. You can see below how the value of the Yen started a sustained drop after June 15th, the day this economic data was announced, and our News API has tracked the continued negative sentiment.

These are just a couple of examples of steps our users take to automatically extract insights from content on subjects that interest them, whether it is for media monitoring, content aggregation, or any of the thousands of use cases our News API facilitates.

If you can think of any categories you’d like to extract information from using the News API, sign up for a free 14-day trial by clicking on the link below (free means free – you don’t need a credit card and there’s no obligation to purchase).

In the last post we looked at how Generative Adversarial Networks could be used to learn representations of documents in an unsupervised manner. In evaluation, we found that although the model was able to learn useful representations, it did not perform as well as an older model called DocNADE. In this post we give a brief overview of the DocNADE model, and provide a TensorFlow implementation.

## Neural Autoregressive Distribution Estimation

Recent advances in neural autoregressive generative modeling have led to impressive results at modeling images and audio, as well as language modeling and machine translation. This post looks at a slightly older take on neural autoregressive models – the Neural Autoregressive Distribution Estimator (NADE) family of models.

An autoregressive model is based on the fact that any D-dimensional distribution can be factored into a product of conditional distributions in any order:

$p(\mathbf{x}) = \prod_{d=1}^{D} p(x_d | \mathbf{x}_{<d})$

where $\mathbf{x}_{<d}$ represents the first $d-1$ dimensions of $\mathbf{x}$ in the current ordering. We can therefore create an autoregressive generative model by just parameterising all of the separate conditionals in this equation.

One of the simplest ways to do this is to take a sequence of binary values and assume that the output at each timestep is just a linear combination of the previous values. We can then pass this weighted sum through a sigmoid to get the output probability for each timestep. This sort of model is called a fully-visible sigmoid belief network (FVSBN):

A fully visible sigmoid belief network. Figure taken from the NADE paper.

Here we have binary inputs $v$ and generated binary outputs $\hat{v}$. $\hat{v_3}$ is produced from the inputs $v_1$ and $v_2$.
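A minimal numpy sketch of the FVSBN computation, with random weights standing in for trained parameters, looks like this – the strictly lower-triangular weight matrix guarantees that each conditional only sees the earlier dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 5
# Strictly lower-triangular weights: output d may only look at dims < d.
W = np.tril(rng.normal(0, 1, (D, D)), k=-1)
b = rng.normal(0, 1, D)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fvsbn_prob(v):
    """Joint probability of binary vector v as a product of conditionals."""
    p = 1.0
    for d in range(D):
        p_d = sigmoid(b[d] + W[d] @ v)        # p(v_d = 1 | v_<d)
        p *= p_d if v[d] == 1 else (1 - p_d)  # pick the observed outcome
    return p

print(fvsbn_prob(np.array([1, 0, 1, 1, 0])))  # a probability in (0, 1)
```

Because each factor is a proper conditional distribution, summing `fvsbn_prob` over all 2^D binary vectors gives exactly 1, i.e. the factorisation above really does define a valid joint distribution.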

NADE can be seen as an extension of this, where instead of a linear parameterisation of each conditional, we pass the inputs through a feed-forward neural network:

Neural Autoregressive Distribution Estimator. Figure taken from the NADE paper.

Specifically, each conditional is parameterised as:

$p(x_d | \mathbf{x_{<d}}) = \text{sigm}(b_d + \mathbf{V}_{d,:} \mathbf{h}_d)$

$\mathbf{h}_d = \text{sigm}(c + \mathbf{W}_{:,<d} \mathbf{x}_{<d})$

where $\mathbf{W}$, $\mathbf{V}$, $b$ and $c$ are learnable parameters of the model. This can then be trained by minimising the negative log-likelihood of the data.

When compared to the FVSBN there is also additional weight sharing in the input layer of NADE: each input element uses the same parameters when computing the various output elements. This parameter sharing was inspired by the Restricted Boltzmann Machine, but also has some computational benefits – at each timestep we only need to compute the contribution of the new sequence element (we don’t need to recompute all of the preceding elements).

In the standard NADE model, the input and outputs are binary variables. In order to work with sequences of text, the DocNADE model extends NADE by considering each element in the input sequence to be a multinomial observation – or in other words one of a predefined set of tokens (from a fixed vocabulary). Likewise, the output must now also be multinomial, and so a softmax layer is used at the output instead of a sigmoid. The DocNADE conditionals are then given by:

$p(x_d = w | \mathbf{x}_{<d}) = \frac{\text{exp} (b_{w} + \mathbf{V}_{w,:} \mathbf{h}_d) } {\sum_{w'} \text{exp} (b_{w'} + \mathbf{V}_{w',:} \mathbf{h}_d) }$

$\mathbf{h}_d = \text{sigm}\Big(c + \sum_{k<d} \mathbf{W}_{:,x_k} \Big)$

An additional type of parameter sharing has been introduced in the input layer – each element has the same weights no matter where it appears in the sequence (so if the word “cat” appears in input positions 2 and 10, it will use the same weights each time).

There is another way to look at this, however. We now have a single set of parameters for each word no matter where it appears in the sequence, and there is a common name for this architectural pattern – a word embedding. So we can view DocNADE as a way of constructing word embeddings, but with a different set of constraints than we might be used to from models like Word2Vec. For each input in the sequence, DocNADE uses the sum of the embeddings from the previous timesteps (passed through a sigmoid nonlinearity) to predict the word at the next timestep. The final representation of a document is just the value of the hidden layer at the final timestep (or in other words, the sum of the word embeddings passed through a nonlinearity).
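To make that concrete, here is a small numpy sketch, with random untrained parameters and sizes chosen purely for illustration, of how DocNADE’s conditionals and final document representation are computed from summed word embeddings:

```python
import numpy as np

rng = np.random.default_rng(2)
V, H = 10, 4  # vocabulary size, hidden size (illustrative)

W = rng.normal(0, 0.1, (H, V))      # word embeddings, one column per word
c = np.zeros(H)                     # hidden bias
V_out = rng.normal(0, 0.1, (V, H))  # output weights
b = np.zeros(V)                     # output bias

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def docnade_log_likelihood(doc):
    """log p(doc): each word is predicted from the sum of the
    embeddings of the words before it."""
    log_p, embed_sum = 0.0, np.zeros(H)
    for w in doc:
        h = sigmoid(c + embed_sum)                   # hidden state from x_<d
        log_p += np.log(softmax(b + V_out @ h)[w])   # p(x_d = w | x_<d)
        embed_sum += W[:, w]   # incremental update: only the new word is added
    return log_p

def document_representation(doc):
    """Final representation: nonlinearity over the sum of all embeddings."""
    return sigmoid(c + W[:, doc].sum(axis=1))

doc = [3, 7, 7, 1]  # a toy "document" of word ids
print(docnade_log_likelihood(doc), document_representation(doc))
```

The incremental `embed_sum` update mirrors the computational benefit mentioned earlier: each timestep only adds the contribution of the new word rather than recomputing the whole prefix.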

There is one more constraint that we have not yet discussed – the sequence order. Instead of training on sequences of words in the order that they appear in the document, as we do when training a language model for example, DocNADE trains on random permutations of the words in a document. We therefore get embeddings that are useful for predicting what words we expect to see appearing together in a full document, rather than focusing on patterns that arise due to syntax and word order (or focusing on smaller contexts around each word).

## An Overview of the TensorFlow code

The full source code for our TensorFlow implementation of DocNADE is available on Github; here we will just highlight some of the more interesting parts.

First we do an embedding lookup for each word in our input sequence (x). We initialise the embeddings to be uniform in the range [0, 1.0 / (vocab_size * hidden_size)], which is taken from the original DocNADE source code. I don’t think that this is mentioned anywhere else, but we did notice a slight performance bump when using this instead of the default TensorFlow initialisation.


with tf.device('/cpu:0'):
    max_embed_init = 1.0 / (params.vocab_size * params.hidden_size)
    W = tf.get_variable(
        'embedding',
        [params.vocab_size, params.hidden_size],
        initializer=tf.random_uniform_initializer(maxval=max_embed_init)
    )
    self.embeddings = tf.nn.embedding_lookup(W, x)


Next we compute the pre-activation for each input element in our sequence. We transpose the embedding sequence so that the sequence-length elements are now the first dimension (instead of the batch), then we use the higher-order tf.scan function to apply sum_embeddings to each sequence element in turn. This replaces each embedding with the sum of that embedding and all preceding embeddings.


def sum_embeddings(previous, current):
    return previous + current

h = tf.scan(sum_embeddings, tf.transpose(self.embeddings, [1, 2, 0]))
h = tf.transpose(h, [2, 0, 1])

h = tf.concat([
    tf.zeros([batch_size, 1, params.hidden_size], dtype=tf.float32), h
], axis=1)

h = h[:, :-1, :]


We then initialise the bias terms, prepend a zero vector to the input sequence (so that the first element is generated from just the bias term), and apply the nonlinearity.


bias = tf.get_variable(
    'bias',
    [params.hidden_size],
    initializer=tf.constant_initializer(0)
)
h = tf.tanh(h + bias)


Finally we compute the sequence loss, which is masked according to the length of each sequence in the batch. Note that for optimisation, we do not normalise this loss by the length of each document. This leads to slightly better results as mentioned in the paper, particularly for the document retrieval evaluation (discussed below).


h = tf.reshape(h, [-1, params.hidden_size])
logits = linear(h, params.vocab_size, 'softmax')


## Experiments

As DocNADE computes the probability of the input sequence, we can measure how well it is able to generalise by computing the probability of a held-out test set. In the paper the actual metric that they use is the average perplexity per word, which for time $t$, input $x$ and test set size $N$ is given by:

$\text{exp} \big(-\frac{1}{N} \sum_{t} \frac{1}{|x_t|} \log p(x_t) \big)$
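In code, that metric amounts to the following – the per-document log-likelihoods and word counts here are made-up values standing in for real model outputs:

```python
import numpy as np

def average_perplexity_per_word(log_probs, lengths):
    """exp(-1/N * sum_t (1/|x_t|) * log p(x_t)) over N test documents."""
    log_probs = np.asarray(log_probs, dtype=float)  # total log p(x_t) per doc
    lengths = np.asarray(lengths, dtype=float)      # word count |x_t| per doc
    return float(np.exp(-np.mean(log_probs / lengths)))

# Two fake test documents: total log-likelihoods and their word counts.
print(average_perplexity_per_word([-700.0, -1200.0], [100, 200]))
```

Lower is better: a model that assigned every word probability 1 would score a perplexity of 1.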

As in the paper, we evaluate DocNADE on the same (small) 20 Newsgroups dataset that we used in our previous post, which consists of a collection of around 19,000 postings to 20 different newsgroups. The published version of DocNADE uses a hierarchical softmax on this dataset, despite the fact that it uses a small vocabulary size of 2000. There is not much need to approximate a softmax of this size when training relatively small models on modern GPUs, so here we just use a full softmax. This makes a large difference in the reported perplexity numbers – the published implementation achieves a test perplexity of 896, but with the full softmax we can get this down to 579. To show how big an improvement this is, the following table shows perplexity values on this task for models that have been published much more recently:

One additional change from the evaluation in the paper is that we evaluate the average perplexity over the full test set (in the paper they just take a random sample of 50 documents).

We were expecting to see an improvement due to the use of the full softmax, but not an improvement of quite this magnitude. Even when using a sampled softmax on this task instead of the full softmax, we see some big improvements over the published results. This suggests that the hierarchical softmax formulation that was used in the original paper was a relatively poor approximation of the true softmax (but it’s possible that there is a bug somewhere in our implementation, if you find any issues please let us know).

We also see an improvement on the document retrieval evaluation results with the full softmax:

For the retrieval evaluation, we first create vectors for every document in the dataset. We then use the held-out test set vectors as “queries”, and for each query we find the closest N documents in the training set (by cosine similarity). We then measure what percentage of these retrieved training documents have the same newsgroup label as the query document. We then plot a curve of the retrieval performance for different values of N.
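That evaluation procedure is straightforward to sketch with numpy – here random vectors and labels stand in for the learned document representations:

```python
import numpy as np

def retrieval_precision(train_vecs, train_labels, query_vecs, query_labels, n):
    """For each query, fetch the n nearest training docs by cosine
    similarity and measure the fraction sharing the query's label."""
    train = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    queries = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    sims = queries @ train.T                  # cosine similarity matrix
    top_n = np.argsort(-sims, axis=1)[:, :n]  # indices of the n closest docs
    hits = train_labels[top_n] == query_labels[:, None]
    return float(hits.mean())

rng = np.random.default_rng(3)
train_vecs = rng.normal(size=(100, 16))
train_labels = rng.integers(0, 20, size=100)
query_vecs = rng.normal(size=(10, 16))
query_labels = rng.integers(0, 20, size=10)
print(retrieval_precision(train_vecs, train_labels, query_vecs, query_labels, n=5))
```

Sweeping `n` and plotting the resulting precision values gives the retrieval curves described above.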

Note: for working with larger vocabularies, the current implementation supports approximating the softmax using the sampled softmax.

## Conclusion

We took another look at DocNADE, noting that it can be viewed as another way to train word embeddings. We also highlighted the potential for large performance boosts with older models due simply to modern computational improvements – in this case because it is no longer necessary to approximate softmaxes over smaller vocabularies. The full source code for the model is available on Github.

I presented some preliminary work on using Generative Adversarial Networks to learn distributed representations of documents at the recent NIPS workshop on Adversarial Training. In this post I provide a brief overview of the paper and walk through some of the code.

## Learning document representations

Representation learning has been a hot topic in recent years, in part driven by the desire to apply the impressive results of Deep Learning models on supervised tasks to the areas of unsupervised learning and transfer learning. There are a large variety of approaches to representation learning in general, but the basic idea is to learn some set of features from data, and then use these features for some other task where you may only have a small number of labelled examples (such as classification). The features are typically learned by trying to predict various properties of the underlying data distribution, or by using the data to solve a separate (possibly unrelated) task for which we do have a large number of labelled examples.

The ability to do this is desirable for several reasons. In many domains there may be an abundance of unlabelled data available to us, while supervised data is difficult and/or costly to acquire. Some people also feel that we will never be able to build more generally intelligent machines using purely supervised learning (a viewpoint that is illustrated by the now infamous LeCun cake slide).

Word (and character) embeddings have become a standard component of Deep Learning models for natural language processing, but there is less consensus around how to learn representations of sentences or entire documents. One of the most established techniques for learning unsupervised document representations from the literature is Latent Dirichlet Allocation (LDA). Later neural approaches to modeling documents have been shown to outperform LDA when evaluated on a small news corpus (discussed below). The first of these was the Replicated Softmax, which is based on the Restricted Boltzmann Machine; this was later surpassed by a neural autoregressive model called DocNADE.

In addition to autoregressive models like the NADE family, there are two other popular approaches to building generative models at the moment – Variational Autoencoders and Generative Adversarial Networks (GANs). This work is an early exploration to see if GANs can be used to learn document representations in an unsupervised setting.

## Modeling Documents with Generative Adversarial Networks

In the original GAN setup, a generator network learns to map samples from a (typically low-dimensional) noise distribution into the data space, and a second network called the discriminator learns to distinguish between real data samples and fake generated samples. The generator is trained to fool the discriminator, with the intended goal being a state where the generator has learned to create samples that are representative of the underlying data distribution, and the discriminator is unsure whether it is looking at real or fake samples.

There are a couple of questions to address if we want to use this sort of technique to model documents:

• At Aylien we are primarily interested in using the learned representations for new tasks, rather than doing some sort of text generation. Therefore we need some way to map from a document to a latent space. One shortcoming with this GAN approach is that there is no explicit way to do this – you cannot go from the data space back into the low-dimensional latent space. So what is our representation?
• As training requires that the whole model is end-to-end differentiable, how do we represent collections of discrete symbols?

To answer the first question, some extensions to the standard GAN model train an additional neural network to perform this mapping (like this, this and this).

A simpler idea is to just use some internal part of the discriminator as the representation (as is done in the DCGAN paper). We experimented with both approaches, but so far have gotten better results with the latter. Specifically, we use a variation on the Energy-based GAN model, where our discriminator is a denoising autoencoder, and we use the autoencoder bottleneck as our representation (see the paper for more details).

As for representing discrete symbols, we take the most simplified approach we can – assume that a document is just a binary bag-of-words vector (i.e. a vector in which there is a 1 if a given word in a fixed vocabulary is present in a document, and a 0 otherwise). Although this is actually still a discrete vector, we can now just treat it as if all elements are continuous in the range [0, 1] and backpropagate through the full network.
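For example, a tiny binary bag-of-words encoding over a made-up five-word vocabulary looks like this:

```python
import numpy as np

vocab = ["cat", "dog", "sat", "mat", "ran"]
word_to_id = {w: i for i, w in enumerate(vocab)}

def bag_of_words(doc):
    """1.0 if the vocabulary word occurs in the document, else 0.0 —
    stored as floats so the vector can be treated as continuous."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for word in doc.split():
        if word in word_to_id:
            vec[word_to_id[word]] = 1.0
    return vec

print(bag_of_words("the cat sat on the mat"))  # → [1. 0. 1. 1. 0.]
```

Word order and repetition are thrown away entirely; all the model sees is which vocabulary words are present.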

The full model looks like this:

Here z is a noise vector, which passes through a generator network G and produces a vector that is the size of the vocabulary. We then pass either this generated vector or a sampled bag-of-words vector from the data (x) to our denoising autoencoder discriminator D. The vector is then corrupted with masking noise C, mapped into a lower-dimensional space by an encoder, mapped back to the data space by a decoder and then finally the loss is taken as the mean-squared error between the input to D and the reconstruction. We can also extract the encoded representation (h) for any input document.

## An overview of the TensorFlow code

The full source for this model can be found at https://github.com/AYLIEN/adversarial-document-model; here we will just highlight some of the more important parts.

In TensorFlow, the generator is written as:


def generator(z, size, output_size):
    h0 = tf.nn.relu(slim.batch_norm(linear(z, size, 'h0')))
    h1 = tf.nn.relu(slim.batch_norm(linear(h0, size, 'h1')))
    return tf.nn.sigmoid(linear(h1, output_size, 'h2'))


This function takes parameters containing the noise vector z, the size of the generator’s hidden dimension and the size of the final output dimension. It then passes the noise vector through two fully-connected ReLU layers (with batch norm), before passing the output through a final sigmoid layer.

The discriminator is similarly straight-forward:


def discriminator(x, mask, size):
    noisy_input = x * mask  # apply the masking noise to the input
    h0 = leaky_relu(linear(noisy_input, size, 'h0'))
    h1 = linear(h0, x.get_shape()[1], 'h1')
    diff = x - h1
    return tf.reduce_mean(tf.reduce_sum(diff * diff, 1)), h0


It takes a vector x, a noise mask, and the size of the autoencoder bottleneck. The masking noise is applied to the input vector before it is passed through a single leaky ReLU layer and then mapped linearly back to the input space. It returns both the reconstruction loss and the bottleneck tensor.

The full model is:


with tf.variable_scope('generator'):
    self.generator = generator(z, params.g_dim, params.vocab_size)

with tf.variable_scope('discriminator'):
    self.d_loss, self.rep = discriminator(x, mask, params.z_dim)

with tf.variable_scope('discriminator', reuse=True):
    self.g_loss, _ = discriminator(self.generator, mask, params.z_dim)

margin = params.vocab_size // 20
self.d_loss += tf.maximum(0.0, margin - self.g_loss)

vars = tf.trainable_variables()
self.d_params = [v for v in vars if v.name.startswith('discriminator')]
self.g_params = [v for v in vars if v.name.startswith('generator')]

step = tf.Variable(0, trainable=False)

self.d_opt = tf.train.AdamOptimizer(
    learning_rate=params.learning_rate,
    beta1=0.5
).minimize(self.d_loss, var_list=self.d_params, global_step=step)

self.g_opt = tf.train.AdamOptimizer(
    learning_rate=params.learning_rate,
    beta1=0.5
).minimize(self.g_loss, var_list=self.g_params)



We first create the generator, then two copies of the discriminator network (one taking real samples as input, and one taking generated samples). We then complete the discriminator loss by adding the cost from the generated samples (with an energy margin), and create separate Adam optimisers for the discriminator and generator networks. The magic Adam beta1 value of 0.5 comes from the DCGAN paper, and similarly seems to stabilize training in our model.
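To make the energy margin concrete, here is a small pure-Python illustration (the numbers are invented; in the model these losses are mean-squared reconstruction errors): the discriminator is only penalised for reconstructing generated samples well while their reconstruction error is below the margin.

```python
def discriminator_loss(d_loss_real, g_loss, margin):
    """Energy-based objective: reconstruct real samples well (low
    d_loss_real), while pushing the reconstruction error of generated
    samples up to at least `margin`."""
    return d_loss_real + max(0.0, margin - g_loss)

# Generated samples reconstructed too easily: the hinge term is active.
print(discriminator_loss(0.5, 10.0, margin=100.0))   # 0.5 + 90.0 = 90.5
# Once g_loss exceeds the margin, the hinge term vanishes.
print(discriminator_loss(0.5, 150.0, margin=100.0))  # 0.5
```

Once the generator’s samples are “hard enough” to reconstruct, the discriminator can focus entirely on the real data.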

This model can then be trained as follows:


```python
def update(model, x, opt, loss, params, session):
    z = np.random.normal(0, 1, (params.batch_size, params.z_dim))
    mask = np.ones((params.batch_size, params.vocab_size)) * np.random.choice(
        2,
        params.vocab_size,
        p=[params.noise, 1.0 - params.noise]
    )
    loss, _ = session.run([loss, opt], feed_dict={
        model.x: x,
        model.z: z,
        model.mask: mask,
    })
    return loss
```

```python
# … TF training/session boilerplate …

for step in range(params.num_steps + 1):
    _, x = next(training_data)

    # update discriminator
    d_losses.append(update(
        model,
        x,
        model.d_opt,
        model.d_loss,
        params,
        session
    ))

    # update generator
    g_losses.append(update(
        model,
        x,
        model.g_opt,
        model.g_loss,
        params,
        session
    ))
```



Here we get the next batch of training data, then update the discriminator and generator separately. At each update, we generate a new noise vector to pass to the generator, and a new noise mask for the denoising autoencoder (the same noise mask is used for each input in the batch).

## Experiments

To compare with previously published work in this area (LDA, Replicated Softmax, DocNADE), we ran some experiments with this adversarial model on the 20 Newsgroups dataset. It must be stressed that this is a relatively toy dataset by current standards, consisting of around 19,000 postings to 20 different newsgroups.

One open question with generative models (and GANs in particular) is what metric do you actually use to evaluate how well they are doing? If the model yields a proper joint probability over the input, a popular choice is to evaluate the likelihood of a held-out test set. Unfortunately this is not an option for GAN models.

Instead, as we are only really interested in the usefulness of the learned representation, we also follow previous work and compare how likely similar documents are to have representations that are close together in vector space. Specifically, we create vectors for every document in the dataset. We then use the held-out test set vectors as “queries”, and for each query we find the closest N documents in the training set (by cosine similarity). We then measure what percentage of these retrieved training documents have the same newsgroup label as the query document. We then plot a curve of the retrieval performance for different values of N. The results are shown below.

Precision-recall curves for the document retrieval task on the 20 Newsgroups dataset. ADM is the adversarial document model, ADM (AE) is the adversarial document model with a standard autoencoder as the discriminator (and so is similar to the Energy-Based GAN), and DAE is a denoising autoencoder.
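The retrieval evaluation described above can be sketched in a few lines of NumPy. This is a simplified re-implementation for illustration, not the code used to produce the plot:

```python
import numpy as np

def retrieval_precision(train_vecs, train_labels, query_vecs, query_labels, n):
    """Fraction of the top-n cosine-similar training documents that share
    the query document's label, averaged over all queries."""
    # Normalise rows so that a dot product equals cosine similarity.
    train = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    queries = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    sims = queries @ train.T                  # (num_queries, num_train)
    top_n = np.argsort(-sims, axis=1)[:, :n]  # indices of the n nearest docs
    matches = train_labels[top_n] == query_labels[:, None]
    return matches.mean()

# Two tight clusters: each query should retrieve only same-label documents.
train = np.array([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])
labels = np.array([0, 0, 1, 1])
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
print(retrieval_precision(train, labels, queries, np.array([0, 1]), n=2))  # 1.0
```

Sweeping n and plotting the resulting precision values gives curves like the ones above.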

Here we can see a few notable points:

• The model does learn useful representations, but is still not reaching the performance of DocNADE on this task. At lower recall values though it is better than the LDA results on the same task (not shown above, see the Replicated Softmax paper).
• By using a denoising autoencoder as the discriminator, we get a bit of a boost versus just using a standard autoencoder.
• We get quite a large improvement over just training a denoising autoencoder with similar parameters on this dataset.

We also looked at whether the model produced results that were easy to interpret. We note that the autoencoder bottleneck has weights connecting it to every word in the vocabulary, so we looked to see if specific hidden units were strongly connected to groups of words that could be interpreted as newsgroup topics. Interestingly, we find some evidence of this, as shown in the table below, where we present the words most strongly associated with three of these hidden units. They generally fit into understandable topic categories, with a few exceptions. However, we note that these are cherry-picked examples, and that overall the weights for a specific hidden unit do not tend to strongly associate with single topics.

| Computing | Sports | Religion |
|-----------|--------|----------|
| windows | hockey | christians |
| pc | season | windows |
| modem | players | atheists |
| scsi | baseball | waco |
| quadra | rangers | batf |
| floppy | braves | christ |
| xlib | leafs | heart |
| vga | sale | arguments |
| xterm | handgun | bike |
| shipping | bike | rangers |

We can also see reasonable clustering of many of the topics in a t-SNE plot of the test-set vectors (1 colour per topic), although some are clearly still being confused:
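As a rough sketch of how such a plot is produced, here is the projection step using scikit-learn’s TSNE (an assumption — any t-SNE implementation would do). In the real setup the input would be the extracted bottleneck vectors h, coloured by newsgroup label; here random data stands in:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the extracted test-set representations h.
docs = np.random.RandomState(0).randn(100, 50)

# Project to 2-D for plotting; perplexity must be below the sample count.
coords = TSNE(n_components=2, perplexity=30, init="random",
              random_state=0).fit_transform(docs)
print(coords.shape)  # (100, 2)
```

Scattering `coords` with one colour per label then gives the cluster plot.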

## Conclusion

We showed some interesting first steps in using GANs to model documents, admittedly perhaps asking more questions than we answered. In the time since the completion of this work, there have been numerous proposals to improve GAN training, so it would be interesting to see if any of these recent advances help with this task. And of course, we still need to see if this approach can be scaled up to larger datasets and vocabularies. The full source code is now available on GitHub; we look forward to seeing what people do with it.

Every day, over 100,000 flights carry passengers to and from destinations all around the world, and it’s safe to say air travel brings out a fairly mixed bag of emotions in people. Through social media, customers now have a platform to say exactly what’s on their mind while they are traveling, creating a real-time stream of customer opinion on social networks.

If you follow this blog you’ll know that we regularly use Natural Language Processing to get insights into topical subjects ranging from the US Presidential Election to the Super Bowl ad battle. In this post, we thought it would be interesting to collect and analyze Tweets about airlines to see how passengers use Twitter as a platform to voice their opinion. We wanted to compare how often some of the better-known airlines are mentioned by travelers on Twitter, what the general sentiment of those mentions was, and how people’s sentiment varied when they were talking about different aspects of air travel.

#### Collecting Tweets

We chose five airlines and gathered 25,000 of the most recent Tweets mentioning them (from Friday, June 9). We chose the most recent Tweets in order to get a snapshot of what people were talking about at any given time.

#### Airlines

The airlines we chose were:

1. American Airlines – the largest American airline
2. Lufthansa – the largest European airline
3. Ryanair – a low-fares giant that is always courting publicity
4. United Airlines – an American giant that is always (inadvertently) courting publicity
5. Aer Lingus – naturally (we’re Irish).

#### Analysis

We’ll cover the following analyses:

• Volume of tweets and mentions
• Document-Level Sentiment Analysis
• Aspect-based Sentiment Analysis

### Sentiment Analysis

Sentiment analysis, also known as opinion mining, allows us to use computers to analyze the sentiment of a piece of text. Essentially, it gives us an idea of whether a piece of text is positive, negative or neutral.

For example, below is a chart showing the sentiment of Tweets we gathered that mentioned our target airlines.

This chart shows us a very high-level summary of people’s opinions towards each airline. You can see that the sentiment is generally more negative than positive, particularly in the case of the two US-based carriers, United and American. We can also see that negative Tweets account for a larger share of Ryanair’s Tweets than any other airline. While this gives us a good understanding of the public’s opinion of these airlines at the time we collected the Tweets, it doesn’t tell us much about what exactly people were speaking positively or negatively about.
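The kind of aggregation behind a chart like this can be sketched in a few lines of Python. The (airline, sentiment) pairs below are invented stand-ins for the labels a document-level sentiment model would produce over the 25,000 Tweets:

```python
from collections import Counter

# Hypothetical labelled tweets: (airline, sentiment) pairs.
labelled = [
    ("Ryanair", "negative"), ("Ryanair", "negative"), ("Ryanair", "positive"),
    ("Lufthansa", "positive"), ("Lufthansa", "neutral"),
    ("United Airlines", "negative"),
]

def sentiment_shares(rows):
    """Per-airline breakdown of sentiment labels as percentages."""
    totals = Counter(airline for airline, _ in rows)
    counts = Counter(rows)
    return {
        (airline, sentiment): round(100 * n / totals[airline], 1)
        for (airline, sentiment), n in counts.items()
    }

shares = sentiment_shares(labelled)
print(shares[("Ryanair", "negative")])  # 66.7
```

Plotting these percentages per airline gives the stacked bars shown above.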

## Aspect-based Sentiment Analysis digs in deeper

So sentiment analysis can tell us what the sentiment of a piece of text is. But text produced by people usually talks about more than one thing and often has more than one sentiment. For example, someone might write that they didn’t like how a car looked but did like how quiet it was, and a document-level sentiment analysis model would just look at the entire document and add up whether the overall sentiment was mostly positive or negative.

This is where Aspect-based Sentiment Analysis comes in, as it goes one step further and analyzes the sentiment attached to each subject mentioned in a piece of text. This is especially valuable since it allows you to extract richer insights about text that might be a bit complicated.

Here’s an example of our Aspect-based Sentiment Analysis demo analyzing the following piece of text: “This car’s engine is as quiet as hell. But the seats are so uncomfortable!”

It’s clear that Aspect-based Sentiment Analysis can provide more granular insight into the polarity of a piece of text but another problem you’ll come across is context. Words mean different things in different contexts – for instance quietness in a car is a good thing, but in a restaurant it usually isn’t – and computers need help understanding that. With this in mind we’ve tailored our Aspect-based Sentiment Analysis feature to recognize aspects in four industries: restaurants, cars, hotels, and airlines.

So while the example above was analyzing the car domain, below is the result of an analysis of a review of a restaurant, specifically the text “It’s as quiet as hell in this restaurant”:

Even though the text was quite similar to the car review, the model recognized that the words expressed a different sentiment because they were mentioned in a different context.

## Aspect-based Sentiment Analysis in airlines

Now let’s see what we can find in the Tweets we collected about airlines. In the airlines domain, our endpoint recognizes 10 different aspects that people are likely to mention when talking about their experience with airlines.

Before we look at how people felt about each of these aspects, let’s take a look at which aspects they were actually talking about the most.

Noise is a big problem when you’re analyzing social media content. For instance when we analyzed our 25,000 Tweets, we found that almost two thirds had no mention of the aspects we’ve listed above. These Tweets mainly focused on things like online competitions, company marketing material or even jokes about the airlines. When we filtered these noisy Tweets out, we were left with 9,957 Tweets which mentioned one or more aspects.
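The filtering step can be illustrated with a small sketch. The keyword lists here are invented stand-ins; the real filtering used the aspect detection in our Aspect-based Sentiment Analysis endpoint rather than simple substring matching:

```python
# Hypothetical aspect keyword lists (illustrative only).
ASPECT_TERMS = {
    "staff": {"staff", "crew", "attendant"},
    "punctuality": {"late", "delay", "delayed", "on time"},
    "food": {"food", "meal", "snack"},
    "luggage": {"luggage", "baggage", "bag"},
}

def mentioned_aspects(tweet):
    """Return the set of aspects a tweet mentions."""
    text = tweet.lower()
    return {aspect for aspect, terms in ASPECT_TERMS.items()
            if any(term in text for term in terms)}

tweets = [
    "Flight delayed two hours again, thanks a lot",
    "Win free tickets in our summer competition!",   # noise: no aspect
    "The crew were lovely but the meal was awful",
]

with_aspects = [t for t in tweets if mentioned_aspects(t)]
print(len(with_aspects))  # 2
```

Dropping the tweets with no detected aspect is what reduced the 25,000 Tweets to the 9,957 analyzed below.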

The chart below shows which of the 10 aspects were mentioned the most.

On one hand it might come as a surprise to see aspects like food and comfort mentioned so infrequently – when you think about people giving out about airlines you tend to think of them complaining about food or the lack of legroom. On the other hand, it’s no real surprise to see aspects like punctuality and staff mentioned so much.

You could speculate that comfort and food are pretty standard across airlines (nobody expects a Michelin-starred airline meal), but punctuality can vary, so people can be let down by this (when your flight is late it’s an unpleasant surprise, which you would be more likely to Tweet about).

## What people thought about each airline on key aspects

Now that we know what people were talking about, let’s take a look at how they felt. We’re going to look at how each airline performed on four interesting aspects:

1. Staff – the most-mentioned aspect;
2. Punctuality – to see which airline receives the best and worst sentiment for delays;
3. Food – infrequently mentioned but a central part of the in-flight experience;
4. Luggage – which airline gets the most Tweets about losing people’s luggage?

### Staff

We saw in the mentions graph above that people mentioned staff the most when tweeting about an airline. You can see from the graph below that people are highly negative about airline staff in general, with a fairly equal level of negativity towards each airline except Lufthansa, which actually receives more positive sentiment than negative.

### Punctuality

People’s second biggest concern was punctuality, and you can see below that the two US-based airlines score particularly badly on this aspect. It’s also worth noting that while Ryanair receives very negative sentiment in general, people complain about Ryanair’s punctuality less than any of the other airlines. This isn’t too surprising, considering their exemplary punctuality record is one of their major USPs as an airline and something they like to publicize.

### Food

We all know airline food isn’t the best, but when we looked at the sentiment about food in the Tweets, we found that people generally weren’t that vocal about their opinions on plane food. Lufthansa receives the most positive sentiment on this aspect, with their pretty impressive culinary efforts paying off. However, it’s an entirely different story when it comes to the customer reaction towards United’s food. None of us in the AYLIEN office have ever flown United, so from the results we got, we’re all wondering what they’re feeding their passengers.

### Luggage

The last aspect we compared across the airlines was luggage. Looking at the sentiment here, you can see that Lufthansa again performs quite well, while Aer Lingus fares pretty badly. Maybe leave your valuables at home next time you fly with Ireland’s national carrier.

## Ryanair and Lufthansa compared

So far we’ve shown just four of the 10 aspects our Aspect-based Sentiment Analysis feature analyzes in the airlines domain. To show all of them together, we decided to take two very different airlines and put them side by side to see how people’s opinions on each of them compared.

We picked Ryanair and Lufthansa so you can compare a “no frills” budget airline that focuses on short-haul flights, with a more expensive, higher-end offering and see what people Tweet about each.

First, here’s the sentiment that people showed towards every aspect in Tweets that mention Lufthansa.

Below is the same analysis of Tweets that mention Ryanair.

You can see that people express generally more positive sentiment towards Lufthansa than Ryanair.  This is no real surprise since this is a comparison of a budget airline with a higher-end competitor, and you would expect people’s opinions to differ on things like food and flight experience.

But it’s interesting to note the sentiment was actually pretty similar towards the two core aspects of air travel – punctuality and value.

The most obvious outlier here is the overwhelmingly negative sentiment about entertainment on Ryanair flights, especially since there is no entertainment on Ryanair flights. This spike in negativity was due to an incident involving drunk passengers on a Ryanair flight that was covered by the media on the day we gathered our Tweets, skewing the sentiment in the Tweets we collected. These temporary fluctuations are a problem inherent in looking at snapshot-style data samples, but from a voice-of-the-customer point of view they are certainly something an airline needs to be aware of.

This is just one example of how you can use our Text Analysis API to extract meaning from content at scale. If you’d like to use AYLIEN to extract insights from any text you have in mind, click on the image at the end of the post to get free access to the API and start analyzing your data. With extensive documentation, how-to blogs, detailed tutorials and great customer support, you’ll have all the help you need to get going in no time!

For the next instalment of our monthly media roundup using our News API, we thought we’d take a look at the content that was shared most on social media in the month of May. Finding out what content performs well on each social network gives us valuable insights into what media people are consuming and how this varies across different networks. To get these insights, we’re going to take a look at the most-shared content on Facebook, LinkedIn and Reddit.

Together, the stories we analyzed for this post were shared over 10 million times last month. Using the News API, we can easily extract insights about this content in a matter of minutes. With millions of new stories added every month in near real-time, News API users can analyze news content at any scale for whatever topic they want to dig into.

### Most Shared Stories on Each Social Network

Before we jump into all of this content, let’s take a quick look at what the top three most-shared stories on each social network were. Take particular note of the style of articles and the subject matter of each article and how they differ across each social network.

Most shared stories on Facebook in May

1. “Drowning Doesn’t Look Like Drowning,” Slate, 1,337,890 shares.
2. “This ‘All About That Bass’ Cover Will Make Every Mom Crack Up,” Popsugar, 913,768 shares.
3. “Why ’80s Babies Are Different Than Other Millennials,” Popsugar, 889,788 shares.

Most shared stories on LinkedIn in May

1. “10 Ways Smart People Stay Calm,” Huffington Post UK, 8,398 shares.
2. “Pepsi Turns Up The Heat This Summer With Release Of Limited-Edition Pepsi Fire,” PR Newswire, 7,769 shares.
3. “In Just 3 Words, LinkedIn’s CEO Taught a Brilliant Lesson in How to Find Great People,” Inc.com, 7,389 shares.

Most shared stories on Reddit in May

1. “Trump revealed highly classified information to Russian foreign minister and ambassador,” The Washington Post, 146,534 upvotes.
2. “Macron wins French presidency by decisive margin over Le Pen,” The Guardian, 115,478 upvotes.
3. “Youtube family who pulled controversial pranks on children lose custody,” The Independent, 101,153 upvotes.

## Content Categories

Even from the article titles alone, you can already see there is a difference between the type of stories that do well on each social network. Of course it’s likely you already knew this if you’re active on any of these particular social networks. To start our analysis, we decided to try and quantify this difference by gathering the most-shared stories on each network and categorizing them automatically using our News API to look for particular insights.

From this analysis, you can see a clear difference in the type of content people are more likely to share on each network.

### LinkedIn

LinkedIn users predictably share a large amount of career-focused content. More surprisingly, stories that fall into the Society category were also very popular on LinkedIn.

### Reddit

Reddit is a content-sharing website that has a reputation for being a place where you can find absolutely anything, especially more random, alternative content than you would find on other social media. So it might come as a bit of a surprise to see that over half of the most-shared content on Reddit falls into just two categories, Politics and News.

#### Most-shared stories by category on Reddit in May

### Facebook

Not surprisingly, our analysis, as shown in the pie chart below, shows that almost half of the most-shared stories on Facebook are about either entertainment or food.

#### Most-shared stories by category on Facebook in May

Note: As a reminder we’ve only analyzed the most shared, liked and upvoted content on each platform.

## Topics and Keywords

So far we’ve looked at what categories the most shared stories fall into across each social channel, but we also wanted to dig a little deeper into the topics they discussed in order to understand what content did better on each network. We can do this by extracting keywords, entities and concepts that were mentioned in each story and see which were mentioned most. When we do this, you can see a clear difference between the topics people share on each network.
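A naive version of this kind of keyword counting can be sketched as below. This Counter-based approach is a toy stand-in: the actual analysis used the News API’s keyword, entity and concept extraction over full article text, and the titles here are hypothetical:

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "in", "on", "of", "to", "and", "is", "for"}

def top_keywords(titles, n=3):
    """Naive keyword extraction: tokenise, drop stopwords, count."""
    tokens = []
    for title in titles:
        tokens += [w for w in re.findall(r"[a-z']+", title.lower())
                   if w not in STOPWORDS]
    return Counter(tokens).most_common(n)

# Hypothetical story titles.
titles = [
    "10 Ways Smart People Stay Calm",
    "How Smart People Handle Difficult People",
    "Smart Hiring: How to Find Great People",
]
print(top_keywords(titles, n=2))  # [('people', 4), ('smart', 3)]
```

Running this over each network’s most-shared stories surfaces the dominant topics per platform.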

### LinkedIn

Below, you can see the keywords from the most-shared stories on LinkedIn. These keywords are mostly business-focused, which validates what we found with the categories feature above.

### Reddit

Likewise with Reddit, you can see below that the keywords validate what the categorization feature found – that most of the content is about politics and news.

#### Keywords extracted from the most-shared stories on Reddit in May

### Facebook

However, on Facebook the most popular content tends to include mentions of family topics like “father,” “kids,” and “baby” (with the obligatory mentions of “Donald Trump,” of course). This doesn’t correspond with what we found when we looked at the categories the stories belonged to – Arts & Entertainment and Food made up almost 50% of the most-shared content. Take a look below at which keywords appeared most frequently in the most-shared content.

#### Keywords extracted from the most-shared stories on Facebook in May

In order to find out why there wasn’t as clear a correlation between keywords and categories as we saw on the other platforms, we decided to dive into where this most-shared content on Facebook was coming from. Using the source domain feature on the stories endpoint, we found that over 30% of the most-shared content was published by one publication – Popsugar. Popsugar, for those who don’t know, is a popular lifestyle media publisher whose content is heavily weighted towards family-oriented material with a strong celebrity slant. This means a lot of the content published on Popsugar could be categorized as Arts & Entertainment while also talking about families.

## Content Length

After we categorized the stories and analyzed the topics they discuss, we thought it might be interesting to understand what type of content, long-form or short-form, performs best on each platform. We wanted to see if the length of an article is a good indicator of how content performs on a social network. Our guess was that shorter pieces of content would perform best on Facebook, while longer articles would most likely be more popular on LinkedIn. Using the word count feature on the histograms endpoint, it’s extremely easy to understand the relationship between an article’s popularity and its length.

For example, below you can see that the content people shared most on Facebook was usually between 0 and 100 words in length, with people sharing longer posts on LinkedIn and Reddit.
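Conceptually, the histograms endpoint buckets articles by length. A minimal sketch of that bucketing, with invented articles (not the endpoint’s actual implementation), looks like this:

```python
def word_count_histogram(articles, bucket_size=100):
    """Bucket articles by word count (bucket 0 is 0-99 words, and so on)."""
    buckets = {}
    for text in articles:
        bucket = (len(text.split()) // bucket_size) * bucket_size
        buckets[bucket] = buckets.get(bucket, 0) + 1
    return dict(sorted(buckets.items()))

articles = [
    "short post " * 20,           # 40 words  -> bucket 0
    "medium length piece " * 50,  # 150 words -> bucket 100
    "longform essay " * 120,      # 240 words -> bucket 200
]
print(word_count_histogram(articles))  # {0: 1, 100: 1, 200: 1}
```

Comparing these distributions for each network, weighted by shares, is what produced the chart above.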

## Conclusions

So to wrap up, we can come to some conclusions about what content people shared in May:

1. People shared shorter, family-oriented and lighthearted content on Facebook;
2. Longer, breaking news content involving Donald Trump dominated Reddit;
3. On LinkedIn, people shared both short and long content that mainly focused on career development and companies.

If you’d like to try the News API out for yourself, click on the image below to start your free 14-day trial, with no credit card required.
