
It’s an exciting time here at AYLIEN – in the past couple of months, we’ve moved office, closed a funding round, and added six people to the team. We’re delighted to announce our most recent hire and our first Chief Architect and Principal Engineer, Hunter Kelly.


The AYLIEN tech infrastructure has grown to quite a scale at this point. In addition to serving over 30,000 users with three product offerings, we’re also a fully-functional AI research lab that houses five full-time researchers, who in turn feed their findings back into the products. With such a complex architecture and backends handling highly demanding tasks, bringing in an engineer with the breadth and quality of experience that Hunter has is a huge boost as we move into the next phase of our journey.

At first glance, Hunter’s career has followed a seemingly meandering path through some really interesting companies. After graduating from UC Berkeley in the 90s, he joined the Photoscience Department at Pixar in California, became one of the first engineers in Google’s Dublin office, and at NewBay, he designed and built a multi-petabyte storage solution for handling user-generated content, still in use by some of the world’s largest telcos today. Hunter is joining us from Zalando’s Fashion Insight Centre, where as the first engineer in the Dublin office he kicked off the Fashion Content Platform, which was shortlisted as a finalist in the 2017 DatSci Awards.

The common thread in those roles, while perhaps not obvious, is data. Hunter brings this rich experience working on data, both from an engineering and data science perspective, to focus on one extremely important problem – how can we leverage data to solve our hardest problems?

This question is central to AI research, and Hunter’s expertise is a perfect fit with AYLIEN’s mission to make Natural Language Processing hassle-free for developers. Our APIs handle the heavy lifting so developers can leverage Deep NLP in a couple of lines of code, and the ease with which our users do this is down to the great work our science and engineering teams do. Adding Hunter to the intersection of these teams will add a huge amount to our capabilities here and we’re really excited about the great work we can get done.

Here’s what Hunter had to say about joining the team:

“I’m really excited to be joining AYLIEN at this point in time. I think that AI and Machine Learning are incredibly powerful tools that everyone should be able to leverage. I really look forward to being able to bring my expertise and experience, particularly with large-scale and streaming data platforms, to AYLIEN to help broaden their already impressive offerings. AI is just reaching that critical point of moving beyond academia and reaching wide-scale adoption. Making its power accessible to the wider community beyond very focused experts is a really interesting and exciting challenge.”

When he’s not in AYLIEN, Hunter can be found spending time with his wife and foster children, messing around learning yet another programming language, painting minis, playing board games, tabletop RPGs, and wargames, or spending too much time playing video games. He’s also been known to do some salsa dancing, traveling, sailing, and scuba diving.

Check out Hunter’s talks on his most recent work at ClojureConj and this year’s Kafka Summit.






Text Analysis API - Sign up




 


Juggernaut is an experimental neural network library, written in Rust. It implements a feed-forward neural network that uses gradient descent to fit the model and train the network. Juggernaut enables us to build web applications that can train and evaluate a neural network model in the context of the web browser, without any servers or backends and without using JavaScript to train the model.

Juggernaut’s developer-friendly API makes it easy to interact with. You can pass a dataset to Juggernaut from a CSV file or simply use the programmatic API to add documents to the model, and then ask the framework to train it. Juggernaut implements most activation functions as well as a few different cost functions, including Cross Entropy.

Juggernaut has a demo page, written with React and D3.js, which illustrates the network, its weights, and the loss during a training session.

Demo

The demo page enables users to define a few options before starting the training session. These options are:

  • Dataset
  • Learning rate
  • Number of epochs (iterations)

In order to make the demo page more intuitive and easier to use, there are a few predefined datasets available on the page, which load and illustrate data points from a CSV file. Each dataset has three classes (orange, blue, and green) and two features (X and Y).


After selecting the dataset and defining the options, you can start the training session by clicking the “Train” button on the page. Clicking this button spawns a new thread (a web worker) and passes the dataset and parameters to it.

During the training session, you can see the number of epochs, loss, and weights of the network. The web worker communicates with the main thread of the browser and sends the result back to the render thread to visualize each step of training.


The number of layers is predefined in the application. We have one input layer, two hidden layers, and one output layer. The hidden layers use the ReLU activation function, and the output layer uses Softmax with a Cross Entropy cost function.
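
As a point of reference, here is a minimal NumPy sketch (ours, not Juggernaut’s Rust internals) of the forward pass this architecture describes: two ReLU hidden layers followed by a softmax output scored with cross entropy. All names and sizes here are illustrative.

import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(z):
    # subtract the row max for numerical stability
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, one_hot):
    # average negative log-likelihood of the true classes
    return -np.mean(np.sum(one_hot * np.log(probs + 1e-12), axis=1))

# toy forward pass: 2 features -> two hidden layers -> 3 classes
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                  # five points with features X and Y
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)
W3, b3 = rng.normal(size=(8, 3)), np.zeros(3)

h1 = relu(X @ W1 + b1)
h2 = relu(h1 @ W2 + b2)
probs = softmax(h2 @ W3 + b3)                # one probability per class

one_hot = np.eye(3)[[0, 1, 2, 0, 1]]         # made-up labels for the five points
print(cross_entropy(probs, one_hot))         # the loss gradient descent minimizes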

Compiling Rust to Web Assembly

Juggernaut’s demo page uses Web Assembly and an HTML5 Web Worker to spawn a new thread inside the context of the web browser, and communicates between the web worker and the browser’s render thread (main thread) to train and evaluate the model.

Below is the process of compiling Rust to Web Assembly:

[Figure: the Rust-to-WebAssembly compilation pipeline]

Juggernaut does not use any JavaScript code to train and evaluate a model. However, it is still possible to run Juggernaut on modern web browsers without any backend servers, as most modern browsers, including Edge and the mobile browsers on Android and iOS, support Web Assembly (source: http://caniuse.com/#search=wasm).

Importantly, the demo page uses a separate thread to train and evaluate the model, and does not block the main (render) thread of the web browser. You can still interact with the UI elements of the page during training, or leave the training session running until the framework converges on an accurate evaluation.

“Juggernaut”?

The real Juggernaut

Juggernaut is a Dota 2 hero, and one I like: he’s powerful once he has enough farm.






Text Analysis API - Sign up






With the world’s media now publishing news content at web scale, it’s possible to leverage this content and discover what news the world is consuming in real time. Using the AYLIEN News API, you can both look for high-level trends in global media coverage and also dive into the content to discover what the world is talking about.   

So in this blog we’re going to take a high-level look at some of the more than 2.5 million stories our News API gathered, analyzed, and indexed last month, and see what we find. Instead of searching for stories using a detailed search query, we’re simply going to retrieve articles written in English and see what patterns emerge in this content.

To understand the distribution of articles published over time, we’ll use the Time Series endpoint. This endpoint allows us to see the volume of stories published over time, according to whatever parameters you set. To get an overview of recent trends in content publishing, we simply set the language parameter to English. Take a look at the pattern that emerges over the past two months:

The first thing you’ll notice is how regular the publishing industry’s output is – a steady flow of around 60,000 new stories in English every weekday, dropping to about 30,000 stories on weekends. This pattern holds very consistently, except in the last week of the month, when a small but noticeable spike in story volume occurs.
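
For reference, here is a minimal sketch, using Python’s requests library, of the kind of Time Series query behind this chart; the endpoint and parameter names follow the News API v1 docs, and the credentials are placeholders.

import requests

# query daily English-language story volumes for the past two months
response = requests.get(
    "https://api.newsapi.aylien.com/api/v1/time_series",
    headers={
        "X-AYLIEN-NewsAPI-Application-ID": "your app id here",
        "X-AYLIEN-NewsAPI-Application-Key": "your app key here",
    },
    params={
        "language[]": "en",                  # English-language stories only
        "published_at.start": "NOW-60DAYS",  # the past two months
        "published_at.end": "NOW",
        "period": "+1d",                     # one data point per day
    },
)

for point in response.json()["time_series"]:
    print(point["published_at"], point["count"])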

What caused this spike in story volume?

To find the cause of these extra 2,000 – 4,000 stories, we browsed the volume numbers of the biggest categories to see if we could identify a particular subject category which followed the same pattern. We found an unmistakable match in the Finance category – as well as taking place in the same period, this spike also matches the volume of extra stories – roughly an extra 2,000 stories above the daily average.

In addition to this, we also found a similar spike at the end of July. Take a look at the daily story volume of finance stories published over the past six months:

What topics were discussed in this content?

Knowing that the increase in story volume in the last week of October was due to a spike in the number of Finance stories is great, but we can go further and see what was actually being talked about in these stories. To do this, we leveraged our News API’s capability to analyze the keywords, entities and concepts mentioned. The News API allows you to discover trends like this in news stories using the Trends endpoint.
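
As a rough sketch, a Trends query for the keywords in that week’s Finance stories might look like this. Note that IAB13 as the Finance category id in the iab-qag taxonomy is our assumption, so check the taxonomy reference before reusing it; the dates approximate the last week of October.

import requests

response = requests.get(
    "https://api.newsapi.aylien.com/api/v1/trends",
    headers={
        "X-AYLIEN-NewsAPI-Application-ID": "your app id here",
        "X-AYLIEN-NewsAPI-Application-Key": "your app key here",
    },
    params={
        "field": "keywords",                 # rank the most-mentioned keywords
        "language[]": "en",
        "categories.taxonomy": "iab-qag",
        "categories.id[]": "IAB13",          # our assumed Finance category id
        "published_at.start": "2017-10-23T00:00:00Z",
        "published_at.end": "2017-10-30T00:00:00Z",
    },
)

for trend in response.json()["trends"][:10]:
    print(trend["value"], trend["count"])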

Analyzing keywords lets us get an overview of which people, organizations, and things were mentioned most in the roughly 10,000 finance stories published in that week. Looking at the chart below, it’s pretty easy to see what caused the spike.

The bubble chart shows that keywords and concepts like “quarter”, “earnings”, “financial”, and “company” were identified most often. From this analysis we can make a good guess that a lot of the content published in the last week of October was related to quarterly results and financial reporting by companies. This makes a lot of sense, since in the Time Series chart we saw a similar spike at the end of July, three months earlier.

We thought this was interesting – why was so much published about something so arcane to the general public? In the first graph, we could see the spike that these quarterly earnings reports caused on story volume was visible on a chart of all stories published in the English language. But we don’t think quarterly earnings reports would make the everyday news content consumer drop everything they are doing to check the news.

Why were people interested in quarterly earnings reports?

So we know from the spike in story volume that the media were interested in the quarterly earnings reports. But what were social media users interested in? To find out, we decided to gather the most-shared stories from the Finance categories from the last week of October – the week of the spike. This will tell us if there was a particular aspect to the earnings reports that prompted such a spike in this topic.

The News API lets us do that with the Stories endpoint, by simply searching for the most-shared stories from the Finance category during the last week of October across Facebook, LinkedIn, and Reddit.
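
Here is a sketch of that Stories query, sorted by Facebook shares; the sort key follows the News API docs (swap in .linkedin or .reddit for the other networks), and the category id and dates carry over the assumptions from the earlier sketch.

import requests

response = requests.get(
    "https://api.newsapi.aylien.com/api/v1/stories",
    headers={
        "X-AYLIEN-NewsAPI-Application-ID": "your app id here",
        "X-AYLIEN-NewsAPI-Application-Key": "your app key here",
    },
    params={
        "language[]": "en",
        "categories.taxonomy": "iab-qag",
        "categories.id[]": "IAB13",          # our assumed Finance category id
        "published_at.start": "2017-10-23T00:00:00Z",
        "published_at.end": "2017-10-30T00:00:00Z",
        "sort_by": "social_shares_count.facebook",
        "per_page": 3,                       # the top three stories
    },
)

for story in response.json()["stories"]:
    print(story["title"])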

You can see that of the nine stories we gathered, five are about the earnings reports, and despite this being quite a business-focused topic, Facebook was the network that this content was most popular on, not LinkedIn.

Facebook:

  1. “Swiss bank UBS reports 14 percent growth in 3Q net profit,” Associated Press, 39,907 shares
  2. “Amazon shares soar as earnings beat expectations,” Associated Press, 39,870 shares
  3. “US stocks higher as banks and technology companies rebound,” Associated Press, 39,867 shares

LinkedIn:

  1. “Jeff Bezos is now the richest man in the world with $90 billion,” CNBC, 7,318 shares
  2. “CVS Reportedly Looking To Buy Aetna Insurance For $66 Billion,” Consumerist, 2,172 shares
  3. “New Uber Visa Credit Card From Barclays Coming Next Week,” Forbes, 1,967 shares

Reddit:

  1. “First reading on third-quarter GDP up 3.0%, vs 2.5% rise expected,” CNBC, 14,807 upvotes
  2. “New study says Obamacare premiums will jump in 2018 — in large part because of Trump,” Business Insider, 6,255 upvotes
  3. “MSNBC host literally left his seat to fact-check Jim Renacci,” Cleveland, 4,641 upvotes

 

Of the nine most-shared stories on social media in the week of the spike, only five actually mention the quarterly earnings reports. This suggests that although the media published a huge amount about the reports, people in general weren’t too interested in them.

Since we are basing this assumption on just a few headlines above, it’s just a hunch. But with the News API, we can put hunches like this to the test by analyzing quantitative data.

Exactly how interested were people in quarterly earnings reports?

In order to be a bit more accurate about how interested people were in the quarterly earnings reports, we compared the share counts of the 100 most-shared stories from the week of the spike with those from the corresponding week of the previous month. We can do this using the News API’s Stories endpoint, since the News API monitors the share count of every story it indexes. We’re focusing on Facebook since, in the previous section, it was the social network where the quarterly earnings reports were most popular.

Take a look at how often people were sharing the 100 most-shared stories on Facebook in the last week of October:  

You can see that people were sharing Finance stories less often in the last week of October than in the same period in September. This is interesting because we already saw that there were over three times more Finance stories published in the same period, so we have to assume that people on social media generally just weren’t interested in these stories.

This shows us that looking at viral stories about a subject can mislead us about how interested people really are in that subject.

Well that concludes this month’s roundup of news with the News API. If you want to dive into the world’s news content and use text analysis to extract insights, our News API has a two-week free trial, which you can activate by clicking the link below.






News API - Sign up






In a previous blog, we showed you how easy it is to set up a simple social listening tool to monitor chatter on Twitter. We showed you how, using a single Python script, you can gather recent Tweets about a topic that interests you, analyze their sentiment, and produce a visualization of the results.

But as we also pointed out in that blog, Twitter users post 350,000 new Tweets every minute, giving us a live dataset that will contain new information every time we query it. In fact, Twitter is so responsive to emerging trends that the US Geological Survey uses it to detect earthquakes, because the rate at which people Tweet about them outpaces even its own pipelines of geological and seismic data.

So if we can rely on Twitter users to Tweet about earthquakes during the actual earthquakes, we can absolutely rely on them to Tweet their opinions about subjects important to you, in real time. A single sentiment analysis gives you a snapshot of what people are saying at one moment in time, but it’s even more useful to analyze sentiment on a regular basis to understand how public opinion, or better yet your customers’ opinions, can change over time.

Why is having an automated sentiment analysis workflow useful?

There are two main reasons.

  • It will allow you to keep abreast of any trends in consumer sentiment shown towards you, your products, or your competitors.
  • Over time, this workflow will build up an extremely valuable longitudinal dataset that you can compare with sales trends, website traffic, or any of your KPIs.

In this blog, we’re going to show you how to turn the original script we shared into a fully automated tool that runs your analysis at a given time every day, or however frequently you schedule it. It will gather Tweets, analyze their sentiment, and, if you want it to, produce a temporary visualization so you can read the data quickly. It will also gradually build up a record of the Tweets and their sentiment by adding each day’s results to the same CSV file.

 

The 4 Steps for setting up your Twitter Sentiment Analysis Tool (in 20 mins)

This blog is split up into 4 steps, which all together should take about 20 minutes to complete:

  1. Get your credentials from Twitter and AYLIEN – both of these are free (at 10 minutes, this is the most time-consuming part)
  2. Set up a folder and copy the Python script into it (2 mins)
  3. Run the Python script for your first sentiment analysis (2 mins)
  4. Schedule the script to run every day – we’ve included a detailed guide for both Windows and Mac below. (5 mins)

 

Step 1: Getting your credentials

If you completed the last blog, you can skip this part, but if you didn’t, follow these three steps:

  1. Make sure you have the tweepy, matplotlib, and aylien-apiclient libraries installed (which you can do with pip).
  2. Get API keys for Twitter:
  • Getting the API keys from Twitter Developer (which you can do here) is the most time-consuming part of this process, but this video can help you if you get lost.
  • What it costs & what you get: the free Twitter plan lets you download 100 Tweets per search, and you can search Tweets from the previous seven days. If you want to upgrade from either of these limits, you’ll need to pay for the Enterprise plan ($$).
  3. Get API keys for AYLIEN:
  • To do the sentiment analysis, you’ll need to sign up for our Text API’s free plan and grab your API keys, which you can do here.
  • What it costs & what you get: the free Text API plan lets you analyze 30,000 pieces of text per month (1,000 per day). If you want to make more than 1,000 calls per day, our Micro plan lets you analyze 80,000 pieces of text for $49/month.

 

Step 2: Set up a folder and copy the Python script into it

Setting up a folder for this project will make everything a lot tidier and easier in the long run. So create a new one, and copy the Python script below into it. After you run this script for the first time, it will create a CSV in the folder, which is where it will store the Tweets and their sentiment every time it runs.

Here is the Python script:


import os
import sys
import csv
import tweepy
import matplotlib.pyplot as plt

from collections import Counter
from aylienapiclient import textapi

open_kwargs = {}

if sys.version_info[0] < 3:
    input = raw_input
else:
    open_kwargs = {'newline': ''}



# Twitter credentials
consumer_key = "Your consumer key here"
consumer_secret = "your secret consumer key here"
access_token = "your access token here"
access_token_secret = "your secret access token here"

# AYLIEN credentials
application_id = "your app id here"
application_key = "your app key here"

# set up an instance of Tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# set up an instance of the AYLIEN Text API
client = textapi.Client(application_id, application_key)

# read the last Tweet ID from the previous CSV file, if it exists, so we only gather newer Tweets
max_id = 0

file_name = 'Sentiment_Analysis_of_Tweets_About_Your_query.csv'

if os.path.exists(file_name):
    with open(file_name, 'r') as f:
        for row in csv.DictReader(f):
            max_id = int(row['Tweet_ID'])  # ends up holding the newest ID in the file
else:
    with open(file_name, 'w', **open_kwargs) as f:
        csv.writer(f).writerow([
            "Tweet_ID",
            "Time",
            "Tweet",
            "Sentiment"])

results = api.search(
    lang="en",
    q="Your_query -rt",
    result_type="recent",
    count=10,
    since_id=max_id
)

results = sorted(results, key=lambda x: x.id)

print("--- Gathered Tweets \n")

# open a csv file to store the Tweets and their sentiment
with open(file_name, 'a', **open_kwargs) as csvfile:
    csv_writer = csv.DictWriter(
        f=csvfile,
        fieldnames=[
                    "Tweet_ID",
                    "Time",
                    "Tweet",
                    "Sentiment"]
    )

    print("--- Opened a CSV file to store the results of your sentiment analysis... \n")

    # tidy up the Tweets and send each to the AYLIEN Text API
    for c, result in enumerate(results, start=1):
        tweet = result.text
        tidy_tweet = tweet.strip().encode('ascii', 'ignore').decode('ascii')  # strip non-ASCII characters
        tweet_time = result.created_at
        tweet_id = result.id

        if not tweet:
            print('Empty Tweet')
            continue

        response = client.Sentiment({'text': tidy_tweet})
        csv_writer.writerow({
            "Tweet_ID": tweet_id,
            "Time": tweet_time,
            "Tweet": response['text'],
            "Sentiment": response['polarity'],
        })

        print("Analyzed Tweet {}".format(c))

# count the data in the Sentiment column of the CSV file
with open(file_name, 'r') as data:
    counter = Counter()
    for row in csv.DictReader(data):
        counter[row['Sentiment']] += 1

    positive = counter['positive']
    negative = counter['negative']
    neutral = counter['neutral']

# declare the variables for the pie chart, using the Counter variables for "sizes"
colors = ['green', 'red', 'grey']
sizes = [positive, negative, neutral]
labels = 'Positive', 'Negative', 'Neutral'

# use matplotlib to plot the chart
plt.pie(
    x=sizes,
    shadow=True,
    colors=colors,
    labels=labels,
    startangle=90
)

plt.title("Sentiment of {} Tweets about Your Subject".format(sum(counter.values())))
plt.show()

If you want any part of that script explained, our previous blog breaks it up into pieces and explains each one.

Step 3: Run the Python script for your first sentiment analysis

Before you run the script above, you’ll need to make two simple changes. First, enter your access keys from Twitter and AYLIEN. Second, don’t forget to enter what it is you want to analyze! You’ll need to do this in two places: first, on line 40, change the name of the CSV file that the script is going to create (currently it’s file_name = ‘Sentiment_Analysis_of_Tweets_About_Your_query.csv’). Second, on line 56, replace the text “Your_query” with whatever your query is.

Also, if you don’t feel that you need a daily report and you’re more interested in building up a record of sentiment analysis, just delete everything below “# count the data in the Sentiment column of the CSV file”. The script will then carry out the sentiment analysis every day and add the results to the CSV file, but won’t show a visualization.

After you’ve run this script, your folder should contain the Python script and the CSV file.

Step 4: Schedule the Python script to run every day.

So now that you’ve got your Python script saved in a folder along with a CSV file containing the results of your first sentiment analysis, we’re ready for the final step – setting the script to run on a schedule that suits you.

Depending on whether you use Windows or Mac, there are different steps to take here, but you don’t need to install anything either way. Mac users will use Cron, whereas Windows users can use Task Scheduler.

Windows:

  1. Open Task Scheduler (search for it in the Start menu)
  2. Click into Task Scheduler Library on the left of your screen
  3. Open the ‘Create Task’ box on the right of your screen
  4. Give your task a name
  5. Add a new Trigger: select ‘daily’ and enter the time of day you want your computer to run the script. Remember to select a time of day that your computer is likely to be running.
  6. Open up the Actions tab and click New
  7. In the box that opens up, make sure “Start a program” is selected
  8. In the “Program/Script” box, enter the path to your python.exe file and make sure this is enclosed in quotation marks (so something like “C:\Program Files (x86)\Python36-32\python.exe”)
  9. In the “Add arguments” box, enter the path to the dailySentiment.py file, including the file itself (so something like C:\Users\Yourname\desktop\folder\dailySentiment.py). No quotation marks are needed here.
  10. In the “Start in” box, enter the path to the containing folder with your script and CSV file. (C:\Users\Yourname\desktop or wherever your folder is\the name of your folder\). Again, no quotation marks are needed.
  11. You’re done!

Mac:

  1. Open up a Terminal
  2. Type “crontab -e” to create a new Cron job.
  3. Scheduling the Cron job takes one line of code, that is split into three parts.
  4. First, type the time you want your script to run, according to the Cron format, which is “minute hour day month weekday”, all in integers, separated by single spaces. For example, if you want your script to gather Tweets every day at 9AM, this first part of the line will read “0 9 * * *” – minute zero of hour nine, every day of every month.
  5. Second, leave a space and type the location of your Python executable. This part will usually read something like “/System/Library/Frameworks/Python.framework/Python”
  6. Finally, enter the path to the Python script in your folder. For example, if you saved the script to a folder on your desktop, the path will be something like “/Users/Your name/Desktop/folder name/dailySentiment.py”
  7. The full line in your Terminal will look something like “0 9 * * * /System/Library/Frameworks/Python.framework/Python /Users/Your name/Desktop/folder name/dailySentiment.py”.
  8. Now hit Escape, then type “:wq”, and hit Enter.
  9. To double check that your Cron job is scheduled, type “crontab -l” and you should see your job listed.

If you run into trouble, get in touch!

With those four steps, your automated workflow should be up and running, but depending on how your system is set up, you could run into an error along the way. If you do, don’t hesitate to get in touch by sending us an email, leaving a comment, or chatting with us on our site.

Happy analyzing!






Text Analysis API - Sign up





Twitter users around the world post around 350,000 new Tweets every minute, creating roughly 6,000 140-character pieces of information every second. Twitter is now a hugely valuable resource from which you can extract insights by using text mining tools like sentiment analysis.

Within the social chatter being generated every second, there are vast amounts of hugely valuable insights waiting to be extracted. With sentiment analysis, we can generate insights about consumers’ reactions to announcements, opinions on products or brands, and even track opinion about events as they unfold. For this reason, you’ll often hear sentiment analysis referred to as “opinion mining”.

With this in mind, we decided to put together a useful tool built on a single Python script to help you get started mining public opinion on Twitter.

What the script does

Using this one script you can gather Tweets with the Twitter API, analyze their sentiment with the AYLIEN Text Analysis API, and visualize the results with matplotlib – all for free. The script also provides a visualization and saves the results for you neatly in a CSV file to make the reporting and analysis a little bit smoother.

Here are some of the cool things you can do with this script:

  • Understand the public’s reaction to news or events on Twitter
  • Measure the voice of your customers and their opinions on you or your competitors
  • Generate sales leads by identifying negative mentions of your competitors

You can see the script running a sample analysis of 50 Tweets mentioning Tesla in our example GIF below – storing the results in a CSV file and showing a visualization. The beauty of the script is that you can search for whatever you like, and it will run your Tweets through the same analysis pipeline. 😉

[GIF: the script analyzing 50 Tweets mentioning Tesla]

 

Installing the dependencies & getting API keys

Since doing a sentiment analysis of Tweets with our API is so easy, installing the libraries and getting your API keys is by far the most time-consuming part of this blog.

We’ve collected them here as a four-step to-do list:

  1. Make sure you have the tweepy, matplotlib, and aylien-apiclient libraries installed (which you can do with pip).
  2. Get API keys for Twitter:
  • Getting the API keys from Twitter Developer (which you can do here) is the most time-consuming part of this process, but this video can help you if you get lost.
  • What it costs & what you get: the free Twitter plan lets you download 100 Tweets per search, and you can search Tweets from the previous seven days. If you want to upgrade from either of these limits, you’ll need to pay for the Enterprise plan ($$).
  3. Get API keys for AYLIEN:
  • To do the sentiment analysis, you’ll need to sign up for our Text API’s free plan and grab your API keys, which you can do here.
  • What it costs & what you get: the free Text API plan lets you analyze 30,000 pieces of text per month (1,000 per day). If you want to make more than 1,000 calls per day, our Micro plan lets you analyze 80,000 pieces of text for $49/month.
  4. Copy, paste, and run the script below!

 

The Python script

When you run this script it will ask you to specify what term you want to search Tweets for, and then to specify how many Tweets you want to gather and analyze.


import sys
import csv
import tweepy
import matplotlib.pyplot as plt

from collections import Counter
from aylienapiclient import textapi

if sys.version_info[0] < 3:
   input = raw_input

## Twitter credentials
consumer_key = "Your consumer key here"
consumer_secret = "your secret consumer key here"
access_token = "your access token here"
access_token_secret = "your secret access token here"

## AYLIEN credentials
application_id = "Your app ID here"
application_key = "Your app key here"

## set up an instance of Tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

## set up an instance of the AYLIEN Text API
client = textapi.Client(application_id, application_key)

## search Twitter for something that interests you
query = input("What subject do you want to analyze for this example? \n")
number = input("How many Tweets do you want to analyze? \n")

results = api.search(
   lang="en",
   q=query + " -rt",
   count=number,
   result_type="recent"
)

print("--- Gathered Tweets \n")

## open a csv file to store the Tweets and their sentiment 
file_name = 'Sentiment_Analysis_of_{}_Tweets_About_{}.csv'.format(number, query)

with open(file_name, 'w', newline='') as csvfile:
   csv_writer = csv.DictWriter(
       f=csvfile,
       fieldnames=["Tweet", "Sentiment"]
   )
   csv_writer.writeheader()

   print("--- Opened a CSV file to store the results of your sentiment analysis... \n")

## tidy up the Tweets and send each to the AYLIEN Text API
   for c, result in enumerate(results, start=1):
       tweet = result.text
       tidy_tweet = tweet.strip().encode('ascii', 'ignore').decode('ascii')  # strip non-ASCII characters

       if len(tweet) == 0:
           print('Empty Tweet')
           continue

       response = client.Sentiment({'text': tidy_tweet})
       csv_writer.writerow({
           'Tweet': response['text'],
           'Sentiment': response['polarity']
       })

       print("Analyzed Tweet {}".format(c))

## count the data in the Sentiment column of the CSV file 
with open(file_name, 'r') as data:
   counter = Counter()
   for row in csv.DictReader(data):
       counter[row['Sentiment']] += 1

   positive = counter['positive']
   negative = counter['negative']
   neutral = counter['neutral']

## declare the variables for the pie chart, using the Counter variables for "sizes"
colors = ['green', 'red', 'grey']
sizes = [positive, negative, neutral]
labels = 'Positive', 'Negative', 'Neutral'

## use matplotlib to plot the chart
plt.pie(
   x=sizes,
   shadow=True,
   colors=colors,
   labels=labels,
   startangle=90
)

plt.title("Sentiment of {} Tweets about {}".format(number, query))
plt.show()

If you’re new to Python, text mining, or sentiment analysis, the next sections will walk through the main sections of the script.

 

The script in detail

Python 2 & 3

With the migration from Python 2 to Python 3, you can run into a ton of problems working with text data (if you’re interested, check out a great summary of why by Nick Coghlan). One of the changes is that Python 3’s input() returns a string, whereas Python 2’s input() evaluates what you type as a Python expression, so on Python 2 these lines alias input to raw_input(), which returns a string.


if sys.version_info[0] < 3:
   input = raw_input

Input your search

The goal of this post is to make it as quick and easy as possible to analyze the sentiment of Tweets that interest you. This script does that by letting you easily change the search term and sample size every time you run the script from the shell using the input() method.


query = input("What subject do you want to analyze for this example? \n")
number = input("How many Tweets do you want to analyze? \n")

Run your Twitter query

We’re grabbing the most recent Tweets relevant to your query, but you can change this to ‘popular’ if you want to mine only the most popular Tweets published, or ‘mixed’ for a bit of both. You can see we’ve also decided to exclude retweets, but you might decide that you want to include them. You can check the full list of parameters here. (From our experience there can be a lot of noise in retaining Tweets that have been Retweeted.)

An important point to note here is that the Twitter API limits your results to 100 Tweets per search, and it doesn’t return an error message if you ask for more. So if you input 500 Tweets, you’ll only have 100 Tweets to analyze, while the title of your visualization will still read ‘500 Tweets.’ One way to guard against this is sketched after the snippet below.


results = api.search(
   lang="en",
   q=query + " -rt",
   count=number,
   result_type="recent"
)
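
If you want to guard against that mismatch, one option (our addition, not part of the original script) is to cap the requested number right after reading it:

# cap the sample size at Twitter's 100-Tweets-per-search limit so the
# chart title matches the number of Tweets actually analyzed
number = min(int(number), 100)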

Open a CSV file for the Tweets & Sentiment Analysis

Writing the Tweets and their sentiment to a CSV file allows you to review the API’s analysis of each Tweet. First, we open a new CSV file and write the headers.


with open(file_name, 'w', newline='') as csvfile:
   csv_writer = csv.DictWriter(
       f=csvfile,
       fieldnames=["Tweet", "Sentiment"]
   )
   csv_writer.writeheader()

Tidy the Tweets

Dealing with text on Twitter can be messy, so we’ve included this snippet to tidy up the Tweets before you do the sentiment analysis. This means that your results are more accurate, and you also don’t waste your free AYLIEN credits on empty Tweets. 😉


for c, result in enumerate(results, start=1):
   tweet = result.text
   tidy_tweet = tweet.strip().encode('ascii', 'ignore').decode('ascii')  # strip non-ASCII characters

   if len(tweet) == 0:
       print('Empty Tweet')
       continue

Write the Tweets & their Sentiment to the CSV File

You can see that actually getting the sentiment of a piece of text only takes a couple of lines of code, and here we’re writing the Tweet itself and the result of the sentiment analysis (positive, negative, or neutral) to the CSV file under the headers we already wrote. You’ll notice that we’re writing the Tweet as returned by the AYLIEN Text API instead of the Tweet we got from the Twitter API. Even though they’re the same, writing the Tweet that the AYLIEN API returns just reduces the potential for errors.

We’re also going to print something every time the script analyzes a Tweet.


response = client.Sentiment({'text': tidy_tweet})
csv_writer.writerow({
   'Tweet': response['text'],
   'Sentiment': response['polarity']
})

print("Analyzed Tweet {}".format(c))


If you want to include results on how confident the API is in the sentiment it detects in each Tweet, just add response[‘polarity_confidence’] to the writerow call above and add a corresponding header when you’re opening your CSV file.
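
For example, those two changes to the script above would look something like this, with “Confidence” as our (arbitrary) name for the extra column:

# when opening the CSV file, include the extra column...
csv_writer = csv.DictWriter(
    f=csvfile,
    fieldnames=["Tweet", "Sentiment", "Confidence"]
)
csv_writer.writeheader()

# ...and write the API's confidence score alongside each polarity
csv_writer.writerow({
    'Tweet': response['text'],
    'Sentiment': response['polarity'],
    'Confidence': response['polarity_confidence']
})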

Count the results of the Sentiment Analysis

Now that we’ve got a CSV file with the Tweets we’ve gathered and their predicted sentiment, it’s time to visualize these results so we can get an idea of the sentiment immediately. To do this, we’re just going to use Python’s standard counter library to count the number of times each sentiment polarity appears in the ‘Sentiment’ column.


with open(file_name, 'r') as data:
   counter = Counter()
   for row in csv.DictReader(data):
       counter[row['Sentiment']] += 1

   positive = counter['positive']
   negative = counter['negative']
   neutral = counter['neutral']

Visualize the Sentiment of the Tweets

Finally, we’re going to plot the results of the count above on a simple pie chart with matplotlib. This is just a case of declaring the variables and then using matplotlib to base the sizes, labels, and colors of the chart on these variables.


colors = ['green', 'red', 'grey']
sizes = [positive, negative, neutral]
labels = 'Positive', 'Negative', 'Neutral'

## use matplotlib to plot the chart
plt.pie(
   x=sizes,
   shadow=True,
   colors=colors,
   labels=labels,
   startangle=90
)

plt.title("Sentiment of {} Tweets about {}".format(number, query))
plt.show()


Go further with Sentiment Analysis

If you want to go further with sentiment analysis you can try two things with your AYLIEN API keys:

  • If you’re looking into reviews of restaurants, hotels, cars, or airlines, you can try our aspect-based sentiment analysis feature. This will tell you what sentiment is attached to each aspect of a Tweet – for example positive sentiment shown towards food but negative sentiment shown towards staff.
  • If you want sentiment analysis customized for the problem you’re trying to solve, take a look at TAP, which lets you train your own language model from your browser.

Building a Sentiment Analysis Workflow for your Organization

This script is built to give you a snapshot of sentiment at the time you run it, so to keep abreast of any change in sentiment towards an organization you’re interested in, you should try running this script every day.

In our next blog, we’ll have a couple of simple updates for this script that will set up a simple, fully automated process to keep an eye on Twitter sentiment for anything you’re interested in.






Text Analysis API - Sign up





Last month, our News API gathered, analyzed, and indexed over 2.3 million news stories in near-real time, giving us the ability to spot trends in what the world’s media is talking about and dive into specific topics to understand how they developed over time.

In this blog, we’re going to look into two interesting events from September which gathered a lot of media attention:

  1. The launch of the iPhone X.
  2. Ryanair’s ongoing cancellations problem.

September Media Roundup with the AYLIEN News API

The iPhone X Launch

Looking into stories in the technology category, our News API detected a spike in stories published on the 12th of September. You can see that almost 5,000 stories about “Technology” were published that day – an increase of over 20% on the average weekday. A large portion of these stories covered the launch of the highly anticipated iPhone X.


How did the media react to the iPhone X launch?

Knowing that the launch of the iPhone X caused a spike in media interest is great because it lets us measure the hype associated with the event. But using the News API, we can go a step further and dig deeper into the content to better understand the media reaction. To do this, we used the Trends endpoint to analyze which entities were mentioned in all news stories.

Using the Trends endpoint, our users can make an unlimited number of queries about quantitative trends in news content, making it easy to spot trending topics in a collection of documents. For example, take a look at what entities were mentioned most in stories about the iPhone X.


On the chart above, you can see that the media mentioned two types of entities most. The most-mentioned entities are somewhat obvious and expected (Apple and iPhone), but the articles also mention entities like Tim Cook and Cupertino, which help set the stories in context: the “who, what, and where” of the story.

After these most popular entities, we can group some slightly lesser-mentioned entities together; these are competitor- and product-focused (Samsung, the S8, the Apple Watch). The prominence of these entities shows that the media was very interested in talking about the iPhone X in the context of how it added to Apple’s product offering, and how this offering compared to its competitors.

How did the reaction on social media compare?

So when it came to the iPhone X, the media were talking about Apple and its competitors, but what were people online talking about? We decided to compare the media coverage from the News API with reaction on Twitter to try and gauge the customer reaction to the launch. To do this, we used the Twitter API to gather 10,000 Tweets and our Text API to extract the entities mentioned in every Tweet.
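
A condensed sketch of that pipeline, reusing the Tweepy and AYLIEN Text API setup from the scripts in our other posts; the query, Tweet count, and credentials here are placeholders:

from collections import Counter

import tweepy
from aylienapiclient import textapi

# reuse the Twitter and AYLIEN setup from the earlier scripts
auth = tweepy.OAuthHandler("your consumer key here", "your secret consumer key here")
auth.set_access_token("your access token here", "your secret access token here")
api = tweepy.API(auth)
client = textapi.Client("your app id here", "your app key here")

entity_counts = Counter()

# page through recent Tweets about the launch and tally the entities in each;
# Cursor keeps requesting batches until it has collected 500 Tweets
for tweet in tweepy.Cursor(api.search, q="iPhone X -rt", lang="en").items(500):
    response = client.Entities({'text': tweet.text})
    for keyword in response['entities'].get('keyword', []):
        entity_counts[keyword] += 1

print(entity_counts.most_common(20))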

On Twitter, the 140-character limit means people have to jump right into what they want to say, so you can see right away how the content of Tweets differed from the content of news stories.


You can see here that the conversation we found on Twitter was more focused on the product itself than on the business implications for Apple. You can see this in the fact that RAM is the most-mentioned entity (the iPhone X has no increase in RAM over previous iPhones), while OLED, iOS, and the Galileo GPS system are much more prevalent here than in the News API results.

So from this data, you can see that Twitter users were more focused on the product itself – making them a good insight into the voice of the customer, whereas the media were more focused on producing insights into the business implications of the phone.

 

Ryanair’s bad PR Month (even by Ryanair standards)

In a previous blog, we used our News API and our Text API to analyze how Ryanair handled the initial announcement of their flight cancellations disaster. As we wrote that blog only days after the announcement of the cancellations, we decided to check in with the airline again to see how they’ve fared since then.

Below, you can see the volume of stories published about Ryanair in September and their sentiment. The first spike in extremely negative press covers the weekend that the airline announced the cancellations, and the second spike is the coverage of the announcement of further cancellations.


Was it only the Cancellations that Brought Coverage for Ryanair?

In the chart above you can also see a spike in stories about Ryanair on the Thursday following the initial announcement. To find out what all of this coverage was about, we used the Trends endpoint of the News API to find out what the most-mentioned entities in Ryanair stories were in the week following that big spike caused by the cancellations announcement.


You can see above that all of the most-mentioned entities are aviation- or Ryanair-related. This is useful, as it gives us insight into what other people and places were talked about in the Ryanair coverage, but it doesn’t give us an insight into what the story spike was about. To do that, we’ll analyze the same time period with the keywords feature.

The keywords feature produces a different kind of insight into text data: whereas the entities feature returns specific things like people, locations, and products, the keywords feature returns mentions of things in general, like ‘flights,’ ‘week,’ and ‘pilots’. Take a look at the chart below to see how the insights differ from what the entities endpoint returned on the same data.


You can see in the chart above that the keyword ‘pilots’ is more prevalent than ‘cancellations,’ so we can guess that in the days after interest in the cancellations died down, the media became more interested in the story about pilots leaving Ryanair en masse than in the cancellations themselves.

To test this idea out, we used the News API’s Time Series endpoint to compare mentions of each of these keywords in stories with ‘Ryanair’ in the title. You can see that although the two issues were covered a lot together, coverage of Ryanair’s pilot trouble was more popular in between Ryanair’s two cancellation announcements. This shows that despite the huge coverage on passenger outrage, the media focused on the more troubling business news for Ryanair – its toxic relationship with its own pilots.


So that concludes our quick analysis of last month’s news with the News API. If you’re interested in using our AI-powered text analysis engine to analyze news content for your own solution, click on the image below and sign up for a free trial!






News API - Sign up






Just six months on from our last blog tracking growth here at AYLIEN, we have more updates to keep you posted on. Since then, we’ve added five more to the team – two scientists, an office manager, an engineer, and a business development manager.

We’re now a team of 17, comprising eight nationalities, and we’re really excited to share this new step we’re taking. Meet the newest additions to the AYLIEN team!

Chris – Research Scientist


Chris has been with us part-time for a few months now while he finished his PhD, where he designed neural networks for machine translation, and now he’s joining us full-time to advance our understanding of entity linking. Starting as a Music and German undergrad, he dived into Computational Linguistics after his Master’s degree and hasn’t looked back since. Take a look at his research publications or follow him on Twitter.

A native of Denton, Texas, Chris is also a keen outdoorsman and quite an accomplished percussionist – he usually plays in the Grand Social on Mondays with Polish folk collective The Supertonic Orchestra.

 

Damien – Business Development Manager


Joining us after lecturing in Music in the British and Irish Modern Music Institute, Damien comes on board to help our customers get the most out of AYLIEN’s products. Damien is going to use the experience he gathered over eight years in sales and customer support in tech companies.

Outside of AYLIEN, Damien is also a professional composer, scoring films and video games, he plays alto sax in a Ska/Punk band called Bocs Social, and writes for a blog he started, Audio Dexterous. Damien and Chris bring the number of serious musicians at AYLIEN to three (we’ve all heard @anotherjohng whistling in the lift).

 

Francesca – Office Manager


Francesca is taking over everything admin-related in AYLIEN. In the six weeks she’s been with us, she has become the go-to person for most things that happen in the office, and has earned the undying gratitude of some of the team by helping them deal with the Irish Naturalisation and Immigration Service, which can be a little… let’s say, tiring. But having been on the organisation board of PyCon Italy for ten years, keeping things running in a tech company is nothing new for Francesca.

Coming from Parma (yes, the home of the ham and the cheese), Francesca did a degree in Food Science and Technology, and she is an avid reader and board gamer.

 

Ian – Postdoctoral Fellow


The first Aussie on the team, Ian is also our first postdoc from academia to be based in the AYLIEN office, and the third member of the research team holding a PhD. He is our first Science Foundation Ireland Industry Fellow, and we covered his placement in a blog post last month. His research focuses on building neural networks that analyze emotion in text.

Outside of AYLIEN, Ian has done a little sailing (just across the Atlantic, no big deal), speaks quite a few languages (Czech, Spanish, and Italian, but can get by in French, Hindi, Turkish and German), and is a pretty decent cook.

 

Sven – Site Reliability Engineer


Sven joins us to take responsibility for the infrastructure of every part of AYLIEN – making everything as reliable and resilient as possible. When he’s not hunting down potential failure points and eliminating manual processes in AYLIEN’s tech, Sven is usually contributing to open-source software and coding interesting projects, and you can take a look at his code on GitHub.

A Madrid native of German origin, Sven cycles to work via the gym every morning, so by 9AM he’s done more exercise than the rest of us combined! Having been around the world quite a bit, he’s now getting around Ireland bit by bit – a winter trip to Connemara is up next!

 

So that’s the update for the summer. We’re growing quickly across research, engineering, and sales, so if you think you’d work well on the team, drop us a line at jobs@aylien.com. If you’re an NLP or machine learning person with a research idea, check out our research page – maybe there’s something for us to work on together.


For years Ryanair and Michael O’Leary have handled the media with a deft touch. Think of all the free advertising Michael O’Leary has garnered for the company, and how they have dealt with extraordinarily negative opinion of their service offering while growing their business into one of the biggest airlines in Europe.

But last week, Ryanair were dealt an incredible blow when they had to immediately cancel around 1,900 flights due to an administrative problem on their part. This is one of the biggest PR challenges they have faced, so we wanted to see how effectively Ryanair handled the news. With our APIs, we can use text analysis to measure the reaction to the announcement, both in the news and on social media.

 


 

To do this, we collected just under 600 news stories about the cancellations, and 30,000 Tweets mentioning Ryanair over the past week. We’ve isolated the following specific questions we’re going to answer:

  1. Ryanair announced the news late on Friday, an old PR trick to minimize negative coverage. How effective was this in the news cycle?
  2. Does this old-school PR trick have an effect on the social media reaction?
  3. Exactly how negative was the press coverage that Ryanair was trying to minimize?
  4. How many of the Tweets mentioning Ryanair were affected by the cancellations and how many were just jumping on the bandwagon?

 

1. Did Ryanair’s Friday evening press dump work?

Ryanair first started cancelling flights on Friday morning, but decided not to officially announce the news until later that evening, which is a common PR trick – announce bad news late on a Friday so the press coverage is affected by the weekend lull, hopefully resulting in less coverage.
 

Using the News API, we can see that this strategy worked pretty well for Ryanair. To understand how much coverage the news got, we tracked the volume of news stories published over the weekend using our Time Series endpoint. We also wanted to see how the news spread across social media, so we collected Tweets directed at Ryanair using the Twitter API over the same period. In total we collected just under 30,000 Tweets and about 600 news articles. As all of our data points were time-stamped, it was easy to plot them side by side on one chart and compare the volume from each channel. Take a look below:

You can see above that announcing the news late on Friday meant that while the conversation about Ryanair took off on Twitter over the weekend, it took the press until Monday to catch up. It’s important to note here that on Monday there was no new news to be released except a detailed list of the flights affected – the press coverage on Monday was essentially working off old news. So by the comparatively small number of stories over the weekend, we can see that Ryanair successfully got out ahead of the story and minimized the immediate impact.

But if people are talking about the cancellations online, what does it matter if there were fewer stories in the press?

 

2. This old-school PR trick is important for PR on Facebook too

The previous chart shows that although there was a huge amount of chatter about the Ryanair cancellations on social media, at the same time there were fewer news stories being written. The obvious implication here is that there were fewer stories for social media users to share – more journalists off for the weekend meant fewer stories were appearing in people’s news feeds.
 

So to put this idea to the test, we used the Trends endpoint of the News API to gather the 10 most-shared stories about Ryanair on Facebook on Friday, Saturday, and Sunday, and counted how many times they were shared. Take a look at how many more shares the top stories got on Monday than on Saturday, even though the news was three days old at that point! With this information, we think it’s a good bet that stories about Ryanair got many more shares on Monday because more people had stories in their news feeds.

To be exact, Monday’s top 10 most-shared stories were shared over 43,000 times more than Saturday’s 10 most-shared stories, simply because news publishers were back in full swing (remember – there was no new important information). This implies that however bad the coverage was on Monday (and it was bad, as we’ll see) the old PR trick of dumping stories on a Friday evening also has an effect on the spread of news on social media. It’s difficult to quantify the combined reach of these extra 43,000 shares, but that’s a lot of extra negative publicity that Ryanair avoided!

We thought this was interesting – although social media is a huge disruptor of the publishing industry, the fact that the old Friday evening press dump works on Facebook sharing tells us that traditional journalism still guides what people are talking about on social media.

 

3: How negative was the press coverage that Ryanair was trying to minimize?

So far we’ve assumed that the stories about Ryanair were largely negative, but we haven’t looked into how negative they were. Using the Trends endpoint of the News API, we can do just that, and we found that 85% of stories about Ryanair had a negative tone last week. Take a look at how the sentiment changed over each day last week. You can see it gets extremely negative from Friday onwards.

 

4: Did people ‘swarm’ the bad news on Twitter?

We pointed out that there was a massive spike in mentions of Ryanair on Twitter on Saturday, before the media could cover the story as extensively as they would on a weekday. This gave us an interesting opportunity – sometimes PR research can be hampered by swarming, which is when people unaffected by a problem jump on the bandwagon and add to the negative press.

So we came up with a way to separate affected customers from people swarming in this example. To identify those actually affected by the events and those who were jumping on the bandwagon, we split the Twitter users into those who mentioned flight details, like locations of departure or arrival, and those who didn’t. We made the distinction that those who hadn’t mentioned any specific details about their flight were more than likely swarmers, while those who gave specifics were actually affected by the cancellations.
 
Take a look at these two Tweets as an example, one is a customer affected by cancellations and another is just someone who wanted to say something negative about Ryanair after the news broke.




Our Text API’s Concept Extraction feature allows users to extract any mention of organizations, people, products, and locations. Using this capability, we decided to see how many Tweets between Friday evening and Saturday afternoon (after the news broke) mentioned a location, how many only mentioned Ryanair, and how many were talking about any other concept.
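
Here is a rough sketch of how that bucketing could work with the Text API’s Concepts call; the response layout follows the Text API docs, while the classification logic and the example Tweets are our own:

from collections import Counter

from aylienapiclient import textapi

client = textapi.Client("your app id here", "your app key here")

# two made-up example Tweets; in practice this is the full collection
tweets = [
    "@Ryanair my flight from Stansted to Dublin was cancelled!",
    "@Ryanair you have really outdone yourselves this time",
]

counts = Counter()

for text in tweets:
    response = client.Concepts({'text': text})
    concepts = response.get('concepts') or {}
    # gather the DBpedia types of every concept found in the Tweet
    types = {t for concept in concepts.values() for t in concept.get('types', [])}
    if 'http://dbpedia.org/ontology/Place' in types:
        counts['mentions a location'] += 1
    elif set(concepts) == {'http://dbpedia.org/resource/Ryanair'}:
        counts['mentions only Ryanair'] += 1
    else:
        counts['other concepts'] += 1

print(counts)

The chart below shows the resulting breakdown: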



The Concepts feature lets us dive further into the data by telling us exactly which concepts were mentioned. We’ll take a look at everything we’ve grouped into ‘other concepts’ shortly, but first let’s see which locations people were talking about in the Tweets with the @Ryanair handle – you can see they correlate with the airports affected by the cancellations:



We then compared the data on what people were talking about on Friday with what they were talking about on Monday, to see what changed when the press were covering the story in full. You can see that mentions of Ryanair alone dropped significantly, while mentions of all other concepts rose correspondingly. We’re guessing the mentions of locations stayed the same because the only new announcement on Monday was a detailed list of the flights that were cancelled. Take a look at the Twitter conversation about Ryanair on Monday:



In the chart above, you can see the rise of what we’ve labeled ‘other concepts’. We can use the Text API to see what these are. Take a look:



So that’s our AI-powered layman’s analysis of the coverage of the Ryanair cancellations PR disaster. We think that although the press coverage was extremely negative, Ryanair did a pretty good job of mitigating its volume.

If you want to try out our APIs, take a look at the demos or sign up for a trial – the News API has a two-week free trial, and the Text API has a free plan that lets you analyze 1,000 pieces of text per month at no cost.

 






News API - Sign up






Four members of our research team spent the past week at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2017) in Copenhagen, Denmark. The conference handbook can be found here and the proceedings can be found here.

The program consisted of two days of workshops and tutorials and three days of main conference. Videos of the conference talks and presentations can be found here. The conference was superbly organized, had a great venue, and featured a social event with fireworks.


Figure 1: Fireworks at the social event

With 225 long papers, 107 short papers, and 9 TACL papers accepted, there was a clear uptick in submissions compared to last year. For the first time in the last 13 years, the number of long and short paper submissions to EMNLP was even higher than at ACL, as can be seen in Figure 2.


Figure 2: Long and short paper submissions at ACL and EMNLP from 2004-2017

In the following, we will outline our highlights and list some research papers that caught our eye. We will first list overall themes and will then touch upon specific research topics that are in line with our areas of focus. Also, we’re proud to say that we had four papers accepted to the conference and workshops this year! If you want to see the AYLIEN team’s research, check out the research sections of our website and our blog. With that said, let’s jump in!

 

Exciting Datasets

Evaluating your approach on CoNLL-2003 or PTB is appropriate for comparing against previous state-of-the-art, but kind of boring. The two following papers introduce datasets that allow you to test your model in more exciting settings:

  • Durrett et al. release a new domain adaptation dataset. The dataset evaluates models on their ability to identify products being bought and sold in online cybercrime forums.  
  • Kutuzov et al. evaluate their word embedding model on a new dataset that focuses on predicting insurgent armed groups based on geographical locations.
  • While he did not introduce a new dataset, Nando de Freitas made the point during his keynote that the best environment for learning and evaluating language is simulation.


Figure 3: Nando de Freitas’ vision for AI research

Return of the Clusters

Brown clusters – an agglomerative, hierarchical clustering of word types based on context, introduced in 1992 – seem to be coming back into vogue. They were found to be particularly helpful for cross-lingual applications, and clusters were key features in several approaches:

  • Mayhew et al. found that Brown cluster features were an important signal for cross-lingual NER.
  • Botha et al. use word clusters as a key feature in their small, efficient feed-forward neural networks.
  • Mekala et al.’s new document representations cluster word embeddings, which gives them an edge for text classification.
  • In his talk at the SCLeM workshop, Noah Smith cites the benefits of using Brown clusters as features for tasks such as POS tagging and sentiment analysis.


Figure 4: Noah Smith on the benefits of clustering in his invited talk at the SCLeM workshop

Distant Supervision

Distant supervision can be leveraged to collect large amounts of noisy training data, which can be useful in many applications. Some papers used novel forms of distant supervision to create new corpora or to train a model more effectively:

  • Lan et al. use URLs in tweets to collect a large corpus of paraphrase data. Paraphrase data is usually hard to create, so this approach facilitates the process significantly and enables a continuously expanding collection of paraphrases.
  • Felbo et al. show that training on fine-grained emoji detection is more effective for pre-training sentiment and emotion models. Previous approaches primarily pre-trained on positive and negative emoticons or emotion hashtags.

Data Selection

The current generation of deep learning models is excellent at learning from data. However, we often do not pay much attention to the actual data our model is using. In many settings, we can improve upon the model by selecting the most relevant data:

  • Fang et al. reframe active learning as reinforcement learning and explicitly learn a data selection policy. Active learning is one of the best ways to create a model with as few annotations as possible; any improvement to this process is beneficial.
  • Van der Wees et al. introduce dynamic data selection for NMT, which varies the selected subset of the training data between different training epochs. This approach has the potential to reduce the training time of NMT models at comparable or better performance.
  • Ruder and Plank use Bayesian Optimization to learn data selection policies for transfer learning and investigate how well these transfer across models, domains, and tasks. This approach brings us a step closer towards gaining a better understanding of what constitutes similarity between different tasks and domains.

Character-level Models

Characters are nowadays used as standard features in most sequence models. The Subword and Character-level Models in NLP workshop discussed approaches in more detail, with invited talks on subword language models and character-level NMT.

  • Schmaltz et al. find that character-based sequence-to-sequence models outperform word-based models and models with character convolutions for sentence correction.
  • Ryan Cotterell gave a great, movie-inspired tutorial on combining the best of FSTs (cowboys) and sequence-to-sequence models (aliens) for string-to-string transduction. While evaluated on morphological segmentation, the tutorial made the point in an entertaining way that the best of both worlds – a combination of traditional and neural approaches – often performs best.


Figure 5: Ryan Cotterell on combining FSTs and seq2seq models for string-to-string transduction

Word Embeddings

Research in word embeddings has matured and now mainly tries to 1) address deficits of word2vec, such as its handling of out-of-vocabulary (OOV) words; 2) extend it to new settings, e.g. modelling the relations of words over time; and 3) understand the induced representations better:

  • Pinter et al. propose an approach for generating OOV word embeddings by training a character-based BiLSTM to generate embeddings that are close to pre-trained ones. This approach is promising as it provides us with a more sophisticated way to deal with out-of-vocabulary words than replacing them with an <UNK> token.
  • Herbelot and Baroni slightly modify word2vec to allow it to learn embeddings for OOV words from very little data.
  • Rosin et al. propose a model for analyzing when two words relate to each other.
  • Kutuzov et al. propose another model that analyzes how two words relate to each other over time.
  • Hasan and Curry improve the performance of word embeddings on word similarity tasks by re-embedding them in a manifold.
  • Yang et al. introduce a simple approach to learning cross-domain word embeddings. Creating embeddings tuned on a small, in-domain corpus is still a challenge, so it is nice to see more approaches addressing this pain point.
  • Mimno and Thompson try to understand the geometry of word2vec better. They show that the learned word embeddings are positioned diametrically opposite of their context vectors in the embedding space.

Cross-lingual transfer

An increasing number of papers evaluate their methods on multiple languages. In addition, there was an excellent tutorial on cross-lingual word representations, which summarized and tried to unify much of the existing literature. Slides of the tutorial are available here.

  • Malaviya et al. train a many-to-one NMT system to translate 1,017 languages into English and use this model to predict information missing from typological databases.
  • Mayhew et al. introduce a cheap translation method for cross-lingual NER that only requires a bilingual dictionary. They even perform a case study on Uyghur, a truly low-resource language.
  • Kim et al. present a cross-lingual transfer learning model for POS tagging without parallel data. Parallel data is expensive to create and rarely available for low-resource languages, so this approach fills an important need.
  • Vulic et al. propose a new cross-lingual transfer method for inducing VerbNets for different languages. The method leverages vector space specialisation, an effective word embedding post-processing technique similar to retro-fitting.
  • Braud et al. propose a robust, cross-lingual discourse segmentation model that only relies on POS tags. They show that dependency information is less useful than expected; it is important to evaluate our models on multiple languages, so we do not overfit to features that are specific to analytic languages, such as English.


Figure 6: Anders Søgaard demonstrating the similarities between different cross-lingual embedding models at the cross-lingual representations tutorial

Summarization

The Workshop on New Frontiers of Summarization brought researchers together to discuss key issues related to automatic summarization. Much of the research on summarization sought to develop new datasets and tasks:

  • Katja Filippova (Google Research, Switzerland) gave an interesting talk on sentence compression and passage summarization for Q&A. She described how they went from syntax-based methods to Deep Learning.
  • Völske et al. created a new summarization corpus by looking for ‘TL;DR’ on Reddit. This is another example of a creative use of distant supervision, leveraging information that is already contained in the data in order to create a new corpus.
  • Falke and Gurevych won the best resource paper award for creating a new summary corpus that is based on concept maps rather than textual summaries. The concept map can be explored using a graph-based document exploration system, which is available as a demo here.
  • Pasunuru et al. use multi-task learning to improve abstractive summarization by leveraging entailment generation.
  • Isonuma et al. also use multi-task learning with document classification in conjunction with curriculum learning.
  • Li et al. propose a new task, reader-aware multi-document summarization, which uses comments of articles, along with a dataset for this task.
  • Narayan et al. propose another new task, split and rephrase, which aims to split a complex sentence into a sequence of shorter sentences with the same meaning, and also release a new dataset.
  • Ghalandari revisits the traditional centroid-based method and proposes a new strong baseline for multi-document summarization.

Bias

Data and model-inherent bias is an issue that is receiving more attention in the community. Some papers investigate and propose methods to address the bias in certain datasets and evaluations:

  • Chaganty et al. investigate bias in the evaluation of knowledge base population models and propose an importance sampling-based evaluation to mitigate the bias.
  • Dan Jurafsky gave a truly insightful keynote about his three-year-long study analyzing the body camera recordings his team obtained from the Oakland police department for racial bias. Besides describing the first contemporary linguistic study of officer-community member interaction, he also provided entertaining insights on the language of food (cheaper restaurants use terms related to addiction; more expensive venues use language related to indulgence) and the challenges of interdisciplinary publishing.
  • Dubossarsky et al. analyze the bias in word representation models and argue that recently proposed laws of semantic change must be revised.
  • Zhao et al. won the best paper award for an approach using Lagrangian relaxation to inject constraints based on corpus-level label statistics. An important finding of their work is bias amplification: while some bias is inherent in all datasets, they observed that models trained on the data amplified that bias. A gendered dataset might contain women in only 30% of examples, but the situation at prediction time might be even more dire.


Figure 7: Zhao et al.’s proposed method for reducing bias amplification

Argument mining & debate analysis

Argument mining is closely related to summarization: in order to summarize argumentative texts, we have to understand claims and their justifications. The 4th Workshop on Argument Mining was dedicated to this research area:

  • Hidey et al. analyse the semantic types of claims (e.g. agreement, interpretation) and premises (ethos, logos, pathos) in the Subreddit Change My View. This is another creative use of reddit to create a dataset and analyze linguistic patterns.
  • Wachsmuth et al. presented an argument web search engine, which can be queried here.
  • Potash and Rumshisky predict the winner of debates based on audience favorability.
  • Swamy et al. also forecast winners for the Oscars, the US presidential primaries, and many other contests based on user predictions on Twitter. They create a dataset to test their approach.
  • Zhang et al. analyze the rhetorical role of questions in discourse.
  • Liu et al. show that argument-based features are also helpful for predicting review helpfulness.

Multi-agent communication

Multi-agent communication is a niche topic, which has nevertheless received some recent interest, notably in the representation learning community. Most papers deal with a scenario where two agents play a communicative referential game. The task is interesting, as the agents are required to cooperate and have been observed to develop a common pseudo-language in the process.

  • Andreas and Klein investigate the structure encoded by RNN representations for messages in a communication game. They find that the mistakes are similar to the ones made by humans. In addition, they find that negation is encoded as a linear relationship in the vector space.
  • Kottur et al. show in their best short paper that language does not emerge naturally when two agents are cooperating, but that they can be coerced to develop compositional expressions.


Figure 8: The multi-agent setup in the paper of Kottur et al.

Relation extraction

Extracting relations from documents is more compelling than simply extracting entities or concepts. Some papers improve upon existing approaches using better distant supervision or adversarial training:

  • Liu et al. reduce the noise in distantly supervised relation extraction with a soft-label method.
  • Zhang et al. publish TACRED, a large supervised dataset for knowledge base population, as well as a new model.
  • Wu et al. improve the precision of relation extraction with adversarial training.

Document and sentence representations

Learning better sentence representations is closely related to learning more general word representations. While word embeddings still have to be contextualized, sentence representations are promising as they can be directly applied to many different tasks:

  • Mekala et al. propose a novel technique for building document vectors from word embeddings, with good results for text classification. They use a combination of adding and concatenating word embeddings to represent multiple topics of a document, based on word clusters.
  • Conneau et al. learn sentence representations from the SNLI dataset and evaluate them on 12 different tasks.

These were our highlights. Naturally, we weren’t able to attend every session and see every paper. What were your highlights from the conference or which papers from the proceedings did you like most? Let us know in the comments below.




Text Analysis API - Sign up





Welcome to the seventh in a series of blog posts in which we use the News API to look into the previous month’s news content. The News API collected and indexed over 2.5 million stories published last month, and in this blog we’re going to use its analytic capabilities to discover trends in what the media wrote about.

We’ve picked two of the biggest stories from last month, and using just three of the News API’s endpoints (Stories, Trends, and Time Series), we’re going to cover the following two topics:

  1. The conflict brewing with a nuclear-armed North Korea and the US
  2. The ‘fight of the century’ between Conor McGregor and Floyd Mayweather

In covering both of these topics we uncovered some interesting insights. First, apparently we’re much more interested in Donald Trump’s musings on nuclear war than the threat of nuclear war itself. Also, although the McGregor fight lived up to the hype, Conor failed to capitalize on the record-breaking press coverage to launch his ‘Notorious Whiskey’.

1. North Korea

Last month, North Korea detonated a hydrogen bomb over seven times more powerful than any of its previous tests, heightening worries that conflict with a nuclear-armed nation is now likely. But using the News API, we can see that in the English-speaking world, even with such a threat looming, we still just can’t get enough of Donald Trump.

Take a look below at the daily volume of stories with ‘North Korea’ in the title, which we gathered with the News API’s Time Series endpoint. You can see that the English-speaking media were much more interested in Trump’s ‘fire and fury’ comment at the start of August than in North Korea actually detonating a hydrogen bomb at the start of September.
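If you’d like to reproduce the chart’s underlying data, it’s a single call to the Time Series endpoint. Below is a minimal Python sketch; the credentials are placeholders, and the date-math strings are illustrative assumptions about the API’s query syntax:

```python
import requests

# Hedged sketch: daily counts of stories with "North Korea" in the title
# over roughly the past month, via the News API Time Series endpoint.
resp = requests.get(
    "https://api.newsapi.aylien.com/api/v1/time_series",
    headers={
        "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",   # placeholder
        "X-AYLIEN-NewsAPI-Application-Key": "YOUR_API_KEY",
    },
    params={
        "title": "North Korea",
        "period": "+1DAY",                     # one bucket per day
        "published_at.start": "NOW-31DAYS/DAY",
        "published_at.end": "NOW/DAY",
    },
)

# Print one (date, story count) pair per day.
for point in resp.json().get("time_series", []):
    print(point["published_at"], point["count"])
```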



We guessed that this is largely due to publishers trying to keep up with the public’s insatiable appetite for any Donald Trump-related news. Using the News API, we can put this idea to the test by analyzing which content about North Korea people shared the most over August.

We used the Stories endpoint of the News API to look at the stories containing ‘Korea’ in the title that had the highest engagement across social networks. The content people are most likely to recommend in their social circles gives a strong indication of readers’ opinions and interests. Take a look below at the most-shared stories across Facebook and Reddit. You can see that the popular content varies across the different networks.
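Under the hood, this is just a Stories query sorted by share counts. Here’s a hedged Python sketch; the sort_by value is our assumption about how the API names its social-share fields, and the credentials are placeholders:

```python
import requests

# Hedged sketch: the three "Korea" stories with the most Facebook shares.
resp = requests.get(
    "https://api.newsapi.aylien.com/api/v1/stories",
    headers={
        "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",   # placeholder
        "X-AYLIEN-NewsAPI-Application-Key": "YOUR_API_KEY",
    },
    params={
        "title": "Korea",
        "sort_by": "social_shares_count.facebook",  # assumption about field name
        "per_page": 3,
    },
)

for story in resp.json().get("stories", []):
    print(story["title"], "-", story["source"]["name"])
```

The same query with a Reddit-based sort field would give the second list below.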

Facebook:


 

  1. Trump to North Korea: U.S. Ready to Respond With ‘Fire and Fury’, The Washington Post. 118,312 shares.
  2. China warns North Korea: You’re on your own if you go after the U.S., The Washington Post. 94,818 shares.
  3. Trump threatens ‘fury’ against N Korea, BBC. 69,098 shares.

Reddit:


 

  1. Japanese government warns North Korea missile headed toward northern Japan, CNBC. 119,075 upvotes.
  2. North Korea shaken by strong tremors in likely nuclear test, CNBC. 61,088 upvotes.
  3. Japan, US look to cut off North Korea’s oil supply, Nikkei Asian Review. 59,725 upvotes.

Comparing coverage across these two social networks, you can see that Trump features heavily in the most popular content about Korea on Facebook, while the most-upvoted content on Reddit tended to be breaking news with a more neutral tone. This is similar to the patterns we observed with the News API in a previous media review blog, which showed that Reddit was much more focused than Facebook on breaking stories about popular topics.

So now that we know the media focused its attention on Donald Trump, we can ask ourselves, what were all of these stories about? Were these stories talking the President down, like he always claims? Or were they positive? Using the sentiment feature of the News API’s Trends endpoint, we can dive into the stories that had both ‘Trump’ and ‘Korea’ in the title, and see what sentiment is expressed in the body of the text.

From the results below, you can see that over 50% of these articles contained negative sentiment, whereas a little over 30% had a positive tone. For all of the President’s – shall we say, questionable – claims, he’s right about one thing: popular content about how he responds to issues is predominantly negative.


2. The Superfight – how big was it?

We’re based in Ireland, so having Conor McGregor of all people taking part in the ‘fight of the century’ last month meant that we heard about pretty much nothing else. We can use the News API to put some metrics on all of the hype and see how the coverage compared with that of other sporting events. Using the Time Series endpoint, we analyzed the impact of the fight on the volume of stories last month. Since the News API analyzes the content of every news story it gathers, it can show us how the volume of stories about any subject fluctuates over time.

Take a look at how the volume of stories about boxing skyrocketed in the build up to and on the weekend of the fight:



You can see that on the day of the fight itself, the volume of stories that the News API classified as being about boxing increased almost tenfold.

To figure out just how big this hype was in the boxing world, we compared the volume of stories published about boxing in the period surrounding the ‘fight of the century’ with the volume around another heavily hyped bout, last April’s WBA/IBF world heavyweight title fight between Anthony Joshua and Wladimir Klitschko. To do this, we analyzed the story volumes from the two weeks before and after each fight and plotted them side by side. This allows us to easily compare the media coverage on the day of each fight as well as in its build-up and aftermath. Take a look at the results below:



You can see that the McGregor-Mayweather fight totally eclipses the Joshua-Klitschko heavyweight title fight. But it’s important to give context to the data on this hype by comparing it with data from other sports.
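For the curious, here’s a hedged sketch of how such a side-by-side comparison can be scripted against the Time Series endpoint, using the fight dates of 26 August and 29 April. The category filter is an assumption – the real API may expect taxonomy codes rather than a plain label – and the credentials are placeholders:

```python
import requests

# Hedged sketch: pull daily boxing story counts for a window around each
# fight, then pair the two series by day offset relative to fight day.
def boxing_volume(start, end):
    resp = requests.get(
        "https://api.newsapi.aylien.com/api/v1/time_series",
        headers={
            "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",   # placeholder
            "X-AYLIEN-NewsAPI-Application-Key": "YOUR_API_KEY",
        },
        params={
            "categories.label": "Boxing",   # assumption: may need category codes
            "period": "+1DAY",
            "published_at.start": start,
            "published_at.end": end,
        },
    )
    return [p["count"] for p in resp.json().get("time_series", [])]

# Two weeks either side of each fight (26 Aug 2017 and 29 Apr 2017).
mcgregor = boxing_volume("2017-08-12T00:00:00Z", "2017-09-09T00:00:00Z")
klitschko = boxing_volume("2017-04-15T00:00:00Z", "2017-05-13T00:00:00Z")

for offset, (a, b) in enumerate(zip(mcgregor, klitschko), start=-14):
    print(f"day {offset:+d}: McGregor-Mayweather {a}, Joshua-Klitschko {b}")
```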

It’s becoming almost a point of reference on these News API media review blogs to compare any trending stories to stories in the World Soccer category. This is because the daily volume of soccer stories tends to be consistently the largest of all categories, so it’s a nice baseline to use to compare story volumes. As you can see below, the hype surrounding the ‘fight of the century’ even prompted more boxing stories than soccer stories, which is quite a feat. Notice how only four days after the fight, when boxing was back to its normal level and soccer stories were increasing due to European transfer deadline day looming, there were 2,876 stories about soccer compared with 191 stories about boxing.

You might remember that Conor McGregor launched his ‘Notorious Whiskey’ at the press conference following the fight. This was the perfect time for McGregor to announce a new product – right at the pinnacle of the media coverage. If you’re wondering how well he leveraged this phenomenal level of publicity for his new distilling career, we used the News API to look into that too. Take a look below at the volume of stories that mentioned the new whiskey brand. It looks like mentions of ‘Notorious Whiskey’ have disappeared totally since the weekend of the fight, leaving us with this odd-looking bar chart. But we doubt that will bother Conor at the moment, considering the $100m payday!

That covers our quick look into the News API’s data on two of last month’s biggest stories. The News API gathers over 100,000 stories per day and indexes them in near-real time, giving you a stream of enriched news data that you can query. So try out the demo or click on the link below for a free two-week trial of our APIs.






News API - Sign up




