
Last week, Snapchat unveiled a major redesign of their app that received quite a bit of negative feedback. As a video-sharing platform that has integrated itself into users’ daily lives, Snapchat relies on simplicity and ease of use. So when large numbers of these users begin to express pretty serious frustration about the app’s new design, it’s a big threat to their business.


You can bet that right now Snapchat is analyzing exactly how big a threat this backlash is by monitoring the conversation online. This is a perfect example of businesses leveraging the Voice of the Customer with tools like Natural Language Processing. Businesses that track their product’s reputation online can quantify how serious events like this are and make informed decisions on their next steps. In this blog, we’ll give a couple of examples of how you can dive into online chatter and extract important insights on customer opinion.


This TechCrunch article pointed out that 83% of Google Play Store reviews in the immediate aftermath of the update gave the app one or two stars. But as we mentioned in a blog last week, star rating systems aren’t enough – they don’t tell you why people feel the way they do and most of the time people base their star rating on a lot more than how they felt about a product or service.

To get accurate and in-depth insights, you need to understand exactly what a reviewer is positive or negative about, and to what degree they feel this way. This can only be done effectively with text mining.

So in this short blog, we’re going to use text mining to:

  1. Analyze a sample of the Play Store reviews to see what Snapchat users mentioned in reviews posted since the update.
  2. Gather and analyze a sample of 1,000 tweets mentioning “Snapchat update” to see if the reaction was similar on social media.

In each of these analyses, we’ll use the AYLIEN Text Analysis API, which comes with a free plan that’s ideal for testing it out on small datasets like the ones we’ll use in this post.

 

What did the app reviewers talk about?

As TechCrunch pointed out, 83% of reviews posted since the update shipped gave the app one or two stars, which gives us a high-level overview of the sentiment shown towards the redesign. But to dig deeper, we need to look into the reviews themselves and see what people were actually talking about.

As a sample, we gathered the 40 reviews readily available on the Google Play Store and saved them in a spreadsheet. We can analyze what people were talking about in them by using our Text Analysis API’s Entities feature. This feature analyzes a piece of text and extracts the people, places, organizations and things mentioned in it.

One of the types of entities returned to us is a list of keywords. To get a quick look into what the reviewers were talking about in a positive and negative light, we visualized the keywords extracted along with the average sentiment of the reviews they appeared in.

From the 40 reviews, our Text Analysis API extracted 498 unique keywords. Below you can see a visualization of the keywords extracted and the average sentiment of the reviews they appeared in from most positive (1) to most negative (-1).
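The aggregation behind this kind of chart is simple to sketch. Assuming each review has already been analyzed, so we have its keywords and a sentiment score between -1 and 1 (the keywords and scores below are invented for illustration):

```python
from collections import defaultdict

def average_sentiment_per_keyword(reviews):
    """reviews: list of dicts with 'keywords' (list of str) and
    'sentiment' (a float in [-1, 1], as plotted above).
    Returns each keyword's average review sentiment."""
    scores = defaultdict(list)
    for review in reviews:
        for kw in review["keywords"]:
            scores[kw].append(review["sentiment"])
    return {kw: sum(s) / len(s) for kw, s in scores.items()}

# Hypothetical stand-ins for real analysis results:
reviews = [
    {"keywords": ["bitmoji", "layout"], "sentiment": 0.8},
    {"keywords": ["layout", "stories"], "sentiment": -0.6},
]
print(average_sentiment_per_keyword(reviews))
```

A keyword that appears only in angry reviews ends up near -1; one that appears only in happy reviews ends up near 1.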

First of all, you’ll notice that keywords like “love” and “great” are high on the chart, while “frustrating” and “terrible” are low on the scale – which is what you’d expect. But if you look at keywords that refer to Snapchat, you’ll see that “Bitmoji” appears high on the chart, while “stories,” “layout,” and “unintuitive” all appear low on the chart, giving an insight into what Snapchat’s users were angry about.

 

How did Twitter react to the Snapchat update?

Twitter is such an accurate gauge of what the general public is talking about that the US Geological Survey uses it to monitor for earthquakes – because the speed at which people react to earthquakes on Twitter outpaces even their own seismic data feeds! So if people Tweet about earthquakes during the actual earthquakes, they are absolutely going to Tweet their opinions of Snapchat updates.

To get a snapshot of the Twitter conversation, we gathered 1,000 Tweets that mentioned the update. To gather the Tweets, we ran a search on Twitter using the Twitter Search API (this is really easy – take a look at our beginners’ guide to doing this in Python).

After we gathered our Tweets, we analyzed them with our Sentiment Analysis feature and as you can see, the Tweets were overwhelmingly negative:  
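Under the hood, the Sentiment Analysis feature returns a polarity (positive, negative, or neutral) for each document; tallying those labels into a breakdown like the one charted is straightforward. A minimal sketch with made-up labels (in practice each label would come from an API response):

```python
from collections import Counter

def sentiment_breakdown(labels):
    """Return each polarity's share of the total, as percentages."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}

# Hypothetical labels for ten tweets, standing in for real results:
labels = ["negative"] * 7 + ["neutral"] * 2 + ["positive"]
print(sentiment_breakdown(labels))  # negative dominates, as in our sample
```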

Quantifying the positive, negative, and neutral sentiment shown towards the update on Twitter is useful, but using text mining we can go one step further and extract the keywords mentioned in every one of these Tweets. To do this, we use the Text Analysis API’s Entities feature.

Disclaimer: this being Twitter, there was quite a bit of opinion expressed in a NSFW manner 😉

 

The number of expletives we identified as keywords reinforces the severity of the opinion expressed towards the update. You can see that “stories” and “story” are two of the few prominently-featured keywords that referred to feature updates while keywords like “awful” and “stupid” are good examples of the most-mentioned keywords in reaction to the update as a whole.

It’s clear that text mining processes like sentiment analysis and entity extraction can provide a detailed overview of public reaction to an event by extracting granular information from product reviews and social media chatter.

If you can think of insights you could extract with text mining about topics that matter to you, our Text Analysis API allows you to analyze 1,000 documents per day free of charge and getting started with our tools couldn’t be easier – click on the image below to sign up.






Text Analysis API - Sign up






Online review sites are the world’s repository of customer opinion – every day, hundreds of thousands of customers give publicly available feedback on their experiences with businesses. With customer opinion available on a scale like this, anyone can generate insights about their business, their competitors, and potential opportunities.

But to leverage these sites, you need to understand what is being talked about positively and negatively in the text of hundreds or thousands of reviews. Since analyzing that many reviews manually would be far too time consuming, most people don’t consider any kind of quantitative analysis beyond looking at the star ratings, which are vague and frequently misleading.

So in this blog, we’re going to show you how to use Text Mining to quickly generate accurate insights from thousands of reviews. For this blog, we’re going to scrape and analyze restaurant reviews from TripAdvisor and show you how easy it is to build a robust sentiment analysis workflow without writing any code using import.io and the AYLIEN Text Analysis Add-on for Google Sheets.

We’ll break the process down into three easy-to-follow steps:

  1. We’ll show you how to use import.io to scrape reviews from TripAdvisor
  2. We’ll use the AYLIEN Text API Google Sheets Add-on to analyze the sentiment expressed in each review toward 13 aspects of the dining experience.
  3. We’ll show you the results of our sample analysis

As we mentioned, neither of the tools we’ll use require coding skills, and you can use both of them for free.

 

Why are star reviews not enough on their own?

Take a look at the difference between these three-star reviews (which are for the same branch of the same restaurant chain):

 


 

From looking at these reviews, you can spot two important things with the star ratings and the review texts:

  1. Even though the star rating is the same, one of the reviews is positive, the other is negative. This gap between the star rating and what the reviewer really thought is part of the reason Netflix recently ditched the star review system.
  2. The text review allows you to see why the review is positive or negative – the specific aspects that made their dining experience positive or negative.

So to get an accurate analysis of customer opinion from reviews, you need to read the text of every review. The problem here is that doing this at scale is extremely time consuming and pretty much impossible. But we can solve this problem using Text Analytics and Machine Learning.

 

How to use import.io to scrape reviews from TripAdvisor

In order to find out what people are saying about businesses, we first need to gather the reviews. For this blog, we decided to analyze customer reviews of Texas Roadhouse, ranked by Business Insider as America’s best restaurant chain.

We chose to compare reviews of their branch in Gatlinburg, Tennessee with those of the branch in Dubai – as this might let us see how customers in diverse regions are responding to the Texas Roadhouse offering. Each of these branches had more than 1,000 reviews, which gives us a generous amount of data to analyze.

 


 

Usually, gathering data like this would involve writing code to scrape the review sites, but import.io makes this task a lot easier – they allow you to scrape sites by simply pointing and clicking at the data you want. You can sign up for a free trial here and see a handy introductory video here (but we’ll walk you through the process below).

Once you’ve picked which restaurant you want to analyze and you’ve signed up for a trial with import.io, open up the restaurant’s TripAdvisor page in import.io. To do this, just enter the URL in the New Extractor input box. If you point and click on the text of a review, import.io will scrape all of the reviews on the page and save them for you.

 


 

You’ve now scraped the reviews from a single page. But since you’ll probably want a lot more than the 10 reviews on each TripAdvisor page, we’ll show you how you scrape a few hundred in one go.

Scraping hundreds of reviews at once

You may notice that when you are browsing reviews of a restaurant on TripAdvisor, the page URL changes every time you select the next 10 reviews – it adds “-or10” for the next ten results, “-or20” for the following ten, and so on. You can see it in the URL right before the restaurant name.

In our Texas Roadhouse example, the URL goes from this:
https://www.tripadvisor.ie/Restaurant_Review-g295424-d2310358-Reviews-Texas_Roadhouse-Dubai_Emirate_of_Dubai.html

To this:
https://www.tripadvisor.ie/Restaurant_Review-g295424-d2310358-Reviews-or10-Texas_Roadhouse-Dubai_Emirate_of_Dubai.html

Import.io allows us to scrape numerous webpages at once if we upload a list of URLs in a spreadsheet. So to gather 1,000 restaurant reviews, we need to upload a spreadsheet with 100 of these URLs, with the “-or10” increasing by 10 each time.
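If you’d rather generate this URL list with a few lines of Python instead of the spreadsheet steps that follow, a quick sketch (using the Dubai URL from above – swap in your own restaurant’s URL):

```python
# Build the 100 paginated TripAdvisor URLs (10 reviews per page)
# for upload to import.io. The first page has no "-orN" segment;
# subsequent pages insert "-or10", "-or20", and so on.
BASE = ("https://www.tripadvisor.ie/Restaurant_Review-g295424-d2310358-"
        "Reviews{page}-Texas_Roadhouse-Dubai_Emirate_of_Dubai.html")

urls = [BASE.format(page="" if offset == 0 else f"-or{offset}")
        for offset in range(0, 1000, 10)]

print(len(urls))  # 100 URLs covering 1,000 reviews
print(urls[1])    # the "-or10" page
```

Write the list to a one-column CSV and you can upload it to import.io exactly as described below.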

To make your life a little easier, we’ll share the simple, six-step workaround we used for this with you here:

Step 1: Select the URL of the second page of results containing reviews of the restaurant you want to analyze. In our case it’s https://www.tripadvisor.ie/Restaurant_Review-g295424-d2310358-Reviews-or10-Texas_Roadhouse-Dubai_Emirate_of_Dubai.html

Step 2: Open up a spreadsheet and fill the first three cells of column A (A1, A2, and A3) with the URL, but only up to “-or10” – cut the remainder of the URL and paste it somewhere else for now (in our case we’ll cut “-Texas_Roadhouse-Dubai_Emirate_of_Dubai.html” and paste it into another cell).

Step 3: Edit cells A2 and A3 to end with “-or20” and “-or30”, respectively. Then select these three cells and drag the selection down until you have 100 rows covered. Excel or Google Sheets will follow the pattern you have set in the first three cells.

Step 4: Since these are not yet complete URLs, you’ll need to append the ending you cut in Step 2 to the text in each cell. You can do this by typing “=A1&[the rest of your URL]” into cell B1, and extending that formula down the column.



Step 5: copy and paste the values of this new column into column A, and save your spreadsheet. Your spreadsheet should now have one column with 100 rows.

Step 6: Open up import.io and create a new extractor, and open up settings. Click on Import URLs, and select the spreadsheet with your URLs and save them. Once you click Run URLs, import.io will start scraping the 1,000 reviews from the URLs you’ve given it. Once it’s done, download the results, and open the file in Google Sheets.


 

Analyzing the Sentiment of Reviews

So at this point, we’ve gathered 1,000 reviews of each Texas Roadhouse branch, with each review containing a customer’s feedback about their experience in the restaurant. In every review, customers express positive, negative, and neutral sentiment toward the various aspects of their experience.

AYLIEN’s Aspect-Based Sentiment Analysis feature detects the aspects mentioned in a piece of text, and then analyzes the sentiment shown toward each of these aspects. In this blog, we’re analyzing restaurants, but you can also use this feature to analyze reviews of hotels, cars, and airlines. In the restaurants domain, the Aspect-Based Sentiment Analysis feature detects mentions of seven aspects.


 

Using our Text Analysis API is easy with the Google Sheets Add-on, which you can download for free here (the Add-on comes with 1,000 credits free so you can test it out). You can complete the analysis by following these three easy steps:

Step 1: Once you’ve downloaded the Add-on, it will be available in the Add-ons menu in your Google Sheets toolbar. Open it up by selecting it and clicking Start.

Step 2: Before you begin the Aspect-based Sentiment Analysis of your reviews, first select the option from the Analysis Type menu, then select all of the cells that contain your reviews.

Step 3: To begin the sentiment analysis, click Analyze. The Text API will then extract the aspects mentioned in each review one by one, and print them in three columns next to the review – Positive, Negative, and Neutral. These results will be returned to you at a rate of about three per second, so our 2,000 reviews should take around ten minutes to analyze.

 


Results of the Aspect-Based Sentiment Analysis

At this point, each line of your spreadsheet will contain a column of the reviews you gathered, a column of the aspects mentioned in a positive tone, one with the aspects mentioned in a negative tone, and one with aspects mentioned in a neutral tone. To get a quick look into our data, we put together the following visualizations by simply using the spreadsheet’s word counting function.

First off, let’s take a look at the most-mentioned aspects in all of the reviews we gathered. To do this, all you need to do is separate every aspect listed in Google Sheets into its own cell using a simple function, and then use a formula to count them.


 

To put each aspect mentioned into its own cell, we’ll use the Split text to columns function, in the Data toolbar. This function will move every word in a cell into a cell of its own by splitting the cell horizontally – that is, if a cell in column A has three words, the Split text function will move the second two words into the adjacent cells in columns B and C.

From the pie chart, we can see that food and staff alone accounted for almost two thirds of the total mentions, with a steep drop-off after that. Beyond these two aspects, what customers cared most about was how busy the restaurant was and the value of the meal.

Knowing which aspects of the dining experience people were most likely to leave reviews about is useful, but we can go further and analyze the sentiment attached to each aspect. Let’s take a look at the sentiment attached to each aspect in each of the Texas Roadhouse branches.

To do this, use Google Sheets’ COUNTIF formula to count every time the Text API listed an aspect in the positive, negative, and neutral columns. Do this by creating a table with each aspect as rows and Positive, Negative, and Neutral as columns, and use the following formula: =COUNTIF(the range of cells that contain the aspects in each sentiment,”*aspect*”).

After you’ve entered the formula, fill it out correctly, like in the example below, where you can see the formula filled out to count the amount of times food is mentioned positively – =COUNTIF(B1:B988,“*food*”).
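If you’d rather do this tally outside the spreadsheet, the same COUNTIF-style count is a one-liner in Python (the column values below are invented stand-ins for the Add-on’s output):

```python
def count_aspect_mentions(cells, aspect):
    """Count cells (e.g. one sentiment column of the sheet) that
    mention the given aspect -- the equivalent of
    =COUNTIF(range, "*aspect*")."""
    return sum(1 for cell in cells if aspect in cell.lower())

# Hypothetical contents of the Positive column:
positive_column = ["food staff", "value", "food", ""]
print(count_aspect_mentions(positive_column, "food"))  # 2
```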


 

Once you’ve done this, fill in the results on a table like the one below, and then insert a chart from the Insert tab.


We chose a stacked bar chart, as it allows us to get a quick grasp of what aspects people were interested in and how they felt about each aspect. First off, take a look at the sentiment shown to each aspect by the reviewers of the Dubai branch. You can see that the reviews are very positive:

When we compare the reviews of the Dubai branch above with the Tennessee reviews, we can see immediately that the American branch received more positive reviews than its Dubai counterpart:

Interestingly, we can also see from the volume of the mentions of each aspect that customers in Dubai were more concerned with value than their American counterparts, where reviewers paid more attention to the restaurant staff (with most of this extra attention being negative).

 

These are just a few things that jumped out at us after a sample analysis of a couple of restaurants. If you want to get started leveraging TripAdvisor (or another review site) for your own research using the steps in this blog, sign up for a free trial with import.io here, and download our Google Sheets Add-on here (there’s no sign-up required for the Add-on and it comes with free credits so you can test it out).






Text Analysis API - Sign up





It’s now the end of an eventful year that saw the UK begin negotiations to leave the EU, the fight of the century between a boxer and a mixed martial artist, and the discovery of alternative facts. The world’s news publishers reported all of this and the countless other events that shaped 2017, leaving a vast record of what the media was talking about right through the year.

Using Natural Language Processing, we can dive into this record to generate insights about topics that interest us. Our News API has been hard at work gathering, analyzing, and indexing over 25 million news stories in near-real time over 2017. The News API extracts and stores dozens of data points on every story, from classifying the subject matter to analyzing the sentiment, to listing the people, places, and things mentioned in every one.

This enriched content provides us with a vast dataset of structured data about what the world was talking about throughout the year, allowing us to take a quantitative look at the news trends of 2017.

Using the News API, we’re going to dive into two questions on topics that dominated last year’s news coverage:

  1. What was the coverage of Donald Trump’s first year in office like?
  2. What trends affected sports coverage – consistently the most popular category – in 2017?

 

Trump’s first year in office

How much did the media publish?

Any review of 2017’s news has to begin with Donald Trump and his first year in office as President. To begin with, we wanted to see how the US President was covered over the course of the year, to see which events the media covered the most. To do this, we used the Time Series endpoint to analyze the daily volume of stories that mentioned Trump in the title.
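As a rough sketch of what such a query looks like in Python – note that the endpoint path, header names, and parameter values here are our assumptions from the News API documentation, so double-check them there – along with a small helper for reading the results:

```python
import json
import urllib.parse
import urllib.request

def trump_story_volume(app_id, api_key):
    """Daily volume of stories with 'Trump' in the title, via the
    Time Series endpoint. NOTE: path, headers, and parameter names
    are assumptions -- verify them against the News API docs."""
    params = urllib.parse.urlencode({
        "title": "Trump",
        "published_at.start": "NOW-365DAYS",
        "published_at.end": "NOW",
        "period": "+1DAY",
    })
    req = urllib.request.Request(
        "https://api.aylien.com/news/time_series?" + params,
        headers={"X-AYLIEN-NewsAPI-Application-ID": app_id,
                 "X-AYLIEN-NewsAPI-Application-Key": api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["time_series"]

def peak_day(points):
    """Date with the highest story count in a time series."""
    return max(points, key=lambda p: p["count"])["published_at"]

# Offline example with made-up counts:
sample = [{"published_at": "2017-01-20", "count": 4000},
          {"published_at": "2017-08-13", "count": 2500}]
print(peak_day(sample))  # 2017-01-20
```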

Take a look at what the News API found:

 

From this chart, you can see that the media are generally less interested in Trump now than they were during the first month or two of his presidency. Despite the coverage of the Charlottesville protests, the media fixation on Trump is slowly tapering off.

 

How did sentiment in the coverage of Trump vary over the year?

Knowing what the media was the most interested in about the President is useful information, but we can also track the sentiment expressed in each one of these stories, and see how the overall sentiment polarity changed over time.

Again using the Time Series endpoint, we can do this. Take a look at what the News API found:

You can see that the News API detected the most negative sentiment in stories about Trump around the time of his call with a fallen US soldier, where he reportedly said to the soldier’s wife, “he knew what he was signing up for”. The most positive sentiment was detected around the time of Trump’s speech in Riyadh, and as the NFL kneeling controversy began to expand.

You will also notice spikes in positive sentiment in stories about Trump around both his administration’s repeal of DACA and the spread of the NFL kneeling protests. Since both of these spikes follow shortly after the events themselves, we think the coverage is most likely about the reactions or backlash towards these developments.

 

What other things were mentioned in stories about Trump?

So we know how both the volume of stories about Trump and their sentiment varied over time. But knowing exactly what other people, organizations, and things were mentioned in these stories across the year would let us see what all of these stories were about.

The News API extracts entities mentioned in every story it analyzes. Using the Trends endpoint, we can search for the 100 entities that were most frequently mentioned in stories about Trump in 2017. These entities are visualized below.

Perhaps unsurprisingly, we can see that Trump coverage was dominated by his campaign’s and administration’s involvement with Russia. But what is quite remarkable is the scale of that dominance: Russia was mentioned in more stories with ‘Trump’ in the title than the US itself.

 

What were the most-shared stories about Trump in 2017?

Seeing which stories were shared the most on social networking sites can be very interesting. It can also yield some important business insights as the more a story is shared, the more value it generates for advertisers and publishers.

We can do this with the News API by using the Stories endpoint. Since Facebook consistently garners the most shares of news stories of all the social networks, we returned the top three stories:
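The ranking itself is trivial once you have a share count for each story (the Stories endpoint can also sort by shares server-side). A small sketch with stand-in titles and the share counts reported in the list that follows:

```python
def top_shared(stories, n=3):
    """Rank stories by Facebook share count, most-shared first."""
    return sorted(stories, key=lambda s: s["facebook_shares"],
                  reverse=True)[:n]

# Hypothetical story records echoing our results:
stories = [
    {"title": "Scaramucci removed", "facebook_shares": 1_061_494},
    {"title": "Transgender ban", "facebook_shares": 696_341},
    {"title": "Elephant trophies", "facebook_shares": 638_917},
    {"title": "Other story", "facebook_shares": 12_000},
]
print([s["title"] for s in top_shared(stories)])
```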

  1. “Trump Removes Anthony Scaramucci From Communications Director Role,” The New York Times – 1,061,494 shares.
  2. “Trump announces ban on transgender people in U.S. military,” The Washington Post – 696,341 shares.
  3. “Trump admin. to reverse ban on elephant trophies from Africa,” ABC News – 638,917 shares.

 

2017 in Sports Coverage

Sports is the subject that the media writes the most about, by quite a bit. This is reflected in the fact that the News API gathered over five million stories about sports in 2017, more than any other single subject category.

To make sense of this content at this scale, we need to first understand the subject matter of each story. To enable us to do this, the News API classifies each story according to two taxonomies.

To analyze the most popular sports, we used the Time Series endpoint to see how the daily volume of stories about the four most popular sports varied over time. We searched for stories that the News API classified as belonging to the categories Soccer, American Football, Baseball, and Basketball in the advertising industry’s IAB-QAG taxonomy. To narrow our search down a bit, we decided to look into autumn, the busiest time of year for sports.

Take a look at what the News API returned:

We can see that the biggest event to cause a spike in stories was Mike Pence’s out-of-the-ordinary appearance at an NFL game as the NFL kneeling protests expanded – a game he left after the players kneeled during the playing of the national anthem.

Other than this, the biggest spike in stories was clearly caused by the closing of the English transfer window on the last day of August, showing the dominant presence of soccer in the world’s media outlets.

 

Who and what were the media talking about?

Being able to see the spikes in the volume of sports stories around certain events is a useful resource to have, but we can use the News API to see exactly what people, places, and organizations were talked about in every one of the over 25 million stories it gathered in 2017.

To do this, we again used the Trends endpoint to find the most-mentioned entities in sports stories from 2017. Take a look at what the News API found:

You can immediately see the dominance of popular soccer clubs in the media coverage, but locations that host popular NFL and NBA teams are also featured prominently. However, soccer has a clear lead over its American competitors in terms of media attention, probably due to the global reach of soccer.

 

What were the most-shared sports stories on Facebook in 2017?

The Time Series endpoint showed us that the NFL kneeling protests were the most-covered sports event of 2017. Using the News API, we can also see how many times each one of the over 25 million stories was shared across social media.

Looking at the top three most-shared sports stories on Facebook, we can see that the kneeling protests were the subject of two of them. This shows us that the huge spike in story volume about these protests was responding to genuine public demand – people were sharing these stories with their friends and followers online.

  1. “Wife of ‘American Sniper’ Chris Kyle Just Issued Major Challenge to NFL – Every Player Should Read This,” Independent Journal-Review – 830,383 shares.
  2. “Vice President Mike Pence leaves Colts-49ers game after players kneel during anthem,” Fox News – 829,466 shares.
  3. “UFC: Dana White admits Mark Hunt’s UFC career could be over,” New Zealand Herald – 772,926 shares.

 

Use the News API for yourself

Well that concludes our brief look back at a couple of the biggest media trends of 2017. If there are any subjects of interest to you, try out our free two-week trial of the News API and see what insights you can extract. With the easy-to-use SDKs and extensive documentation, you can make your first query in minutes.

 






News API - Sign up






Being able to leverage news content at scale is an extremely useful resource for anyone analyzing business, social, or economic trends. But in order to extract valuable insights from this content, we sometimes need to build analysis tools that help us understand it.

To serve the needs of everyone who needs a simple, end-to-end solution for this complex task, we’ve put together a fully-functional example of a RapidMiner process that sources data from the AYLIEN News API and analyzes this data using some of RapidMiner’s operators.

aylien-rapidminer-banner

What can you do with the News API in RapidMiner?

With news content now accessible at web scale, data scientists are constantly creating new ways to generate value with insights from news content that were previously almost impossible to extract. Every month, our News API gathers millions of stories in near-real time, analyzes every news story as it is published, and stores each of them along with dozens of extracted data points and metadata.

Equipped with this structured data about what the world’s media is talking about, RapidMiner users can leverage the extensive range of tools the studio has to offer, including:

  • 1,500+ built-in operations & Extensions to dive into your data
  • 100+ data modelling & machine learning operators
  • Advanced visualization tools.

Using a 3-D scatter plot to visualize news data with four variables in RapidMiner Studio

How do I get started with the News API process?

For this blog, we’re going to showcase an example of how you can use our News API within RapidMiner to build useful content analysis processes that aggregate and analyze news content with ease. We’ve picked a fun little example that analyzes articles from TechCrunch and builds a classification model to predict which reporter wrote any new article it is shown (protip: you can use the same model to pick which TechCrunch journalist you should target for your pitch!). We hope this blog sparks some creative ideas and use cases around combining RapidMiner and our News API.

This sample process consists of two main steps:

  1. Gathering enriched news content from the News API using the Web Mining extension
  2. Building a classification model by using RapidMiner’s Naive Bayes operator.

If you are unfamiliar with RapidMiner, there are some great introductory videos and walkthroughs for beginners on their YouTube channel.

So let’s get started!

We’ve made it really easy to get started with the News API and RapidMiner: download this pre-built process and open it with RapidMiner. Next, grab your credentials for the News API by signing up for our free two-week trial.

Once you’ve downloaded the process and opened it up with RapidMiner, you’ll see the main operators outlined in the Process tab. You will see that there are seven operators in total, the first three gather data from the News API while the last four train the classifier.


To make your first calls to the News API, the first thing you need to do is build your search criteria. To build your News API query, click on the Set Macros operator in the top left of your console. Once you’ve selected the operator, clicking on the Edit List button in the Parameters tab will show you the list of parameters for your News API query. Enter the API credentials (your API key and application ID) that you obtained from the News API developer portal when you signed up, and configure your search parameters – check out the full list of query parameters in our News API documentation to build your search query.


The purpose of this blog is to build a classifier that will predict which TechCrunch journalist wrote an article. In order to do this, we first need to teach the model by gathering relevant training data from TechCrunch. To get this data, we built a query that searched for every article published on the site in the past 30 days and returned the author of each one, along with the contents of the articles they wrote. The News API can return up to 100 results at a time, but since we wanted more than 100 articles, we used pagination to iterate over five pages of search results, giving us 500 results. You can see the query we used in the screenshot above.

Importantly, after you have defined these parameters in the Set Macros operator, you’ll need to make the same changes by editing the query list in the Get Page operator within the Process Loop. To do this, double-click on the Loop icon in the Process tab, then double-click the Get Page icon, and select the Edit List button next to Query Parameters.


When you’re entering the parameters, be sure to enter every parameter you entered in the previous window and follow the convention already set in the list (entering the parameter in the “%{___}” format).

News API Results

Once you have defined your parameters in both lists, hit the Run (play) button at the top of the console and let RapidMiner run your News API query. Once it has finished running, you can view the results in the Results window. Below you can see a screenshot of the enriched results with the dozens of data points that the News API returns.


Having access to this enriched news content in RapidMiner allows you to extract useful insights from unstructured data. After running the analysis, you can browse the results of the search using simple visualizations – showing, for example, sentiment or, as in the graph below, authorship, which tells us which authors published the most articles in the time period we set.

Screenshot (883)

 

Training a Classifier

For the sample analysis in this blog, we’re building a classifier using RapidMiner’s Naive Bayes operator.

Naive Bayes is a common algorithm used in Machine Learning for data classification. You can read more about Naive Bayes in the explainer blog we wrote for novices, which talks you through how the algorithm works. Essentially, this classifier will guess which author new articles belong to by learning from features in the training data – the news content we retrieved from our News API results. By analyzing the most common features in the articles from each author in these results, the model will learn that different words and phrases are more likely to appear in articles from each author.
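To make the idea concrete, here is a toy multinomial Naive Bayes classifier in pure Python. It is a simplified sketch for illustration (RapidMiner's Naive Bayes operator handles all of this for you), but it shows how per-author word frequencies turn into predictions:

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """A minimal multinomial Naive Bayes text classifier (illustration only)."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.label_counts = Counter(labels)      # label -> number of documents
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.lower().split()
            self.word_counts[label].update(words)
            self.vocab.update(words)

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.label_counts.items():
            # log prior + sum of log likelihoods with add-one smoothing
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in words:
                p = (self.word_counts[label][w] + 1) / (total_words + len(self.vocab))
                score += math.log(p)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on a handful of invented snippets, the classifier picks the author whose vocabulary best matches a new article, which is exactly the behavior we rely on below.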

For example, take a look below at how our classifier has learned which writers are most likely to talk about ‘cryptocurrency’. You can inspect what your classifier has learned by selecting the Attribute button in the top left corner.

Screenshot (879)

Results

Once the process has fully run, it will retrieve and process the news content and train a Naive Bayes classifier that, given the body of an article, tries to predict which TechCrunch journalists are its likely authors.

Additionally, RapidMiner will also evaluate this classifier for us on a held-out subset of the data we retrieved from the News API, by comparing the true labels (known authors) to the model’s predictions (predicted authors) on the test set, and providing us with an accuracy score and a confusion matrix:


Screen Shot 2017-12-18 at 5.29.56 PM
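Under the hood, that evaluation boils down to simple counting. Here is a minimal Python sketch (not RapidMiner's internals) of how accuracy and a confusion matrix fall out of the true and predicted author labels:

```python
from collections import Counter


def evaluate(true_labels, predicted_labels):
    """Compute accuracy and a confusion matrix from parallel label lists."""
    assert len(true_labels) == len(predicted_labels)
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    accuracy = correct / len(true_labels)
    # confusion[(true, predicted)] = number of test articles in that cell
    confusion = Counter(zip(true_labels, predicted_labels))
    return accuracy, confusion
```

Diagonal cells of the confusion matrix are correct predictions; off-diagonal cells show which authors the model confuses with one another.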

There are many ways to improve the performance of this classifier, for example by using a more advanced classification algorithm like SVM instead of Naive Bayes. In this post, our goal was to show you how easy it is to retrieve news content from our News API and load it into RapidMiner for further analysis and processing.

Things to try next:

  • Try changing your News API query to repeat this process for journalists from a different news outlet
  • Try using a more powerful algorithm such as SVM or Logistic Regression (RapidMiner includes implementations for many different classifiers, and you can easily replace them with one another)
  • Try to apply a minimum threshold on the number of articles that must exist for each author that the model is trained on

This process is just one simple example of what RapidMiner’s analytic capabilities can do with enriched news content. By running the first three operators on their own, you can take a look at the enriched content that the News API generates and begin to leverage RapidMiner’s advanced capabilities on an ever-growing dataset of structured news data.

To get started with a free trial of the News API, click on the link below, and be sure to check back on our blog over the coming weeks to see some sample analyses and walkthroughs.




News API - Sign up




 


Over the past few weeks, the hype about Bitcoin has reached fever pitch, as the cryptocurrency’s rise in price accelerated and the Dollar value of one Bitcoin crossed $10,000 (two weeks later, it’s at over $12,000). Considering you could buy one Bitcoin for $200 in 2015, this is pretty impressive.

 

2000px-Bitcoin_logo.svg

 

But how can we explain this rise? And is it just mania?

A great article in The Atlantic talked about how all currencies are a consensual decision – from a bag of beads to modern banknotes, if a large enough group of people decide something is a currency, then it becomes one.

This is one way to explain the phenomenal rise in value of Bitcoin – from news reports on Russian bots influencing the 2016 election to people noticing how Facebook tracks your behaviour to sell you ads, the average person in the street now knows far more about digital technology than they did in 2009, when Bitcoin was launched. When this is coupled with a popular distrust of banks, it’s easy to see how Bitcoin gains its intrinsic value.

So if the value of Bitcoin is dependent on what large groups of people think about cryptocurrency in general, then understanding what large groups of people are reading about Bitcoin is important, because the media coverage of Bitcoin informs buying decisions. Last month, our News API gathered, analyzed, and indexed 2.6 million news stories as they were published. In this blog, we’re going to look into these stories to see what the media was saying about Bitcoin in November.

We’re going to look at three things:

  • What is the scale of this hype and how is it accelerating?
  • What concepts do the media talk about in stories about Bitcoin, and have the popular concepts changed since the hype has grown?
  • Has the media started to express more positive, negative, or neutral sentiment about Bitcoin since its price shot up?

 

How big is the hype?

First of all, we need to quantify all of this media attention to see how big the hype actually is by finding exactly how many stories were published about Bitcoin last month and how this compares to previous months.


You can also see that over November, the media interest in Bitcoin increased as the cryptocurrency’s value grew (despite a dip over the Thanksgiving weekend), with this interest peaking on the first day that the value of Bitcoin hit the 10,000-dollar mark. So we can see the media hype was focused on Bitcoin crossing the $10,000 milestone, rather than any definite indicators of rising value in the future.

 

What else are the media talking about when they talk about Bitcoin?

Knowing the scale of hype about Bitcoin is useful, but knowing what was being talked about in these almost 14,000 stories would let us look even deeper into the Bitcoin saga. When our News API indexes a story, it analyzes and stores dozens of data points, one of which is a list of all other concepts mentioned in the story. Having access to this list for every one of the millions of stories our News API gathers every month gives us a really useful dataset to query.
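As a sketch of why that list is so useful, here is how the most-mentioned concepts could be tallied in a few lines of Python, assuming each story reduces to a dict with a "concepts" list (a simplified stand-in for the News API's actual response shape):

```python
from collections import Counter


def top_concepts(stories, n=3):
    """Count concept mentions across stories and return the n most common.

    `stories` is a list of dicts, each with a "concepts" list; this shape
    is an assumption for illustration, not the API's exact schema.
    """
    counts = Counter()
    for story in stories:
        # count each concept once per story so one story can't dominate
        counts.update(set(story.get("concepts", [])))
    return counts.most_common(n)
```

This per-story counting is essentially what the Trends endpoint does for us across millions of stories at once.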

So we decided to analyze this in two periods – June and November. This will allow us to see if the media has started talking about other subjects since the hype really started taking off in the Autumn.
To analyze this, we used the Trends endpoint to return the most-mentioned concepts in stories with “Bitcoin” in the title. Take a look below and see what else was mentioned.

 

You can see that the descriptive concepts are the most popular – Ethereum, blockchain, and Coinbase all put Bitcoin in context, and these are all concepts that would be mentioned when talking about Bitcoin in general. This would be very useful if we were looking for related stories, but for our analysis we need to look past these most-mentioned concepts and pay attention to what else was talked about.

Importantly, Japan is mentioned prominently, prompted by the country’s largest Forex market opening up to Bitcoin trading.

In contrast with this, looking at the results of the same search in November’s stories, we can see new concepts being mentioned.

 

Jamie Dimon and CME Group are two of the most-mentioned concepts here, resulting from JP Morgan Chase’s decision to start offering trades on Bitcoin futures (from CME), despite their CEO (Dimon) publicly declaring only six weeks earlier that he’d fire anyone “stupid enough” to deal in the cryptocurrency.

Remember that November’s story volume is 500% greater than June’s. If November’s story volume was driven by US-based traders beginning to deal in Bitcoin, while June’s story volume included Japanese traders’ much earlier adoption of the cryptocurrency, this gives us a hint that perhaps a lot of the hype around Bitcoin is focused on the news of a few well known financial institutions, which interestingly enough are all based in the US.

 

What was the sentiment of the coverage?

So we know the volume of the Bitcoin coverage and what was being talked about, but knowing whether all of this coverage was positive, negative, or neutral would let us understand the sentiment shown toward this hype.

Our News API analyzes the sentiment of every story it gathers, so using the Time Series Endpoint again, we can analyze the sentiment of stories with “Bitcoin” in the title over the past few months.

 

You can see that the sentiment split has stayed roughly the same since June, except in August, when the coverage became more favourable, and in September, when it became more negative. This spike in negative coverage was likely caused by Jamie Dimon’s public remarks about the cryptocurrency being a fraud.

Interestingly, you can see the coverage gets much more positive in November, when the coverage exploded.

If you want to analyze the coverage of Bitcoin in greater detail than our brief layman’s overview here, you can start making queries to our News API in minutes, even without writing a line of code. Sign up for a two-week trial free of charge, with no card details required by clicking on the image below.






News API - Sign up





It’s an exciting time here at AYLIEN – in the past couple of months, we’ve moved office, closed a funding round, and added six people to the team. We’re delighted to announce our most recent hire and our first Chief Architect and Principal Engineer, Hunter Kelly.

Hunter

The AYLIEN tech infrastructure has grown to quite a scale at this point. In addition to serving over 30,000 users with three product offerings, we’re also a fully-functional AI research lab that houses five full-time researchers, who in turn feed their findings back into the products. With such a complex architecture and backends that have highly demanding tasks, bringing an engineer with the breadth and quality of experience that Hunter has is a huge boost to us as we move into the next phase of our journey.

At first glance, Hunter’s career has followed a seemingly meandering path through some really interesting companies. After graduating from UC Berkeley in the 90s, he joined the Photoscience Department at Pixar in California, became one of the first engineers in Google’s Dublin office, and at NewBay, he designed and built a multi-petabyte storage solution for handling user-generated content, still in use by some of the world’s largest telcos today. Hunter is joining us from Zalando’s Fashion Insight Centre, where as the first engineer in the Dublin office he kicked off the Fashion Content Platform, which was shortlisted as a finalist in the 2017 DatSci Awards.

The common thread in those roles, while perhaps not obvious, is data. Hunter brings this rich experience working on data, both from an engineering and data science perspective, to focus on one extremely important problem – how can we leverage data to solve our hardest problems?

This question is central to AI research, and Hunter’s expertise is a perfect fit with AYLIEN’s mission to make Natural Language Processing hassle-free for developers. Our APIs handle the heavy lifting so developers can leverage Deep NLP in a couple of lines of code, and the ease with which our users do this is down to the great work our science and engineering teams do. Adding Hunter to the intersection of these teams will add a huge amount to our capabilities here and we’re really excited about the great work we can get done.

Here’s what Hunter had to say about joining the team:

“I’m really excited to be joining AYLIEN at this point in time. I think that AI and Machine Learning are incredibly powerful tools that everyone should be able to leverage. I really look forward to being able to bring my expertise and experience, particularly with large-scale and streaming data platforms, to AYLIEN to help broaden their already impressive offerings. AI is just reaching that critical point of moving beyond academia and reaching wide-scale adoption. Making its power accessible to the wider community beyond very focused experts is a really interesting and exciting challenge.”

When he’s not in AYLIEN, Hunter can be found spending time with his wife and foster children, messing around learning yet another programming language, painting minis, playing board games, tabletop RPGs, and wargames, or spending too much time playing video games. He’s also been known to do some Salsa dancing, traveling, sailing, and scuba diving.

Check out Hunter’s talks on his most recent work at ClojureConj and this year’s Kafka Summit.






Text Analysis API - Sign up




 


Juggernaut is an experimental neural network library, written in Rust. It implements a feed-forward neural network that uses gradient descent to fit the model and train the network. Juggernaut enables us to build web applications that can train and evaluate a neural network model in the context of the web browser. This is done without any servers or backends, and without using JavaScript to train the model.

Juggernaut’s developer-friendly API makes it easy to interact with. You can pass a dataset to Juggernaut from a CSV file or simply use the programmatic API to add documents to the model, and then ask the framework to train it. Juggernaut implements most activation functions as well as a few different cost functions, including Cross Entropy.

Juggernaut has a demo page, written with React and D3.js which illustrates the network, weights, and loss during a training session.

Demo

The demo page enables users to define a few options before starting the training session. These options are:

  • Dataset
  • Learning rate
  • Number of epochs (iterations)

In order to make the demo page more intuitive and easier to use, there are a few predefined datasets available on the page which load and illustrate data points from a CSV file. Each dataset has 3 classes (orange, blue, and green) and 2 features (X and Y).

Juggernaut three blobs one

After selecting the dataset and defining the options, you can start the training session by clicking the “Train” button on the page. Clicking on this button will spawn a new thread (web worker) and pass the dataset and parameters to the created thread.

During the training session, you can see the number of epochs, loss, and weights of the network. The web worker communicates with the main thread of the browser and sends the result back to the render thread to visualize each step of training.

Juggernaut Neural Net

The number of layers is predefined in the application. We have one input layer, two hidden layers, and one output layer. The hidden layers use the ReLU activation function, and the output layer uses Softmax with a Cross Entropy cost function.
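For readers curious about those pieces, here is a minimal pure-Python sketch of the ReLU, softmax, and cross-entropy functions described above (Juggernaut itself implements these in Rust; this is just an illustration of the math):

```python
import math


def relu(v):
    """Rectified linear unit: zero out negative activations."""
    return [max(0.0, x) for x in v]


def softmax(v):
    """Turn raw output scores into a probability distribution."""
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Loss = negative log-probability assigned to the correct class."""
    return -math.log(probs[true_index])
```

Because softmax outputs sum to 1, the cross-entropy loss shrinks exactly as the network assigns more probability to the correct class, which is what gradient descent pushes the weights toward.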

Compiling Rust to Web Assembly

Juggernaut’s demo page uses Web Assembly and HTML5 Web Worker to spawn a new thread inside the context of a web browser, and communicates between the web worker and the browser’s render thread (main thread) to train and evaluate the model.

Below is the process of compiling Rust to Web Assembly:

rust-javascript-17-638

Juggernaut does not use any JavaScript code to train and evaluate a model. However, it is still possible to run Juggernaut in modern web browsers without any backend servers, as most modern browsers, including mobile browsers on Android and iOS, support WebAssembly (source: http://caniuse.com/#search=wasm).

Importantly, the demo page uses a separate thread to train and evaluate a model and does not block the main thread or render thread of the web browser. So you can still interact with the UI elements of the page during training or you can keep the training session running for some time until receiving the accurate evaluation from the framework.

“Juggernaut”?

The real Juggernaut

Juggernaut is a Dota 2 hero, and one I like. Juggernaut is powerful when he has enough farm.






Text Analysis API - Sign up






With the world’s media now publishing news content at web scale, it’s possible to leverage this content and discover what news the world is consuming in real time. Using the AYLIEN News API, you can look for high-level trends in global media coverage and also dive into the content to discover what the world is talking about.

So in this blog we’re going to take a high-level look at some of the more than 2.5 million stories our News API gathered, analyzed, and indexed last month, and see what we find. Instead of searching for stories using a detailed search query, we’re simply going to retrieve articles written in English and see what patterns emerge in this content.

To understand the distribution of articles published over time, we’ll use the Time Series endpoint. This endpoint allows us to see the volume of stories published over time, according to whatever parameters you set. To get an overview of recent trends in content publishing, we simply set the language parameter to English. Take a look at the pattern that emerges over the past two months:

The first thing you’ll notice is how steady the publishing industry’s patterns are – there is a steady output of around 60,000 new stories in English every weekday, dropping to about 30,000 stories on weekends. This pattern is very regular, except in the last week of the month, when a small but noticeable spike in story volume occurs.
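As a sketch of how you might quantify that weekday/weekend split yourself, here is a small Python helper, assuming the Time Series buckets reduce to (date, count) pairs (an illustrative shape, not the API's exact response):

```python
from datetime import date


def weekday_weekend_averages(series):
    """Average daily story volume, split into weekday vs weekend buckets.

    `series` is a list of (date, count) pairs.
    """
    weekday, weekend = [], []
    for day, count in series:
        # Monday is 0 and Sunday is 6, so 5 and up means weekend
        (weekend if day.weekday() >= 5 else weekday).append(count)
    avg = lambda xs: sum(xs) / len(xs) if xs else 0.0
    return avg(weekday), avg(weekend)
```

Run over two months of buckets, a roughly 2:1 weekday-to-weekend ratio like the one described above would show up immediately.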

What caused this spike in story volume?

To find the cause of these extra 2,000 – 4,000 stories, we browsed the volume numbers of the biggest categories to see if we could identify a particular subject category which followed the same pattern. We found an unmistakable match in the Finance category – as well as taking place in the same period, this spike also matches the volume of extra stories – roughly an extra 2,000 stories above the daily average.

In addition to this, we also found a similar spike at the end of July. Take a look at the daily story volume of finance stories published over the past six months:

What topics were discussed in this content?

Knowing that the increase in story volume in the last week of October was due to a spike in the number of Finance stories is great, but we can go further and see what was actually being talked about in these stories. To do this, we leveraged our News API’s capability to analyze the keywords, entities and concepts mentioned. The News API allows you to discover trends like this in news stories using the Trends endpoint.

Analyzing keywords lets us get an overview of what people, organizations, and things are mentioned most in the roughly 10,000 finance stories published in that week. Looking at the chart below, it’s pretty easy to see what caused the spike.

From the results shown in the bubble chart it’s easy to see that keywords and concepts like “quarter”, “earnings”, “financial” and “company” were identified. From this analysis we can make a good guess that a lot of this content published in the last week of October was related to quarterly results and financial reporting by companies. This makes a lot of sense, since in the Time Series chart we could see that a similar spike occurred at the end of July, three months ago.

We thought this was interesting – why was so much published about something so arcane to the general public? In the first graph, we could see the spike that these quarterly earnings reports caused on story volume was visible on a chart of all stories published in the English language. But we don’t think quarterly earnings reports would make the everyday news content consumer drop everything they are doing to check the news.

Why were people interested in quarterly earnings reports?

So we know from the spike in story volume that the media were interested in the quarterly earnings reports. But what were social media users interested in? To find out, we decided to gather the most-shared stories from the Finance categories from the last week of October – the week of the spike. This will tell us if there was a particular aspect to the earnings reports that prompted such a spike in this topic.

The News API lets us do that with the Stories endpoint, by simply searching for the most-shared stories from the Finance category during last week of October across Facebook, LinkedIn, and Reddit.

You can see that of the nine stories we gathered, five are about the earnings reports, and despite this being quite a business-focused topic, Facebook was the network that this content was most popular on, not LinkedIn.

Facebook:

  1. “Swiss bank UBS reports 14 percent growth in 3Q net profit,” Associated Press, 39,907 shares
  2. “Amazon shares soar as earnings beat expectations,” Associated Press, 39,870 shares
  3. “US stocks higher as banks and technology companies rebound,” Associated Press, 39,867 shares

LinkedIn:

  1. “Jeff Bezos is now the richest man in the world with $90 billion,” CNBC, 7,318 shares
  2. “CVS Reportedly Looking To Buy Aetna Insurance For $66 Billion,” Consumerist, 2,172 shares
  3. “New Uber Visa Credit Card From Barclays Coming Next Week,” Forbes, 1,967 shares

Reddit:

  1. “First reading on third-quarter GDP up 3.0%, vs 2.5% rise expected,” CNBC, 14,807 upvotes
  2. “New study says Obamacare premiums will jump in 2018 — in large part because of Trump,” Business Insider, 6,255 upvotes
  3. “MSNBC host literally left his seat to fact-check Jim Renacci,” Cleveland, 4,641 upvotes

 

You can see above that of the nine most-shared stories on social media on the week of the spike caused by the quarterly earnings reports, only five actually mention the reports. This suggests to us that although the media published a huge amount on the reports, people in general weren’t too interested in them.

Since we are basing this assumption on just a few headlines above, it’s just a hunch. But with the News API, we can put hunches like this to the test by analyzing quantitative data.

Exactly how interested were people in quarterly earnings reports?

In order to be a bit more accurate about how interested people were in the quarterly earnings reports, we compared the share counts of the 100 most-shared stories from the week of the spike with those from the corresponding week last month. We can do this using the News API’s Stories endpoint, since the News API monitors the share count of every story it indexes. We’re going to focus on Facebook, since in the previous section it was the social network most interested in the quarterly earnings reports.

Take a look at how often people were sharing the 100 most-shared stories on Facebook in the last week of October:  

You can see that people were sharing Finance stories less often in the last week of October than in the same period in September. This is interesting because we already saw that there were over three times more Finance stories published in the same period, so we have to assume that people on social media generally just weren’t interested in these stories.

This piece of information is interesting because it shows us that looking at viral stories about a subject can mislead us about how interested people are in that subject.

Well that concludes this month’s roundup of news with the News API. If you want to dive into the world’s news content and use text analysis to extract insights, our News API has a two-week free trial, which you can activate by clicking the link below.






News API - Sign up






In a previous blog, we showed you how easy it is to set up a simple social listening tool to monitor chatter on Twitter. We showed you how, using a single Python script, you can gather recent Tweets about a topic that interests you, analyze the sentiment of these tweets, and produce a visualization showing the sentiment.

But as we also pointed out in that blog, Twitter users post 350,000 new Tweets every minute, giving us a live dataset that will contain new information every time we query it. In fact, Twitter is so responsive to emerging trends that the US Geological Survey uses it to detect earthquakes – because the rate at which people Tweet about these earthquakes outpaces even their own data pipelines of geological and seismic data.

So if we can rely on Twitter users to Tweet about earthquakes as they happen, we can absolutely rely on them to Tweet their opinions about subjects important to you, in real time. While a single sentiment analysis gives you a snapshot of what people are saying at one moment in time, it’s even more useful to analyze sentiment on a regular basis in order to understand how public opinion, or better yet your customers’ opinions, can change over time.

Why is having an automated sentiment analysis workflow useful?

There are two main reasons.

  • It will allow you to keep abreast of any trends in consumer sentiment shown towards you, your products, or your competitors.
  • Over time, this workflow will build up an extremely valuable longitudinal dataset that you can compare with sales trends, website traffic, or any of your KPIs.

In this blog, we’re going to show you how you can turn the original script we shared into a fully automated tool that will run your analysis at a given time every day, or however frequently you schedule it to run. It will gather Tweets, analyze their sentiment, and, if you want it to, produce a temporary visualization so you can read the data quickly. It will also gradually build up a record of the Tweets and their sentiment by adding each day’s results to the same CSV file.
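As a hint of what that longitudinal record makes possible, here is a small sketch that aggregates an accumulated results CSV into per-day sentiment counts, assuming the column names ("Time", "Sentiment") used by the script in this post:

```python
import csv
from collections import Counter, defaultdict


def daily_sentiment_counts(csv_path):
    """Aggregate the accumulated results CSV into per-day sentiment counts.

    Assumes rows with "Time" (e.g. "2018-02-01 09:00:00") and "Sentiment"
    ("positive" / "negative" / "neutral") columns.
    """
    per_day = defaultdict(Counter)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            day = row["Time"][:10]  # keep just the YYYY-MM-DD prefix
            per_day[day][row["Sentiment"]] += 1
    return dict(per_day)
```

Once a few weeks of results have built up, a summary like this is what you would line up against sales trends or website traffic.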

 

The 4 Steps for setting up your Twitter Sentiment Analysis Tool (in 20 mins)

This blog is split up into 4 steps, which all together should take about 20 minutes to complete:

  1. Get your credentials from Twitter and AYLIEN – both of these are free. (At 10 minutes, this is the most time-consuming part.)
  2. Set up a folder and copy the Python script into it (2 mins)
  3. Run the Python script for your first sentiment analysis (2 mins)
  4. Schedule the script to run every day – we’ve included a detailed guide for both Windows and Mac below. (5 mins)

 

Step 1: Getting your credentials

If you completed the last blog, you can skip this part, but if you didn’t, follow these three steps:

  1. Make sure you have the following libraries installed (which you can do with pip): tweepy, matplotlib, and aylien-apiclient – the three imported by the script below.
  2. Get API keys for Twitter:
  • Getting the API keys from Twitter Developer (which you can do here) is the most time-consuming part of this process, but this video can help you if you get lost.
  • What it costs & what you get: the free Twitter plan lets you download 100 Tweets per search, and you can search Tweets from the previous seven days. If you want to upgrade from either of these limits, you’ll need to pay for the Enterprise plan ($$)
  3. Get API keys for AYLIEN:
  • To do the sentiment analysis, you’ll need to sign up for our Text API’s free plan and grab your API keys, which you can do here.
  • What it costs & what you get: the free Text API plan lets you analyze 30,000 pieces of text per month (1,000 per day). If you want to make more than 1,000 calls per day, our Micro plan lets you analyze 80,000 pieces of text for $49/month.

 

Step 2: Set up a folder and copy the Python script into it

Setting up a folder for this project will make everything a lot tidier and easier in the long run. So create a new one, and copy the Python script below into it. After you run this script for the first time, it will create a CSV in the folder, which is where it will store the Tweets and their sentiment every time it runs.

Here is the Python script:


import os
import sys
import csv
import tweepy
import matplotlib.pyplot as plt

from collections import Counter
from aylienapiclient import textapi

open_kwargs = {}

if sys.version_info[0] < 3:
    input = raw_input
else:
    open_kwargs = {'newline': ''}



# Twitter credentials
consumer_key = "Your consumer key here"
consumer_secret = "your secret consumer key here"
access_token = "your access token here"
access_token_secret = "your secret access token here"

# AYLIEN credentials
application_id = "your app id here"
application_key = "your app key here"

# set up an instance of Tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# set up an instance of the AYLIEN Text API
client = textapi.Client(application_id, application_key)

# open the previous CSV file (if any) and read the last entry's Tweet ID,
# so each run only gathers Tweets newer than those already stored
max_id = 0

file_name = 'Sentiment_Analysis_of_Tweets_About_Your_query.csv'

if os.path.exists(file_name):
    with open(file_name, 'r') as f:
        for row in csv.DictReader(f):
            max_id = row['Tweet_ID']
else:
    with open(file_name, 'w', **open_kwargs) as f:
        csv.writer(f).writerow([
            "Tweet_ID",
            "Time",
            "Tweet",
            "Sentiment"])

results = api.search(
    lang="en",
    q="Your_query -rt",
    result_type="recent",
    count=10,
    since_id=max_id
)

results = sorted(results, key=lambda x: x.id)

print("--- Gathered Tweets \n")

# open a csv file to store the Tweets and their sentiment
with open(file_name, 'a', **open_kwargs) as csvfile:
    csv_writer = csv.DictWriter(
        f=csvfile,
        fieldnames=[
                    "Tweet_ID",
                    "Time",
                    "Tweet",
                    "Sentiment"]
    )

    print("--- Opened a CSV file to store the results of your sentiment analysis... \n")

    # tidy up the Tweets and send each to the AYLIEN Text API
    for c, result in enumerate(results, start=1):
        tweet = result.text
        # strip non-ASCII characters, then decode back to str for the API call
        tidy_tweet = tweet.strip().encode('ascii', 'ignore').decode('ascii')
        tweet_time = result.created_at
        tweet_id = result.id

        if not tweet:
            print('Empty Tweet')
            continue

        response = client.Sentiment({'text': tidy_tweet})
        csv_writer.writerow({
        	"Tweet_ID": tweet_id,
        	"Time": tweet_time,
            'Tweet': response['text'],
            'Sentiment': response['polarity'],
        })

        print("Analyzed Tweet {}".format(c))

# count the data in the Sentiment column of the CSV file
with open(file_name, 'r') as data:
    counter = Counter()
    for row in csv.DictReader(data):
        counter[row['Sentiment']] += 1

    positive = counter['positive']
    negative = counter['negative']
    neutral = counter['neutral']

# declare the variables for the pie chart, using the Counter variables for "sizes"
colors = ['green', 'red', 'grey']
sizes = [positive, negative, neutral]
labels = 'Positive', 'Negative', 'Neutral'

# use matplotlib to plot the chart
plt.pie(
    x=sizes,
    shadow=True,
    colors=colors,
    labels=labels,
    startangle=90
)

plt.title("Sentiment of {} Tweets about Your Subject".format(sum(counter.values())))
plt.show()

If you want any part of that script explained, our previous blog breaks it up into pieces and explains each one.

Step 3: Run the Python script for your first sentiment analysis

Before you run the script above, you’ll need to make two simple changes. First, enter your access keys from Twitter and AYLIEN. Second, don’t forget to enter what it is you want to analyze! You’ll need to do this in two places: first, on line 40, change the name of the CSV file that the script is going to create (currently it’s file_name = ‘Sentiment_Analysis_of_Tweets_About_Your_query.csv’); second, on line 56, replace the text “Your_query” with whatever your query is.

Also, if you don’t need a daily visualization and you’re more interested in building up a record of sentiment analysis, just delete everything below “# count the data in the Sentiment column of the CSV file”. The script will still carry out the sentiment analysis every day and add the results to the CSV file, but won’t show a visualization.

After you’ve run this script, your folder should contain the Python script and the CSV file.

Step 4: Schedule the Python script to run every day.

So now that you’ve got your Python script saved in a folder along with a CSV file containing the results of your first sentiment analysis, we’re ready for the final step – scheduling the script to run to a schedule that suits you.

Depending on whether you use Windows or Mac, there are different steps to take here, but you don’t need to install anything either way. Mac users will use Cron, whereas Windows users can use Task Scheduler.

Windows:

  1. Open Task Scheduler (search for it in the Start menu)
  2. Click into Task Scheduler Library on the left of your screen
  3. Open the ‘Create Task’ box on the right of your screen
  4. Give your task a name
  5. Add a new Trigger: select ‘daily’ and enter the time of day you want your computer to run the script. Remember to select a time of day that your computer is likely to be running.
  6. Open up the Actions tab and click New
  7. In the box that opens up, make sure “Start a program” is selected
  8. In the “Program/Script” box, enter the path to your python.exe file and make sure this is enclosed in quotation marks (so something like “C:\Program Files (x86)\Python36-32\python.exe”)
  9. In the “Add arguments” box, enter the path to the dailySentiment.py file, including the file itself (so something like C:\Users\Yourname\desktop\folder\dailySentiment.py). No quotation marks are needed here.
  10. In the “Start in” box, enter the path to the containing folder with your script and CSV file. (C:\Users\Yourname\desktop or wherever your folder is\the name of your folder\). Again, no quotation marks are needed.
  11. You’re done!

Mac:

  1. Open up a Terminal
  2. Type “crontab -e” to create or edit your Cron jobs.
  3. Scheduling the Cron job takes one line of code, that is split into three parts.
  4. First, type the time you want your script to run, in the Cron format: “minute hour day-of-month month day-of-week”, all as integers (with * as a wildcard), separated by single spaces. For example, if you want your script to gather Tweets every day at 9AM, this first part of the line will read “0 9 * * *” – minute zero of hour nine, every day of every month.
  5. Second, leave a space after this first part and type the path to your Python executable. This will usually read something like “/System/Library/Frameworks/Python.framework/Python”
  6. Finally, enter the path to the Python script in your folder. For example, if you saved the scripts to a folder on your desktop, the path will be something like “Users/Your name/Desktop/folder name/dailySentiment.py”
  7. The full line of code in your Terminal will look something like “0 9 * * * /System/Library/Frameworks/Python.framework/Python Users/Your name/Desktop/folder name/dailySentiment.py”.
  8. Now hit Escape, then type “:wq”, and hit Enter.
  9. To double check that your Cron job is scheduled, type “crontab -l” and you should see your job listed.

If you run into trouble, get in touch!

With those four steps, your automated workflow should be up and running, but depending on how your system is set up, you could run into an error along the way. If you do, don’t hesitate to get in touch by sending us an email, leaving a comment, or chatting with us on our site.

Happy analyzing!






Text Analysis API - Sign up





Twitter users around the world post around 350,000 new Tweets every minute, creating almost 6,000 140-character pieces of information every second. This makes Twitter a hugely valuable resource from which you can extract insights by using text mining tools like sentiment analysis.

Within the social chatter being generated every second, there are vast amounts of hugely valuable insights waiting to be extracted. With sentiment analysis, we can generate insights about consumers’ reactions to announcements, opinions on products or brands, and even track opinion about events as they unfold. For this reason, you’ll often hear sentiment analysis referred to as “opinion mining”.

With this in mind, we decided to put together a useful tool built on a single Python script to help you get started mining public opinion on Twitter.

What the script does

Using this one script you can gather Tweets with the Twitter API, analyze their sentiment with the AYLIEN Text Analysis API, and visualize the results with matplotlib – all for free. The script also provides a visualization and saves the results for you neatly in a CSV file to make the reporting and analysis a little bit smoother.

Here are some of the cool things you can do with this script:

  • Understand the public’s reaction to news or events on Twitter
  • Measure the voice of your customers and their opinions on you or your competitors
  • Generate sales leads by identifying negative mentions of your competitors

You can see the script running a sample analysis of 50 Tweets mentioning Tesla in our example GIF below – storing the results in a CSV file and showing a visualization. The beauty of the script is that you can search for whatever you like, and it will run your Tweets through the same analysis pipeline. 😉

Tesla Sentiment

 

Installing the dependencies & getting API keys

Since doing a sentiment analysis of Tweets with our API is so easy, installing the libraries and getting your API keys is by far the most time-consuming part of this blog.

We’ve collected them here as a four-step to-do list:

  1. Make sure you have the following libraries installed (which you can do with pip): tweepy, matplotlib, and aylien-apiclient.
  2. Get API keys for Twitter:
  • Getting the API keys from Twitter Developer (which you can do here) is the most time-consuming part of this process, but this video can help you if you get lost.
  • What it costs & what you get: the free Twitter plan lets you download 100 Tweets per search, and you can search Tweets from the previous seven days. If you want to go beyond either of these limits, you’ll need to pay for the Enterprise plan ($$)
  3. Get API keys for AYLIEN:
  • To do the sentiment analysis, you’ll need to sign up for our Text API’s free plan and grab your API keys, which you can do here.
  • What it costs & what you get: the free Text API plan lets you analyze 30,000 pieces of text per month (1,000 per day). If you want to make more than 1,000 calls per day, our Micro plan lets you analyze 80,000 pieces of text for $49/month.
  4. Copy, paste, and run the script below!

 

The Python script

When you run this script it will ask you to specify what term you want to search Tweets for, and then to specify how many Tweets you want to gather and analyze.


import sys
import csv
import tweepy
import matplotlib.pyplot as plt

from collections import Counter
from aylienapiclient import textapi

if sys.version_info[0] < 3:
    input = raw_input

## Twitter credentials
consumer_key = "Your consumer key here"
consumer_secret = "your secret consumer key here"
access_token = "your access token here"
access_token_secret = "your secret access token here"

## AYLIEN credentials
application_id = "Your app ID here"
application_key = "Your app key here"

## set up an instance of Tweepy
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

## set up an instance of the AYLIEN Text API
client = textapi.Client(application_id, application_key)

## search Twitter for something that interests you
query = input("What subject do you want to analyze for this example? \n")
number = input("How many Tweets do you want to analyze? \n")

results = api.search(
    lang="en",
    q=query + " -rt",
    count=number,
    result_type="recent"
)

print("--- Gathered Tweets \n")

## open a csv file to store the Tweets and their sentiment
file_name = 'Sentiment_Analysis_of_{}_Tweets_About_{}.csv'.format(number, query)

with open(file_name, 'w', newline='') as csvfile:
    csv_writer = csv.DictWriter(
        f=csvfile,
        fieldnames=["Tweet", "Sentiment"]
    )
    csv_writer.writeheader()

    print("--- Opened a CSV file to store the results of your sentiment analysis... \n")

    ## tidy up the Tweets and send each to the AYLIEN Text API
    for c, result in enumerate(results, start=1):
        tweet = result.text
        tidy_tweet = tweet.strip().encode('ascii', 'ignore')

        if len(tidy_tweet) == 0:
            print('Empty Tweet')
            continue

        response = client.Sentiment({'text': tidy_tweet})
        csv_writer.writerow({
            'Tweet': response['text'],
            'Sentiment': response['polarity']
        })

        print("Analyzed Tweet {}".format(c))

## count the data in the Sentiment column of the CSV file
with open(file_name, 'r') as data:
    counter = Counter()
    for row in csv.DictReader(data):
        counter[row['Sentiment']] += 1

    positive = counter['positive']
    negative = counter['negative']
    neutral = counter['neutral']

## declare the variables for the pie chart, using the Counter variables for "sizes"
colors = ['green', 'red', 'grey']
sizes = [positive, negative, neutral]
labels = 'Positive', 'Negative', 'Neutral'

## use matplotlib to plot the chart
plt.pie(
    x=sizes,
    shadow=True,
    colors=colors,
    labels=labels,
    startangle=90
)

plt.title("Sentiment of {} Tweets about {}".format(number, query))
plt.show()

If you’re new to Python, text mining, or sentiment analysis, the next sections will walk you through the main parts of the script.

 

The script in detail

Python 2 & 3

With the migration from Python 2 to Python 3, you can run into a ton of problems working with text data (if you’re interested, check out a great summary of why by Nick Coghlan). One relevant change is that Python 3’s input() returns a string, whereas Python 2’s input() evaluates what you type as a Python expression, so these lines swap in raw_input() if you’re running Python 2.


if sys.version_info[0] < 3:
    input = raw_input

Input your search

The goal of this post is to make it as quick and easy as possible to analyze the sentiment of Tweets that interest you. This script does that by letting you easily change the search term and sample size every time you run it from the shell, using input().


query = input("What subject do you want to analyze for this example? \n")
number = input("How many Tweets do you want to analyze? \n")

Run your Twitter query

We’re grabbing the most recent Tweets relevant to your query, but you can change result_type to ‘popular’ if you want to mine only the most popular Tweets, or ‘mixed’ for a bit of both. You can see we’ve also decided to exclude Retweets, but you might decide that you want to include them (in our experience, Retweets add a lot of noise). You can check the full list of parameters here.

An important point to note here is that the Twitter API limits each search to 100 Tweets, and it doesn’t return an error message if you try to search for more than that. So if you input 500 Tweets, you’ll only have 100 Tweets to analyze, but the title of your visualization will still read ‘500 Tweets.’
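One way to keep the chart title honest is to clamp the requested count before searching. Here’s a small sketch, where clamp_tweet_count is a hypothetical helper (not part of the script above):

```python
# The Twitter API returns at most 100 Tweets per search, so cap the
# user's request up front; input() hands us the number as a string.
def clamp_tweet_count(number, limit=100):
    return min(int(number), limit)

print(clamp_tweet_count("500"))  # capped at 100
print(clamp_tweet_count("50"))   # under the limit, unchanged
```

You could then pass the clamped value both to count= and into the plot title, so both report the number of Tweets you’ll actually get.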


results = api.search(
    lang="en",
    q=query + " -rt",
    count=number,
    result_type="recent"
)

Open a CSV file for the Tweets & Sentiment Analysis

Writing the Tweets and their sentiment to a CSV file allows you to review the API’s analysis of each Tweet. First, we open a new CSV file and write the headers.


with open(file_name, 'w', newline='') as csvfile:
    csv_writer = csv.DictWriter(
        f=csvfile,
        fieldnames=["Tweet", "Sentiment"]
    )
    csv_writer.writeheader()

Tidy the Tweets

Dealing with text on Twitter can be messy, so we’ve included this snippet to tidy up the Tweets before you do the sentiment analysis. This means that your results are more accurate, and you also don’t waste your free AYLIEN credits on empty Tweets. 😉


for c, result in enumerate(results, start=1):
    tweet = result.text
    tidy_tweet = tweet.strip().encode('ascii', 'ignore')

    if len(tidy_tweet) == 0:
        print('Empty Tweet')
        continue

Write the Tweets & their Sentiment to the CSV File

You can see that actually getting the sentiment of a piece of text only takes a couple of lines of code, and here we’re writing the Tweet itself and the result of the sentiment analysis (positive, negative, or neutral) to the CSV file under the headers we already wrote. You’ll notice that we’re writing the Tweet as returned by the AYLIEN Text API instead of the Tweet we got from the Twitter API. Even though the two should be identical, writing the Tweet that the AYLIEN API returns reduces the potential for mismatches.

We’re also going to print something every time the script analyzes a Tweet.


response = client.Sentiment({'text': tidy_tweet})
csv_writer.writerow({
    'Tweet': response['text'],
    'Sentiment': response['polarity']
})

print("Analyzed Tweet {}".format(c))


If you want to include results on how confident the API is in the sentiment it detects in each Tweet, just add response['polarity_confidence'] to the row above and add a corresponding header when you’re opening your CSV file.
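Here’s a minimal, self-contained sketch of that change, using a made-up response dict and an in-memory buffer in place of a real client.Sentiment() call and CSV file (the field names mirror the ones used above):

```python
import csv
import io

# Stand-in for a real client.Sentiment() response; the values are invented.
response = {
    'text': 'Loving the new update!',
    'polarity': 'positive',
    'polarity_confidence': 0.98,
}

buffer = io.StringIO()  # in-memory stand-in for the CSV file
csv_writer = csv.DictWriter(
    f=buffer,
    fieldnames=["Tweet", "Sentiment", "Confidence"]  # extra header
)
csv_writer.writeheader()
csv_writer.writerow({
    'Tweet': response['text'],
    'Sentiment': response['polarity'],
    'Confidence': response['polarity_confidence']
})

print(buffer.getvalue())
```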

Count the results of the Sentiment Analysis

Now that we’ve got a CSV file with the Tweets we’ve gathered and their predicted sentiment, it’s time to visualize these results so we can get an idea of the overall sentiment at a glance. To do this, we’re just going to use Counter from Python’s standard collections module to count the number of times each sentiment polarity appears in the ‘Sentiment’ column.


with open(file_name, 'r') as data:
    counter = Counter()
    for row in csv.DictReader(data):
        counter[row['Sentiment']] += 1

    positive = counter['positive']
    negative = counter['negative']
    neutral = counter['neutral']

Visualize the Sentiment of the Tweets

Finally, we’re going to plot the results of the count above on a simple pie chart with matplotlib. This is just a case of declaring the variables and then using matplotlib to base the sizes, labels, and colors of the chart on these variables.


colors = ['green', 'red', 'grey']
sizes = [positive, negative, neutral]
labels = 'Positive', 'Negative', 'Neutral'

## use matplotlib to plot the chart
plt.pie(
    x=sizes,
    shadow=True,
    colors=colors,
    labels=labels,
    startangle=90
)

plt.title("Sentiment of {} Tweets about {}".format(number, query))
plt.show()


Go further with Sentiment Analysis

If you want to go further with sentiment analysis you can try two things with your AYLIEN API keys:

  • If you’re looking into reviews of restaurants, hotels, cars, or airlines, you can try our aspect-based sentiment analysis feature. This will tell you what sentiment is attached to each aspect of a Tweet – for example positive sentiment shown towards food but negative sentiment shown towards staff.
  • If you want sentiment analysis customized for the problem you’re trying to solve, take a look at TAP, which lets you train your own language model from your browser.
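As a taste of the aspect-based output, here’s the shape of result you can expect, with the per-aspect polarities pulled out. The response dict below is illustrative, not a real API result – check the Text API docs for the exact method name and fields:

```python
# Illustrative aspect-based sentiment result for a restaurant review;
# a real result would come from the Text API's aspect-based endpoint.
response = {
    'aspects': [
        {'aspect': 'food', 'polarity': 'positive'},
        {'aspect': 'staff', 'polarity': 'negative'},
    ]
}

for aspect in response['aspects']:
    print("{}: {}".format(aspect['aspect'], aspect['polarity']))
```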

Building a Sentiment Analysis Workflow for your Organization

This script is built to give you a snapshot of sentiment at the time you run it, so to keep abreast of any change in sentiment towards an organization you’re interested in, you should try running this script every day.

In our next blog, we’ll share a couple of simple updates to this script that set up a fully automated process to keep an eye on Twitter sentiment about anything you’re interested in.






