Data Science

As you may know, we recently launched a new service offering, our News API, and over the past week or so we’ve been using it to run some small experiments in analyzing news content.

We wanted to use the News API to collect and analyze popular news headlines. We set out to find both similarities and differences in the way two journalists write headlines for their respective news articles and blog posts. The two reporters we selected operate in, and write about, two very different industries/topics and have two very different writing styles:

  • Finance: Akin Oyedele of Business Insider, who covers market updates.
  • Celebrity: Carly Ledbetter of the Huffington Post, who mainly writes about celebrities.

 

Note: For a more technical, in-depth and interactive representation of this project, check out the Jupyter notebook we created. It includes sample code and more detailed descriptions of our approach.

The Approach

We set out some clear steps to follow in comparing the writings of our two selected authors:

  1. Collect news headlines from both of our journalists
  2. Create parse trees from collected headlines (we explain parse trees below!)
  3. Extract information from each parse tree that is indicative of the overall headline structure
  4. Define a simple sequence similarity metric to quantitatively compare any pair of headlines
  5. Apply the same metric to all headlines collected for each author to find similarity
  6. Use K-Means and t-SNE to produce a visual map of all the headlines so we can clearly see the differences between our two journalists

 

So what exactly are parse trees?

In linguistics, a parse tree is a rooted tree that represents the syntactic structure of a sentence according to some pre-defined grammar.

For example, with a simple sentence like “The cat sat on the mat”, a parse tree might look like this:

 

Thankfully, parsing our extracted headlines isn’t too difficult. We used the Pattern library for Python to parse the headlines and generate our parse trees.
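To give you a feel for what this looks like in code, here’s a minimal sketch of parsing a headline with Pattern and reading off its chunk types (the chunk-type sequence is what we later use as the headline’s structural fingerprint):

# Minimal sketch: parse a headline with the Pattern library and read
# off its chunk types (pip install pattern).
from pattern.en import parsetree

headline = "The cat sat on the mat"
tree = parsetree(headline)

for sentence in tree:
    # Each sentence is chunked into phrases (NP, VP, PP, ...); the
    # sequence of chunk types is a compact fingerprint of the
    # headline's structure.
    chunk_types = [chunk.type for chunk in sentence.chunks]
    print(chunk_types)  # e.g. ['NP', 'VP', 'PP', 'NP']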

 

Data

In total we gathered about 700 article headlines for each of our two journalists using the AYLIEN News API, which we then analyzed using Python. If you’d like to give it a go yourself, you can grab the pickled data files directly from the GitHub repository (link), or use the data collection notebook we prepared for this project.

First we loaded all the headlines for Akin Oyedele, then we created parse trees for all 700 of them, and finally we stored them, together with some basic information about each headline, in a single Python object.

Then, using a sequence similarity metric, we compared these headlines pairwise to build a similarity matrix.
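The exact metric we used is defined in the notebook; purely as an illustration, here is one way a sequence similarity over chunk-type sequences could be computed with Python’s difflib (an assumption for illustration, not necessarily our metric):

# Illustration only: one possible sequence similarity metric over
# chunk-type sequences, using difflib's SequenceMatcher.
from difflib import SequenceMatcher

import numpy as np

def similarity(seq_a, seq_b):
    # Ratio in [0, 1] describing how alike two chunk-type sequences are.
    return SequenceMatcher(None, seq_a, seq_b).ratio()

def build_similarity_matrix(chunk_sequences):
    # chunk_sequences: one chunk-type list per headline,
    # e.g. [['NP', 'VP', 'PP', 'NP'], ['NP', 'VP'], ...]
    n = len(chunk_sequences)
    matrix = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            score = similarity(chunk_sequences[i], chunk_sequences[j])
            matrix[i, j] = matrix[j, i] = score
    return matrix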

 

Visualizations

To visualize headline similarities for Akin, we generated a 2D scatter plot with the aim of placing similarly structured headlines close together in distinct groups.

To achieve this, we first reduced the dimensionality of our similarity matrix using t-SNE and applied K-Means clustering to find groups of similar headlines, then plotted the result with a nice visualization library. The steps, along with a short code sketch, are outlined below;

  • t-SNE to reduce the dimensionality of our similarity matrix from 700 down to 2
  • K-Means to identify 5 clusters of similar headlines and add some color
  • Bokeh to plot the actual chart
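Here’s a rough sketch of that pipeline, assuming `sim_matrix` is the 700×700 similarity matrix and `headlines` holds the matching titles (our notebook has the full version):

# Rough sketch of the visualization pipeline: t-SNE + K-Means + Bokeh.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool

# t-SNE expects distances, so convert similarities first.
distances = 1.0 - sim_matrix
coords = TSNE(n_components=2, metric="precomputed",
              init="random", random_state=42).fit_transform(distances)

# K-Means on the 2D coordinates to colour 5 groups of similar headlines.
labels = KMeans(n_clusters=5, random_state=42).fit_predict(coords)
palette = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd"]

source = ColumnDataSource(data=dict(
    x=coords[:, 0],
    y=coords[:, 1],
    color=[palette[label] for label in labels],
    title=headlines,
))

p = figure(title="Headline similarity map", tools="pan,wheel_zoom,reset")
p.scatter("x", "y", color="color", size=8, alpha=0.7, source=source)
p.add_tools(HoverTool(tooltips=[("headline", "@title")]))
show(p)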

 

The chart above shows a number of dense groups of headlines, as well as some sparse ones. Each dot on the graph represents a headline, as you can see when you hover over one in the interactive version. Similar titles are grouped together quite cleanly. Some of the groups that stand out are;

  • The circular group left of center typically consists of short, snappy stock update headlines such as “Viacom is crashing”
  • The large circular group on the top right consists mostly of announcement-style headlines, such as “Here come the…” formats.
  • The small green circular group towards the bottom left contains headlines that use very similar phrases, such as “Industrial production falls more than expected” or “ADP private payrolls rise more than expected”.

 

Comparing the two authors

By repeating the process for our second journalist, Carly Ledbetter, we were then able to compare both authors and see how many common patterns exist between the two in terms of how they write their headlines.

We observed that roughly 50% (347/700) of the headlines had a similar structure.


Here we can see the same dense and sparse patterns, as well as groups of points that are somewhat unique to each author, or shared by both authors. The yellow dots represent our celebrity-focused author and the blue dots our finance-focused author.

  • The bottom right cluster is almost exclusive to the first author, as it covers the short financial/stock report headlines such as “Here comes CPI”, but it also covers some headlines from the second author, such as “There’s Another Leonardo DiCaprio Doppelgänger”. The same could be said about the top middle cluster.
  • The top right cluster mostly contains single-verb headlines about celebrities doing things, such as “Kylie Jenner Graces Coachella With Her Peachy Presence” or “Kate Hudson Celebrated Her Birthday With A Few Shirtless Men” but it also includes market report headlines from the first author such as “Oil rig count plunges for 7th straight week”.

 

Conclusion and future work

In this project we’ve shown how you can retrieve and analyze news headlines, evaluate their structure and similarity, and visualize the results on an interactive map.

While we were quite happy with the results and found them quite interesting, there were some areas we thought could be improved. Some of the weaknesses of our approach, and ways to improve them, are:

  • Using entire parse trees instead of just the chunk types
  • Using a tree or graph similarity metric instead of a sequence similarity one (ideally a linguistically aware one, too)
  • Better pre-processing to identify and normalize Named Entities, etc.

Next up..

In our next post, we’re going to study the correlations between various headline structures and external metrics like the number of shares and likes on social media platforms, and see if we can uncover any interesting patterns. We can hazard a guess already that the short, snappy celebrity-style headlines will probably get the most shares and reach on social media, but there’s only one way to find out.

If you’d like to access the data used or see the sample code, head over to our Jupyter notebook.

 




Text Analysis API - Sign up





You asked for it and we’re delighted to bring it to you – our Aspect-Based Sentiment Analysis endpoint is now live within our Text Analysis API. In this post we will show you the benefits of our latest feature and run through some real-world examples from a variety of industries.

New to Sentiment Analysis? No problem. Let’s quickly get you up to speed;

What is Sentiment Analysis?

Sentiment Analysis is used to detect positive or negative polarity in text. Also known as opinion mining, sentiment analysis is an area of text analysis and natural language processing (NLP) research that is growing in popularity as a multitude of use cases emerge. Here are a few examples of questions that sentiment analysis can help answer in various industries;

  • Brands – are people speaking positively or negatively when they mention my brand on social media?
  • Hospitality – what percentage of online reviews for my hotel/restaurant are positive/negative?
  • Finance – are there negative trends developing around my investments, partners or clients?
  • Politics – which candidate has received the most positive media coverage over the past week?

We could go on and on with an endless list of examples but we’re sure you get the gist of it. Sentiment Analysis can help you understand the split in opinion from almost any body of text, website or document – an ideal way to uncover the true voice of the customer.

Aspect-Based Sentiment Analysis

While sentiment analysis provides fantastic insights and has a wide range of real-world applications, the overall sentiment of a piece of text won’t always pinpoint the root cause of an author’s opinion. When analyzing larger pieces of text, an overall polarity score sometimes doesn’t give a true reflection of whether that text is positive, negative or neutral.

Certain types of documents, such as customer feedback or reviews, may contain fine-grained sentiment about different aspects (e.g. a product or service) that are mentioned in the document. For instance, a review about a hotel may contain opinionated sentences about its staff, beds and location. This information can be highly valuable for understanding customers’ opinion about a particular service or product.

This is where Aspect-Based Sentiment Analysis (ABSA) comes in. With ABSA, you can dive deeper and analyze the sentiment in a piece of text toward industry-specific aspects.

The whole idea behind Aspect-Based Sentiment Analysis is to provide a way for our users to extract specific aspects from a piece of text and determine the sentiment towards each aspect individually. Our customers use it to analyze reviews, Facebook comments, tweets and customer feedback forms to determine not just the sentiment of the overall text, but what, in particular, the author likes or dislikes in that text.

 

Supported domains (industries)

To begin with, we are focusing on four popular domains, for which we’ve built specific sentiment models;

  • Airlines
  • Cars
  • Restaurants
  • Hotels

Watch this space, however, as we will be increasing this list going forward.
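To give you a feel for what a call looks like, here’s a rough sketch using Python’s requests library, with the domain passed as a parameter. The endpoint path and response fields shown here are assumptions for illustration; our documentation has the authoritative details:

# Rough sketch of calling the ABSA endpoint (path and response layout
# are assumptions -- see the Text Analysis API docs for exact details).
import requests

APP_ID = "YOUR_APP_ID"
APP_KEY = "YOUR_APP_KEY"

def analyze_aspects(text, domain="restaurants"):
    # One sentiment model per supported domain (airlines, cars,
    # restaurants, hotels).
    response = requests.get(
        "https://api.aylien.com/api/v1/absa/" + domain,  # assumed path
        headers={
            "X-AYLIEN-TextAPI-Application-ID": APP_ID,
            "X-AYLIEN-TextAPI-Application-Key": APP_KEY,
        },
        params={"text": text},
    )
    response.raise_for_status()
    return response.json()

result = analyze_aspects("The food was delicious but the service was slow.")
for aspect in result.get("aspects", []):  # assumed field name
    print(aspect)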

Let’s take a look at a real-life example to show you the level of insight that can be achieved through ABSA;

Example 1: Restaurant review

As a restaurant owner, let’s say I have received 700 online reviews that result in an overall review score of 8/10. I’m quite happy with this, but clearly there is room for improvement. But where exactly? By running the customer reviews through our ABSA endpoint, I can immediately start to see what my customers like and, more importantly, dislike. As an example, we analyzed a 5-star restaurant review and got the following results;

Review: We visited here during our recent trip to Sydney and overall we were very impressed. We decided to make a reservation online, which was quick and easy with instant confirmation. It was nice to be able to view the table layout and select our own online. The location is spectacular with stunning views of the harbour and Opera House. It truly was amazing. Despite this, however, the restaurant was only about 25% full and so the atmosphere was a bit flat. Perhaps this was to our benefit as we received top class service from our waiter, Brandon, who was not only friendly and funny but extremely knowledgeable when it came to food and wine pairings. Speaking of wine, the list was extensive – we loved it – and it took us what seemed like an hour to eventually decide on a local Shiraz. Now on to the most important aspect, the food. Our seafood starters were delicious, as were our fillet steak mains. The one and only real disappointment was the dessert which was served with no real imagination and looked like it had been purchased yesterday at the local grocery store. All in all, my favourite Sydney restaurant so far. So many positives and really good value too. Highly recommend!

 

ABSA Result:


As you can see from the results above, the ABSA endpoint automatically pulls industry-specific aspects (such as food, drinks, reservations and value), performs sentiment analysis on each aspect and gives sample sentences to show where each score was derived from.

Although this review was extremely positive and received a 5-star rating, we can still uncover certain aspects that may be in need of improvement. Our customer was clearly not impressed with their dessert and also found the general atmosphere of the restaurant to be a bit flat.

Analyzing the other 699 online reviews will start to reveal clear, actionable trends: trends that would be invaluable to restaurant owners, helping them pinpoint problem areas in their service offering and giving them a competitive advantage by uncovering issues before they become serious or a bad reputation develops.
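As a sketch of what that aggregation could look like in code (reusing the hypothetical analyze_aspects helper and the assumed aspect/polarity fields from above):

# Sketch: count positive/negative mentions per aspect across many reviews
# (field names follow the assumptions above).
from collections import Counter, defaultdict

def aggregate_aspects(reviews, domain="restaurants"):
    counts = defaultdict(Counter)
    for review in reviews:
        result = analyze_aspects(review, domain=domain)
        for aspect in result.get("aspects", []):
            counts[aspect["aspect"]][aspect["polarity"]] += 1
    return counts

# e.g. counts["desserts"] -> Counter({"negative": ..., "positive": ...})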

Example 2: Hotel Review

We found a not-so-great hotel review online and our ABSA endpoint produced the following results;

Review: 4 star, REALLY? This was the worst hotel I have ever stayed in. I wouldn’t even give it one star.  Where do I start.. the staff were so rude and dismissive from the very beginning. They had no interest in actually serving us whatsoever! The bellhop, who seemed very upset when we approached him and disturbed his YouTube viewing, slammed our room door and said something angrily in Spanish when we refused to tip him (he dropped my shoes into a water fountain en route to our room!!). The view from our room was nothing like it was described to us. It was basically a building site with some overgrown grass. Bizarre location kind of in the middle of nowhere. We also requested a double room, but surprise surprise they gave us two singles. To be fair, this issue was rectified immediately and the double beds was actually very comfortable. Fool me twice, shame on me.. we decided to eat in the hotel restaurant and it was more dreadful service from the staff. The food was laughable. My 4 year old daughter could do better. For the price we paid for this hotel, I was disgusted with how badly it was ran.  I just hope that my review, and those of others, will prevent people making the same mistake I did in booking this place. Oh, and they had no Wifi, unbelievable!

 

ABSA Result:


Again, you can see the industry-specific aspects being analyzed. In this case, the ABSA endpoint looked for hotel-related aspects such as beds, comfort, food/drinks, staff and room amenities.

Unlike our well-reviewed restaurant, our hotel has a lot more work to do than simply improving its dessert offering! Despite the overwhelmingly negative sentiment of this review, one positive emerges: the quality of the beds. This really highlights the benefit of aspect-based sentiment analysis: it gives a far more granular view than a single overall positive/neutral/negative score.

Example 3: Social Posts

While the ABSA endpoint works really well when analyzing longer documents such as reviews, it can also be effective in analyzing social posts. We all know that one of the most common places today for someone to connect with a brand is through social media. Brands and organizations know this and spend a lot of time and effort making sense of social interactions and conversations in order to understand the voice of their customers.

Take for example an airline’s Twitter feed. Diving into social posts to determine the overall sentiment of the social interactions airlines have with customers is useful, but the ability to look at these at a more granular level to understand what in particular customers like or dislike about their overall service is invaluable.

Take the following tweet as an example;


Result:

 

So in that short snippet of text, we managed to extract three different aspects and determine that the tweeter was less than happy with the cabin crew, was disappointed about not having his luggage, but was somewhat pleased that his plane arrived on time.

Conclusion

Naturally not every online review or social post will touch on every single aspect available. Some will be short and concise while others will be long and detailed. Thankfully, our Text Analysis API enables you to analyze text data at scale so you can see the bigger picture in terms of which aspects of your offering people are talking about and what in particular they like or dislike.

Making calls to the ABSA endpoint is really simple. If you’re an existing user, head over to our documentation to learn about the different domains and aspects we cover and to grab some sample code. If you’re new to the AYLIEN Text Analysis API, grab your free API key and start calling the API.

 




Text Analysis API - Sign up





It’s hard to believe that it has been almost a month since we launched our News API. We have been thrilled with the feedback we have been receiving from our users and wanted to start sharing some useful hints and tips for using the API. Here are four to begin with :-).

1. Advanced search using boolean operators

To help you find exactly what you’re looking for when searching for articles, the News API supports standard boolean search operators such as AND, OR, NOT and brackets in full-text fields such as Title, Body or Text:

  • AND: the AND operator can be used to combine multiple search terms that must appear together. Example: `fashion AND shoes` shows results that include both “fashion” and “shoes”.
  • OR: the OR operator can be used to combine multiple search terms that may or may not appear together. Example: `fashion OR shoes` shows results that include “fashion”, “shoes” or both.
  • NOT: the NOT or “-” operator can be used to exclude a word or phrase. Example: `fashion NOT shoes` or `fashion -shoes` shows results that include “fashion” but not “shoes”.
  • “ ”: double quotes can be used to enforce exact phrase matches. Example: “The Batman Begins” will only match results that include the entire phrase.
  • ( ): brackets can be used to group phrases together. Example: `fashion and (shoes or dresses)` matches results that either contain “fashion” and “shoes” OR “fashion” and “dresses”.

Let’s find articles that include “Trump” or “Cruz” in their title, and don’t include “Clinton” or “Sanders” in their body:

 

API Call:

curl -X GET --header "Accept: application/json" --header "X-AYLIEN-NewsAPI-Application-ID: YOUR_APP_ID" --header "X-AYLIEN-NewsAPI-Application-Key: YOUR_APP_KEY" "https://api.newsapi.aylien.com/api/v1/stories?title=Trump%20OR%20Cruz&body=-Clinton%20-Sanders&language%5B%5D=en&sort_by=published_at&sort_direction=desc&cursor=*&per_page=10"
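If you prefer Python to curl, here’s the equivalent call using the requests library (assuming the response wraps results in a `stories` list):

# Python equivalent of the curl call above, using requests.
import requests

HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-NewsAPI-Application-Key": "YOUR_APP_KEY",
}

params = {
    "title": "Trump OR Cruz",      # boolean search in the title
    "body": "-Clinton -Sanders",   # exclude both terms from the body
    "language[]": "en",
    "sort_by": "published_at",
    "sort_direction": "desc",
    "cursor": "*",
    "per_page": 10,
}

response = requests.get("https://api.newsapi.aylien.com/api/v1/stories",
                        headers=HEADERS, params=params)
for story in response.json()["stories"]:  # assumed response field
    print(story["title"])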

 

Results

 

2. Search article or headline

The News API allows you to search for articles based on the headline, body content or both. Here’s an example:

To search for articles that mention “Messi” in their title and “Barcelona” in their body we can run the following query:

curl -X GET --header "Accept: application/json" --header "X-AYLIEN-NewsAPI-Application-ID: YOUR_APP_ID" --header "X-AYLIEN-NewsAPI-Application-Key: YOUR_APP_KEY" "https://api.newsapi.aylien.com/api/v1/stories?title=Messi&body=Barcelona&language%5B%5D=en&sort_by=published_at&sort_direction=desc&cursor=*&per_page=10"

 

Results

 

Other things to try:

You can use the above filters together with boolean search (our previous tip) to run more precise queries. For example, search for articles that mention “Donald Trump” in their title, and don’t include any mentions of “Clinton” in their body.

 

3. Build your own recommendation system and generate related articles based on your own content

Chances are that you have seen the popular “From around the web” feature popping up across a multitude of blogs and news sites on the Internet. Wish you could build your own, fully customized version of that? Well, look no further.

 

 

Using the Related Stories endpoint of the News API, you can generate recommendations not only for articles retrieved through the News API, but also for any other piece of text. All you need is a title and a description. Let’s look at an example:

Let’s assume we want to show recommended articles on the Wikipedia article for Hillary Clinton. We can extract the title and a short description from the page with a few lines of code, or by using the AYLIEN Article Extraction API. From the page we can extract the following:

 

 

Title: Hillary Clinton

Description: Hillary Diane Rodham Clinton /ˈhɪləri daɪˈæn ˈrɒdəm ˈklɪntən/ (born October 26, 1947) is an American politician. She is a candidate for the Democratic nomination for President of the United States in the 2016 election. She was the 67th United States Secretary of State from 2009 to 2013. From 2001 to 2009, Clinton served as a United States Senator from New York. She is the wife of the 42nd President of the United States Bill Clinton, and was First Lady of the United States during his tenure from 1993 to 2001.

Now let’s feed this to the Related Stories endpoint and see what we get back:

API Call:

curl -X POST --header "Content-Type: application/x-www-form-urlencoded" --header "Accept: application/json" --header "X-AYLIEN-NewsAPI-Application-ID: YOUR_APP_ID" --header "X-AYLIEN-NewsAPI-Application-Key: YOUR_APP_KEY" -d "story_title=Hillary%20Clinton&story_body=Hillary%20Diane%20Rodham%20Clinton%20%2F%CB%88h%C9%AAl%C9%99ri%20da%C9%AA%CB%88%C3%A6n%20%CB%88r%C9%92d%C9%99m%20%CB%88kl%C9%AAnt%C9%99n%2F%20(born%20October%2026%2C%201947)%20is%20an%20American%20politician.%20She%20is%20a%20candidate%20for%20the%20Democratic%20nomination%20for%20President%20of%20the%20United%20States%20in%20the%202016%20election.%20She%20was%20the%2067th%20United%20States%20Secretary%20of%20State%20from%202009%20to%202013.%20From%202001%20to%202009%2C%20Clinton%20served%20as%20a%20United%20States%20Senator%20from%20New%20York.%20She%20is%20the%20wife%20of%20the%2042nd%20President%20of%20the%20United%20States%20Bill%20Clinton%2C%20and%20was%20First%20Lady%20of%20the%20United%20States%20during%20his%20tenure%20from%201993%20to%202001.&boost_by=recency&per_page=5" "https://api.newsapi.aylien.com/api/v1/related_stories"
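The same request in Python, with the form fields spelled out rather than URL-encoded (assuming the response exposes a `related_stories` list):

# Python version of the Related Stories call above.
import requests

HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-NewsAPI-Application-Key": "YOUR_APP_KEY",
}

data = {
    "story_title": "Hillary Clinton",
    "story_body": ("Hillary Diane Rodham Clinton (born October 26, 1947) "
                   "is an American politician. She is a candidate for the "
                   "Democratic nomination for President of the United "
                   "States in the 2016 election. ..."),
    "boost_by": "recency",  # or "popularity", as noted below
    "per_page": 5,
}

response = requests.post(
    "https://api.newsapi.aylien.com/api/v1/related_stories",
    headers=HEADERS, data=data)
for story in response.json()["related_stories"]:  # assumed response field
    print(story["title"])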

And we get the following articles back:

 

 

Other things to try:

You can use all the regular story filters, such as Sentiment, Source or Category, to further narrow down the scope of the returned stories, e.g. to get related articles with a positive tone, or related stories from a specific source, and so on.

In the above example we boosted the related stories by recency. You could also boost them by `popularity` to give a higher weight to how popular each story is on social media when sorting the related stories.

 

4. Trending articles on social media

The News API constantly monitors social media to measure the popularity of each and every article that it retrieves and indexes. This information can be used in two ways:

1. To sort articles by social popularity when searching for articles (using any of the /stories, /related_stories or /coverages endpoints), which can be done by setting the `sort_by` parameter to `social_shares_count`.

2. To profile each article’s popularity over time by looking at its `social_shares_count` property.

Let’s get the most popular article that mentions David Bowie from the last 60 days by setting `sort_by` to `social_shares_count` and `per_page` to 1:

 

API Call:

curl -X GET --header "Accept: application/json" --header "X-AYLIEN-NewsAPI-Application-ID: YOUR_APP_ID" --header "X-AYLIEN-NewsAPI-Application-Key: YOUR_APP_KEY" "https://api.newsapi.aylien.com/api/v1/stories?title=%22David%20Bowie%22&language%5B%5D=en&published_at.start=NOW-60DAYS&published_at.end=NOW&categories.confident=true&cluster=false&cluster.algorithm=lingo&sort_by=social_shares_count&sort_direction=desc&cursor=*&per_page=1"
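And the same in Python, this time also peeking at the story’s `social_shares_count` property (its exact internal layout is an assumption here; inspect the returned JSON for the authoritative structure):

# Python version of the call above; also peeks at the social share data.
import requests

HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-NewsAPI-Application-Key": "YOUR_APP_KEY",
}

params = {
    "title": '"David Bowie"',
    "language[]": "en",
    "published_at.start": "NOW-60DAYS",
    "published_at.end": "NOW",
    "sort_by": "social_shares_count",
    "sort_direction": "desc",
    "per_page": 1,
}

response = requests.get("https://api.newsapi.aylien.com/api/v1/stories",
                        headers=HEADERS, params=params)
story = response.json()["stories"][0]  # assumed response field
print(story["title"])
print(story["social_shares_count"])  # per-network share counts over time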

 

We get the following article that has about 3,000 shares on Facebook:

 

 

We continuously profile articles for their social performance, so if you look under the hood at the JSON, you will see something like the following:

 

 

This shows us how the popularity of this article has changed over time.

 

Conclusion

So there you go: four things that you may not have known you could do with the AYLIEN News API. Check back regularly, as we will be sharing more hints, tips and cool use cases.

Want to try the News API for yourself? Click the image below for a 14-day free trial.





News API - Sign up





Data Science

Dubbed the biggest leak of its kind ever, bigger than WikiLeaks and Edward Snowden’s leak in 2013, the Panama Papers leak has shed light on how the world’s rich and famous move and hide money across the globe.

 

 

The information, released by the International Consortium of Investigative Journalists following a tip-off from an anonymous source connected to the German newspaper Süddeutsche Zeitung, is a cache of over 11 million documents that show how money is laundered through offshore accounts and entities.


The documents, which are not entirely public yet, show how the world’s super rich exploit international tax regimes in order to hide money and assets.

At the center of the controversy is a Panamanian law firm, Mossack Fonseca, who helped clients route and move money all over the world. Those incriminated in the leak range from world leaders such as Vladimir Putin to soccer stars such as Lionel Messi.

If you want to follow the action live, check out this reddit live page.

So why are we so interested in the leak at AYLIEN?

Well, for one, we’re data geeks and the thought of mining a 2.6TB leak greatly appeals to us 😉. Apart from the fact that we care, we also regularly use world events like this to showcase our technology and solutions.

When the news broke, we thought: wouldn’t it be cool to mine the reports? We wanted to look for interesting data points like the people mentioned, organizations, locations, topics discussed and so on. In total, the leak is thought to name over 210,000 entities, with documents dating as far back as the 1970s in a collection of emails, contracts, transcripts, photos and even passports. That’s a lot of interesting data to mine!

The actual documents haven’t been released yet, but there has been a massive amount of chatter on the subject across news outlets, blogs and social media. Using our News API, we decided to concentrate on what news outlets were saying by mining news content from across the world, with the goal of extracting the insights mentioned above.

So, what did we do?

We started by building a very simple search using our News API to scan thousands of monitored news sources for articles related to the leak. In total, we collected over 4,000 articles, which were then indexed automatically using the text analysis capabilities of the News API.

This meant that key data points in those articles were identified and indexed to be used for further analysis:

  • Keywords
  • Entities
  • Concepts
  • Topics

Search used:

“Panama Leaks” OR “panama papers” OR “Mossack Fonseca” (Try it in our demo)

API call:

https://api.newsapi.aylien.com/api/v1/time_series?period=%2B1HOUR&text=%22panama+leaks%22+or+%22panama+paper%22+or+%22Mossack+Fonseca%22&published_at.start=NOW-3DAYS&published_at.end=NOW

Note: The visualizations below were generated at the time of the analysis. Given the rate at which new content is surfacing, we plan on updating them regularly.

With the stories gathered, we dove into them to extract any interesting data points the API could surface.

 

Analysis

The first thing we looked at was how the story developed over the few days following the original story breaking. You can see the news chatter around the topic developing on the evening of April 3rd, when The Guardian ran their original story: “Revealed: the $2bn offshore trail that leads to Vladimir Putin”.

We used the Time Series endpoint in the News API to graph the volume of stories over time.
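As a rough sketch, here’s how that volume data can be pulled and plotted in Python (assuming the Time Series endpoint returns a `time_series` list of points with `published_at` and `count` fields):

# Sketch: fetch story volume in 1-hour buckets and plot it.
import requests
import matplotlib.pyplot as plt

HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-NewsAPI-Application-Key": "YOUR_APP_KEY",
}

params = {
    "text": '"panama leaks" OR "panama papers" OR "Mossack Fonseca"',
    "period": "+1HOUR",
    "published_at.start": "NOW-3DAYS",
    "published_at.end": "NOW",
}

response = requests.get("https://api.newsapi.aylien.com/api/v1/time_series",
                        headers=HEADERS, params=params)
points = response.json()["time_series"]  # assumed response field

plt.plot([p["published_at"] for p in points],
         [p["count"] for p in points])
plt.xlabel("Time (1-hour buckets)")
plt.ylabel("Number of stories")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()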

The graph shows how the volume of stories increases as the story spreads and other timezones come online. We’ve noted some of the more prominent stories by choosing the ones with the highest volume of social shares, which can be easily extracted with our API.

 

Volume over time:

What, Who and Where?

The second thing we wanted to look at was what was being discussed: which individuals, organizations and countries in particular were mentioned in the articles, and how often they were mentioned.

We used the API calls below to extract any mentions of Entities and Concepts in the indexed articles. The main entities we focused on included keywords, people, organizations and countries.

API Call Entities:

https://api.newsapi.aylien.com/api/v1/trends?text=%22panama%20leaks%22%20or%20%22panama%20paper%22%20or%20%22Mossack%20Fonseca%22&language%5B%5D=en&published_at.start=NOW-3DAYS&published_at.end=NOW&field=entities.body.links.dbpedia

API Call Keywords:

https://api.newsapi.aylien.com/api/v1/trends?text=%22panama%20papers%22%20OR%20%22panama%20leaks%22%20OR%20%22Mossack%20Fonseca%22&language%5B%5D=en&published_at.start=2016-04-04T06%3A00%3A00Z&published_at.end=2016-04-04T12%3A00%3A00Z&categories.confident=true&field=keywords
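In Python, the same Trends queries look roughly like this (assuming the endpoint returns a `trends` list of `value`/`count` pairs):

# Sketch: most frequently mentioned entities (or keywords) for the search.
import requests

HEADERS = {
    "X-AYLIEN-NewsAPI-Application-ID": "YOUR_APP_ID",
    "X-AYLIEN-NewsAPI-Application-Key": "YOUR_APP_KEY",
}

params = {
    "text": '"panama leaks" OR "panama papers" OR "Mossack Fonseca"',
    "language[]": "en",
    "published_at.start": "NOW-3DAYS",
    "published_at.end": "NOW",
    "field": "entities.body.links.dbpedia",  # or "keywords"
}

response = requests.get("https://api.newsapi.aylien.com/api/v1/trends",
                        headers=HEADERS, params=params)
for trend in response.json()["trends"][:10]:  # assumed response field
    print(trend["value"], trend["count"])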

 

Keywords:

Countries:

People:

Organizations:

Trends:

The final piece of analysis, while quite basic, was surprisingly interesting. Using the News API’s Trends endpoint, we looked at how the entities and concepts extracted developed over time as more and more stories broke.

It’s clear that the likes of Vladimir Putin were implicated from the start, but it’s interesting to see how the likes of David Cameron, Lionel Messi and Xi Jinping were only mentioned following further investigation and coverage.

Entities:


We’re planning on running some further analysis as the story develops. Stay tuned to the blog for updates to the data visualizations and further blog posts.

If you’d like to try it for yourself, just create your free News API account and start collecting and analyzing stories. Our News API is the most powerful way of searching, sourcing and indexing news content from across the globe. We crawl and index thousands of news sources every day and analyze their content using our NLP-powered Text Analysis Engine to give you an enriched and flexible news data source.

 




News API - Sign up



