## Introduction

In this tutorial we’re going to walk you through using the “Text Analysis by AYLIEN” Extension for RapidMiner to build a “News Analyzer” that monitors and analyzes articles from one or more RSS feeds.

If you’re new to RapidMiner, or it’s your first time using the Text Analysis Extension, you should read Part 1 of our Getting Started Blog, which takes you through the installation process. Also, if you haven’t got an AYLIEN account, which you’ll need to use the Extension, you can grab one here.

So, here’s what we’re going to do:

2. Extract the main body of text and Title from the article with the Extract Article Operator
3. Analyze and categorize these articles using the Categorize Operator
4. Extract Entities (mentions of People, Places, Organizations etc.) from the article using the Extract Entities Operator
5. Visualize our results and make them more consumable and understandable

Please note: This tutorial assumes you have the Web Mining Extension for RapidMiner installed. You can download and install the Extension through the RapidMiner Marketplace.

## Extracting the articles and titles

To extract the relevant pieces of text from the URLs collected we can use the Extract Article Operator. This will pull the main body of text, the title and any image present directly from the URL.

To prepare our extracted text for analysis we use a Data to Documents Operator. This transforms the dataset of text into a collection of documents, making it easier to categorize.

As part of this process, you need to specify which column(s) in the ExampleSet contain the text you want to create a Document from.

The first thing we’re going to do with the extracted text is try to get a high-level understanding of what it’s about by categorizing it against a particular taxonomy, in this case the IAB QAG taxonomy.

We’ll then add a Documents to Data Operator, which transforms our documents back into a dataset, making the results a lot more manageable.
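The round trip between a dataset and a collection of documents can be pictured with a small sketch. This is only a conceptual analogy in Python, not the RapidMiner Operators themselves: the `Document` class and both helper functions are hypothetical names, and a dataset is assumed to be a plain list of rows.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a document: the text plus its other columns."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def data_to_documents(rows: List[Dict[str, str]], text_column: str) -> List[Document]:
    """Turn each row into a document, using the chosen column as the text."""
    documents = []
    for row in rows:
        # Every column except the text column travels along as metadata.
        metadata = {key: value for key, value in row.items() if key != text_column}
        documents.append(Document(text=row[text_column], metadata=metadata))
    return documents


def documents_to_data(documents: List[Document], text_column: str = "text") -> List[Dict[str, str]]:
    """Turn documents back into rows, restoring the text column."""
    return [{**doc.metadata, text_column: doc.text} for doc in documents]
```

The key point is the same one made above: you have to say which column holds the text, and everything else is carried alongside so that nothing is lost when converting back.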

## Extracting Entities

Finally, the last piece of analysis we’ll do on the text is extract any mention of an Entity (Keywords, People, Places, Organizations, % values, \$ values etc.) using an Extract Entities Operator.

The entire Process will look like the one below when it’s fully built:

## Results

Running the process is simple: just hit the play button. Your results will be displayed in an ExampleSet tab like the one below. Each row will contain the extracted text and title, its appropriate categories, and any Entities that were extracted, separated out into columns.

RapidMiner lets you display and visualize the results of your Process really easily using simple charts and visualizations like the ones below, which can all be created using the Charts widget on the left-hand side of your results display.

We put together a simple pie-chart below visualizing the categories of the articles extracted with our News Analyzer.

Pie-chart showing IAB categories of Articles analyzed:

The Data Junkies among us, however, may want to export their results and visualize them using something else, like Tableau for example (there’s an integration on the way for that, by the way).

So, there you have it, that’s how you build a News Analyzer Process in RapidMiner using “Text Analysis by AYLIEN” and the Read RSS Feed Operator. We’ve also put together a tutorial guide to analyzing the Sentiment of tweets, which you may also find interesting.

We’ve also created a repository for sample Processes that we’ll be adding to on a regular basis. It will be a collection of use-case-focused RapidMiner Processes that can be downloaded and imported directly into RapidMiner. You can find more info in the documentation section of our website.

## Introduction

“Text Analysis by AYLIEN” is an Extension made up of different Operators that allow you to analyze and make sense of textual data from within RapidMiner. The different Operators contained in “Text Analysis by AYLIEN” include the following:

• Sentiment Analysis
• Entity Extraction
• Language Detection
• Hashtag Suggestion
• Related Phrases

Getting started with the extension is easy. To run you through the setup and how to use it we’re going to do the following:

1. Install the Extension and get it up and running
2. Create and run a simple RapidMiner Process that will analyze the Sentiment of a sample piece of text from a Document
3. Create and run a Process that analyzes the Sentiment of text using an ExampleSet instead of a Document
4. Create and run a Process that uses the Search Twitter Operator to collect and analyze Tweets

## Installation

“Text Analysis by AYLIEN” can be found in the RapidMiner Marketplace. You can navigate directly to the Marketplace from within RapidMiner Studio by using the side panel.

Once you’ve installed the Text Analysis extension, you can find its Operators from within RapidMiner by simply searching for AYLIEN. Here you’ll see the list of Operators that were installed as part of the Text Analysis Extension.

## Credentials and Connecting

The first thing we need to do before we can start analyzing text is make sure we’re connected to the AYLIEN API. You can configure your connections under Settings, in Manage Connections.

To connect to the AYLIEN API you need an App ID and API Key. If you haven’t already got yours you can grab one for free here.

Create a new connection of type “Aylien Text Analysis Connection”, add your credentials (App ID and API Key) as shown below and hit “Save all changes”.

Now we’re pretty much good to go, and can reuse the connection we just created in all Text Analysis Operators.

Note: Depending on your subscription plan, you are subject to per-minute and daily rate limits (60 calls/minute and 1,000 calls/day on the Free plan). Once you reach your per-minute limit, the Operator will wait for a few seconds before running the subsequent batches, repeating until all the rows have been analyzed. If you reach your daily limit, you’ll get an alert in RapidMiner, but the already analyzed documents will be gracefully returned.
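To make the behaviour described in the note more concrete, here is a minimal Python sketch of the general idea: pause when a per-minute budget is used up, and return whatever has been analyzed so far if the daily limit is hit. It is not the Extension’s actual implementation. The `analyze` callable, the `DailyLimitReached` exception and the full 60-second wait are all assumptions made for illustration.

```python
import time
from typing import Callable, Iterable, List


class DailyLimitReached(Exception):
    """Hypothetical error raised by `analyze` when the daily quota is exhausted."""


def analyze_with_rate_limit(
    rows: Iterable,
    analyze: Callable,
    per_minute: int = 60,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List:
    """Call `analyze` on each row, pausing whenever the per-minute budget is spent."""
    results = []
    window_start = clock()
    calls_in_window = 0
    for row in rows:
        if calls_in_window >= per_minute:
            # Budget for this window is spent: wait out the rest of the minute.
            elapsed = clock() - window_start
            if elapsed < 60:
                sleep(60 - elapsed)
            window_start = clock()
            calls_in_window = 0
        try:
            results.append(analyze(row))
        except DailyLimitReached:
            # Stop early, but keep everything analyzed so far.
            break
        calls_in_window += 1
    return results
```

The `sleep` and `clock` parameters are injectable only so the function can be exercised without really waiting. The window logic is deliberately conservative and simple rather than an exact sliding window.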

## Example 1. – Document Sentiment Analysis

As shown below the first thing we do is add an Analyze Sentiment (Document) Operator to our Process.

In this case, we’ve also added a Create Document Operator as shown in the screenshot below, where we’ll type or paste the text we want to analyze.

Add the text you want to analyze to the document. One of our teammates is a big softy and loves puppies, so for the purpose of this tutorial we’re just using a simple quote from him, “I love puppies”, to demonstrate the extension in action.

So now we’ve completed the bare bones of our Process. It’s set up to analyze the sentiment of a piece of text we’ve written in a document.

To run the Process, hit play, but before you do, make sure you connect your Operators to each other and the results ports (we’re always forgetting to do this!) and also select the Connection you created earlier, in the Analyze Sentiment Operator.

Your results will be displayed in a results tab, similar to below.

The next thing we want to show you is how you can analyze text from an ExampleSet rather than a Document. To do this we still use a Sentiment Analysis Operator, but the ExampleSet version instead of the Document version. We’ve bundled two versions of some of the Operators for your convenience, that you can choose from based on the data you’re analyzing. The two formats can be easily converted to one another using the Documents to Data and Data to Documents Operators.

## Example 2. – ExampleSet Sentiment Analysis

You’ll also need to add a Data Generation Operator, which will essentially create a basic ExampleSet with the text we want to analyze in it. Note that this could’ve been similarly obtained from a CSV file or a database.

Make sure they’re all connected again and hit run. In this case your results will look a little different and they’ll be stored in a table like below, with the Polarity and Subjectivity confidence scores listed first and the Polarity and Subjectivity results listed second.

The last thing we want to show you in our basic tutorial is how you can leverage other data sources and Operators as part of your Process or mashup. In this case, we’re going to combine the Search Twitter Operator with our Sentiment Operator to pull tweets into RapidMiner, analyze their sentiment and visualize the results.

## Example 3 – Using Search Twitter Operator

As we did in our previous example, we’ll use an Analyze Sentiment Operator and combine it with a Search Twitter Operator as shown in the screenshot below.

With the Search Twitter Operator you can create queries exactly as you would when using the Twitter search API. On the right-hand side of the screenshot below, you can see we’ve created a query that searches for the keyword “puppies”, removing retweets (-rt) and links (-http) to reduce noise and duplication. We’ve made sure the search only pulls in English tweets by setting the language parameter to “en”. We’ve also put a limit of 20 Tweets on our results, but it’s up to you how many you analyze. When building your search, however, keep in mind the rate limit of 60 calls/minute on our Free plan.
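The way the query is put together can be sketched in a few lines of Python. This only illustrates how the search string and settings described above fit together. The function name and the dictionary keys are hypothetical, not the Operator’s real parameter names.

```python
def build_search_query(keyword: str, exclude_retweets: bool = True, exclude_links: bool = True) -> str:
    """Combine a keyword with the exclusion terms used in this tutorial."""
    parts = [keyword]
    if exclude_retweets:
        parts.append("-rt")    # drop retweets to reduce duplication
    if exclude_links:
        parts.append("-http")  # drop tweets containing links to reduce noise
    return " ".join(parts)


# Hypothetical settings mirroring the ones chosen in the tutorial.
search_params = {
    "query": build_search_query("puppies"),
    "language": "en",  # English tweets only
    "limit": 20,       # stays well within 60 calls/minute on the Free plan
}
```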

Hit run and your results will be displayed in an ExampleSet similar to the one below.

You can create some simple visualizations to display your results by going to the “Charts” section on the left-hand side of the results screen. We created a very simple pie-chart which shows the distribution of positive, negative and neutral tweets.
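The numbers behind a chart like this are just a tally of the polarity column. As a rough sketch outside RapidMiner, assuming the results are available as a list of rows with a `polarity` field (a hypothetical column name), the distribution could be computed like this:

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple


def polarity_distribution(rows: Iterable[Mapping[str, str]], column: str = "polarity") -> Dict[str, Tuple[int, float]]:
    """Return, for each polarity label, its count and its share of all rows."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {label: (count, count / total) for label, count in counts.items()}
```

Each entry maps a label such as "positive" to a pair of (number of tweets, fraction of all tweets), which is exactly what a pie-chart slice represents.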

There you have it: analyzing text in RapidMiner has never been easier.

We’re really excited to see what kind of mashups and Processes RapidMiner users come up with. For the more seasoned RapidMiner user, we’ve also put together some more advanced tutorials/mashups, one using Twitter Search and the other utilizing RSS feeds.

We’ve also put together a repository for sample Processes that we’ll be adding to on a regular basis. It will be a collection of use-case-focused RapidMiner Processes that can be downloaded and imported directly into RapidMiner. You can find more info in the documentation section of our website.

P.S. If you’ve built or are building something cool, tell us about it, we may even feature it on our blog!

### Introduction

In this tutorial we’re going to walk you through using the “Text Analysis by AYLIEN” Extension for RapidMiner to collect and analyze tweets. If you’re new to RapidMiner, or it’s your first time using the Text Analysis Extension, you should read Part 1 of our Getting Started Blog, which takes you through the installation process. Also, if you haven’t got an AYLIEN account, which you’ll need to use the Extension, you can grab one here.

So, here’s what we’re going to do:

1. Collect tweets using the Search Twitter Operator
2. Analyze their Sentiment using the Analyze Sentiment Operator
3. Assign the tweets to different categories using the Categorize Operator
4. Visualize our results and make them more consumable and understandable

Create a new Process in RapidMiner and add a Search Twitter Operator. Build your desired search as you would using the Twitter search API. You can see from the screenshot below we’re searching for tweets containing the keyword “Samsung”. We’ve cleaned up our search a little by removing retweets (-rt) and links (-http). We’ve also restricted the number of tweets to collect to 20 and decided we only want to see English tweets by adding “en” in the language parameter. We’ve also indicated that we want only recent or popular tweets to be returned using the “Result type” parameter.

Firstly, we’ll have a look at what kind of results our search returns. Once you hit run (don’t forget to connect your Operators) the results from the Twitter search are displayed in an ExampleSet tab, like the one below:

So now we have a collection of 20 tweets stored in an ExampleSet, ready to be analyzed further. The first thing we’re going to do from an analysis point of view is try to determine the Sentiment of each tweet, i.e. whether it is Positive, Negative or Neutral.

We do this by adding the Analyze Sentiment Operator to our Process and selecting “text” as our “Input attribute” on the right-hand side, as shown in the screenshot below.

So now we have a relatively simple Twitter Sentiment Analysis Process that collects tweets about “Samsung” and classifies them according to their Polarity.

As is displayed in the ExampleSet below, the results now contain not only the tweets that were pulled in but their corresponding Polarity and Subjectivity as well as a confidence score for both.

So we’ve determined the sentiment of the tweets but like we said in the beginning, we also want to categorize them in some way. We can do this pretty easily by using the Categorize Operator from the Text Analysis Extension, but before we do we need to prepare our data for analysis.

Firstly we’re going to use a Data to Documents Operator to generate Documents from our existing data set making it easier to categorize.

We’ll then add a Categorize Operator, which classifies our text based on a particular taxonomy. In this case we’re using the IAB QAG taxonomy, a standard used in the digital advertising industry for categorizing content.

Now our Process is starting to take shape, but because we previously transformed our data into documents before they were categorized, we need to reverse the process and create a dataset from the resulting categorized documents, which in turn will make it easier to visualize and understand as a whole.

So here’s what our completed Process looks like.

It collects tweets, analyzes the Sentiment of those tweets, prepares them for categorization against a taxonomy and displays the results in an ExampleSet, like the one below.

Cool, huh?

### Visualizing the Results

So we have our results stored in a table (ExampleSet) but in order to make them more presentable we want to visualize them a bit better.

RapidMiner lets you display and visualize the results of your Process really easily using simple charts and visualizations like the ones below, which can all be created using the Charts widget on the left-hand side of your results display.

Bar chart showing # of positive, negative and neutral tweets:

Pie chart showing # of positive, negative and neutral tweets:

Pie chart showing a breakdown of tweets by their top-level category:

The Data Junkies among us, however, may want to export their results and visualize them using something else, like Tableau for example (there’s an integration on the way for that, by the way).
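If you do export your results, a breakdown by top-level category is easy to reproduce elsewhere. The sketch below is only an illustration in Python. It assumes the results were exported as a CSV file with a column named `category_code` (a hypothetical name) and that category codes look like `IAB19` for a top-level category or `IAB19-18` for a subcategory.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def top_level(code: str) -> str:
    """Reduce a code such as 'IAB19-18' to its top-level part, 'IAB19'."""
    return code.split("-", 1)[0]


def category_breakdown(csv_file: TextIO, column: str = "category_code") -> Counter:
    """Count exported rows per top-level category, skipping rows with no code."""
    reader = csv.DictReader(csv_file)
    return Counter(top_level(row[column]) for row in reader if row.get(column))


# Small in-memory sample standing in for an exported file.
sample = io.StringIO(
    "text,category_code\n"
    "first tweet,IAB19-18\n"
    "second tweet,IAB19\n"
    "third tweet,IAB3-1\n"
)
breakdown = category_breakdown(sample)
```

With a real export you would pass an opened file instead of the in-memory sample, and the resulting counts can be fed to whichever charting tool you prefer.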

So, there you have it, that’s how you build a Twitter Mining process in RapidMiner using “Text Analysis by AYLIEN” and the Search Twitter Operator. We’ve also put together a tutorial guide to extracting and analyzing RSS feeds from popular news outlets, which you may also find interesting.

We’ve also created a repository for sample Processes that we’ll be adding to on a regular basis. It will be a collection of use-case-focused RapidMiner Processes that can be downloaded and imported directly into RapidMiner. You can find more info in the documentation section of our website.

Sentiment Analysis is a well-known task in Text Analysis, and it’s defined as the use of Natural Language Processing, Machine Learning and Computational Linguistics to identify and extract subjective information in source materials. It’s also commonly known as opinion mining.

Extracting and understanding opinions from text is an extremely hard thing for machines to do. Heck, it’s even difficult for humans to agree on whether a piece of text is positive or negative. There are a number of reasons for this: there can be mixed sentiment in a piece of text, there can be sarcastic tones present, there can be slang or short-hand writing, etc.

Sentiment Analysis is an area of hot debate and active research in the data mining world. It’s a data analysis technique that’s often bad-mouthed and trashed as inaccurate and misleading.

So why is it such a hot topic then? Why haven’t people just given up on it? Why are companies and researchers still fixated on solving this problem?

In short, it’s because of the opportunity out there. There is a wealth of information hidden in user-generated content (news articles, reviews, Tweets, Facebook posts, Instagram comments), and the sheer rate at which we’re creating this sort of content online means human analysis just isn’t able to keep up without the help of modern technology. Being able to mine text for opinions is big business for brands, governments and researchers. Analyzing opinions on social media is the modern-day focus group; the only difference is that you’re getting honest feedback and opinions from outside of a controlled environment.

This is why, at AYLIEN, we’re focused on constantly keeping our Sentiment Analysis models to the highest standard possible when it comes to accuracy (precision, recall and confidence) and why we’re constantly evaluating and updating our approach to the problem.

We recently updated our sentiment model and so far, following our testing and customer feedback, we’re really happy with the improvements we’ve seen.

### Accuracy

So, firstly, we’ve seen an improvement in how accurate our system is. State-of-the-art performance for Sentiment Analysis systems on Twitter data is believed to be around 80% accuracy. In tests on our updated model we’ve seen an overall increase of 7-8% in accuracy compared to our previous model, which takes us into the ~80% range: closer to, and in some cases better than, state-of-the-art (yay!).

Pro tip: if anyone tells you their Sentiment Analysis is 100% accurate, especially on Social data…you should turn around and run as fast as you can.

### Confidence Scores

Second, we’ve also significantly improved how we calculate our confidence scores to ensure our end users know how confident we are on each prediction we make. As you may have heard us say before, a good Sentiment Analysis solution should not only be accurate in its results but it should also know when it might be wrong, so the accuracy of the confidence score is as important as the actual prediction, if not more.

Finally, we’ve seen some massive improvements in how we handle Negation in text.

### Negation and Mixed Sentiment

Like we said at the beginning, one of the main challenges with understanding sentiment and opinions is the complexity involved in how we human beings express our thoughts and opinions, and form a message.

Often when people give feedback about something, it’s a mix of things they liked and disliked about that thing. So for instance you might say “I like the battery life of this phone, but the screen sucks!”.

Also some messages are commonly expressed as a negation of another message, so we typically say “I don’t like the food” instead of “I dislike the food”.

Both of these complexities pose challenges for Sentiment Analysis systems. They mean that you can’t rely only on the “polar” words (e.g. “like”, “love”, “hate”, etc.); you also need to take their context, and the general structure of the sentence, into account in order to make better judgements.
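A toy example shows why counting polar words alone falls short. The sketch below is a deliberately naive illustration in Python and has nothing to do with how our models work. The word lists, the tokenizer and both scoring functions are made up for this example.

```python
from typing import List

POSITIVE = {"like", "love", "great"}
NEGATIVE = {"hate", "terrible", "sucks"}
NEGATORS = {"not", "don't", "dont", "never", "no"}


def tokenize(text: str) -> List[str]:
    """Lowercase, normalize curly apostrophes and strip surrounding punctuation."""
    return [token.strip(".,!?\"").lower() for token in text.replace("’", "'").split()]


def label(score: int) -> str:
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"


def naive_polarity(text: str) -> str:
    """Count polar words only, ignoring all context."""
    score = 0
    for token in tokenize(text):
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return label(score)


def negation_aware_polarity(text: str) -> str:
    """Same counting, but a negator flips the next polar word that follows it."""
    score = 0
    negate = False
    for token in tokenize(text):
        if token in NEGATORS:
            negate = True
            continue
        value = 1 if token in POSITIVE else -1 if token in NEGATIVE else 0
        if value:
            score += -value if negate else value
            negate = False
    return label(score)
```

For “I don’t like the food”, the naive counter sees only “like” and answers positive, while the negation-aware version flips it to negative. Neither version copes with mixed sentiment: for a sentence with one liked and one disliked aspect, both simply cancel out to neutral, which is why context and sentence structure matter.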

With this release we have fixed a lot of the issues our previous models had with negation.

### Let’s see a few examples:

Input:

“I don’t like their food”

Output:

{
  "polarity": "negative",
  "subjectivity": "subjective",
  "text": "i dont like their food",
  "polarity_confidence": 0.6314665758824843,
  "subjectivity_confidence": 0.9999774309011896
}

Input:

“I don’t like their food, but the service is great”

Output:

{
  "polarity": "positive",
  "subjectivity": "subjective",
  "text": "i dont like their food, but the service is great",
  "polarity_confidence": 0.8229126377773636,
  "subjectivity_confidence": 0.9999797608112301
}

Input:

“I like their food, but the service is terrible”

Output:

{
  "polarity": "negative",
  "subjectivity": "subjective",
  "text": "i like their food, but the service is terrible",
  "polarity_confidence": 0.9981542081035445,
  "subjectivity_confidence": 0.9999999992706756
}
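One practical way to use the confidence scores in these outputs is to treat low-confidence predictions differently from high-confidence ones. The Python sketch below assumes each result has been parsed into a dictionary with the field names shown above. The 0.7 threshold and the “uncertain” label are arbitrary choices made for illustration, not recommended values.

```python
import json
from typing import Any, Mapping

CONFIDENCE_THRESHOLD = 0.7  # hypothetical cut-off, tune it for your own data


def confident_polarity(result: Mapping[str, Any], threshold: float = CONFIDENCE_THRESHOLD) -> str:
    """Return the predicted polarity, or 'uncertain' when confidence is too low."""
    if result["polarity_confidence"] < threshold:
        return "uncertain"
    return result["polarity"]


# Polarity fields taken from the first and third examples above.
first = json.loads('{"polarity": "negative", "polarity_confidence": 0.6314665758824843}')
third = json.loads('{"polarity": "negative", "polarity_confidence": 0.9981542081035445}')

print(confident_polarity(first))  # below the threshold, so flagged as uncertain
print(confident_polarity(third))  # well above the threshold, so kept as negative
```

Flagged items can then be reviewed by hand or left out of aggregate charts, depending on how much a wrong label would cost you.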

We enjoy working on hard problems at AYLIEN, and they don’t come much harder than teaching machines to understand opinions from text. Sentiment Analysis is something we’re constantly working on: we’re regularly updating and tinkering with our models to offer the best, most accurate service we can.

So what’s next?

Well, we can’t tell you much, but we’ve been working hard on a ground-breaking Sentiment Analysis pipeline that we will be launching later this year. So stay tuned!

Give it a try, check out our Sentiment Analysis demo.