Building a Text Analysis process for customer reviews in RapidMiner
In this tutorial we’re going to show you how easy it is to analyze customer opinion in reviews and social media content using the “Text Analysis by AYLIEN” Extension for RapidMiner. In particular we will walk you through building a review analysis process using our Aspect-Based Sentiment Analysis feature to mine and analyze customer reviews.
If you’re new to RapidMiner, or it’s your first time using the Text Analysis Extension, you should first read our Getting Started tutorial, which takes you through the installation process and the basics of using our extension. You can download a copy of the Extension from the RapidMiner marketplace.
N.B. If you haven’t got an AYLIEN account, you can sign up for a free account here. You’ll need this to use the Text Analysis Extension.
What is Aspect-based Sentiment Analysis (ABSA)?
The whole idea behind our ABSA features is to provide a way for our users to extract specific aspects from a piece of text and measure the sentiment towards each aspect individually. Our customers use it to analyze reviews, Facebook comments, tweets and input from customer feedback forms to determine not just the sentiment of the overall text but what aspects, in particular, a customer likes or dislikes from that text.
We’ve trained domain-specific models for the following industries:
Building the review analysis process
So, here’s what we’re going to do:
- Analyze the sentiment of reviews collected in a CSV file
- Understand the top aspects mentioned and their sentiment (positive, negative or neutral)
- Run a correlation analysis on the words and aspects
- Visualize our findings
Here’s what our completed process will look like when we’re finished. In this tutorial we’re going to walk you through each step of the process and what operators we used to analyze hotel reviews.
Step 1. Analyzing reviews
For the purpose of this tutorial we are going to use a collection of reviews we gathered on one particular hotel from publicly available sources. Our reviews were listed in a CSV file which we loaded into RapidMiner using the Read CSV operator.
Loading the reviews is the easy part. As you can see from the completed process we start with a Read CSV operator that reads the file containing all our reviews from the disk. All you need to do is to specify the path to the file. We then use the AYLIEN Analyze Aspect-Based Sentiment operator to analyze the sentiment of each review in our file.
Remember to set your input attribute to “Review” and to choose your domain option in the parameters section of the operator. In this case we’re using the hotels model.
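For readers who think in code, the loading step can be sketched outside RapidMiner as follows. This is only an illustration in Python, not part of the RapidMiner process: the sample data is made up, the helper name load_reviews is hypothetical, and the only detail taken from the tutorial is that the review text lives in an attribute called “Review”.

```python
import csv
import io

# Made-up stand-in for the reviews file; in the tutorial the file is read from disk.
SAMPLE_CSV = """Review
"The beds were comfortable, but the room was dirty."
"Great location and friendly staff."
"""

def load_reviews(handle, column="Review"):
    """Return the text in the given column for every row of the CSV."""
    return [row[column] for row in csv.DictReader(handle)]

reviews = load_reviews(io.StringIO(SAMPLE_CSV))
print(len(reviews), "reviews loaded")
```

To read a real file, pass an open file handle instead of the in-memory string.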
Once you’ve run the analysis, a new ExampleSet will be generated under the Results tab. It contains a new column listing the aspects present in each review and the respective polarity (positive, negative or neutral) of each aspect.
You can see the identified aspects and their polarity listed in the column in yellow.
While we have the results listed in the ExampleSet, their format makes them a little difficult to analyze and visualize further, which is why we need to spend some time cleaning and prepping the results.
Step 2. Prepping the results
In order to make sense of the data we are going to create word vectors through a tokenization process using a “Process documents to data” operator as shown in the image below. This operator is used to create word vectors from string attributes through tokenization and other text processing functions.
Before running our tokenization we duplicate our data using a Multiply operator, which allows us to run two types of analysis in parallel on the same data with different end goals in mind. This is why we have two separate “Process documents to data” operators in our process.
The ABSA Result Processor:
For the first process we’re going to tokenize our ABSA results (which are in the format “aspect:polarity”) using a simple whitespace split, and we’re going to assign weights to these newly created columns or features based on Binary Term Occurrences, i.e. a 1 if a specific aspect:polarity pair exists in a review, and 0 otherwise. You can see the various parameters we’ll use in the Parameters section below.
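To make the idea of Binary Term Occurrences concrete, here is a minimal Python sketch of what the ABSA Result Processor produces. It is an illustration under assumptions, not the operator’s implementation: the function name and the sample strings are made up, and the only details taken from the text are the whitespace split and the 1/0 weighting.

```python
def binary_term_occurrences(absa_results):
    """Build a 0/1 table: one row per review, one column per aspect:polarity token."""
    # Whitespace split, as described for the ABSA Result Processor.
    tokenized = [result.split() for result in absa_results]
    # Every distinct aspect:polarity pair becomes a column.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = [
        {term: int(term in tokens) for term in vocabulary}
        for tokens in tokenized
    ]
    return vocabulary, rows

# Made-up ABSA output, one string per review.
absa_results = [
    "beds:positive cleanliness:negative",
    "location:positive staff:positive",
    "beds:positive",
]
vocabulary, rows = binary_term_occurrences(absa_results)
for row in rows:
    print(row)
```

Each row holds a 1 for every aspect:polarity pair found in that review and a 0 for every pair that only appears in other reviews.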
The Review Text Processor:
For the second processor we’re going to run some further text processing functions on the review text from our duplicated set.
First we’ll tokenize the text to create unigram tokens and transform them all to lowercase. We’ll then clean the data by filtering out tokens that contain non-letter characters, such as numbers and punctuation, using a regular expression ([A-Za-z]*). Finally, we’ll discard tokens shorter than 3 characters and remove all stopwords. All of these functions can be seen in the sub-process below.
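The same chain of text processing functions can be sketched in Python. This is a rough equivalent under stated assumptions rather than what the RapidMiner operators do internally: tokens are produced with a plain whitespace split, the stopword list is a tiny made-up sample, and the pattern uses + instead of * so that empty strings never match.

```python
import re

# Tiny illustrative stopword list (an assumption); a real process would use a fuller one.
STOPWORDS = {"the", "and", "was", "were", "but", "with", "for"}

# Letters only; "+" rather than "*" so the empty string is not accepted.
LETTERS_ONLY = re.compile(r"[A-Za-z]+")

def preprocess(review, min_length=3):
    """Tokenize, lowercase, keep letter-only tokens, drop short tokens and stopwords."""
    tokens = review.split()                                     # unigram tokens
    tokens = [t.lower() for t in tokens]                        # transform case
    tokens = [t for t in tokens if LETTERS_ONLY.fullmatch(t)]   # drop numbers, punctuation
    tokens = [t for t in tokens if len(t) >= min_length]        # drop tokens under 3 chars
    return [t for t in tokens if t not in STOPWORDS]            # drop stopwords

tokens = preprocess("The beds were comfortable but the TV in room 12 was dirty")
print(tokens)
```

Note that with a whitespace split a token such as “dirty.” still carries its punctuation and is therefore filtered out entirely; a tokenizer that splits on non-letter characters would keep the word instead.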
So now that we have our sentiment analysis done and the data processed and cleaned, we’ll run some further processes which will make mining the data and visualizing it a little easier.
Step 3. Splitting and filtering results
First we’ll use a Split operator to separate out aspects and polarity attributes (which, if you recall, are in the format “aspect:polarity” in our data, e.g. “beds:positive”). In the parameters section you should choose the attribute you want to split and the split pattern. In this case we’re going to split by “:” in our results as shown below.
The ExampleSet generated should resemble the one below showing the attribute split into word_1 (Aspect) and word_2 (Polarity) columns:
Using the duplicated results we’re going to isolate both positive and negative results using a simple Filter operator which is also shown in the image above.
Your filtered ExampleSets should resemble the one displayed below, showing the positive aspects and their count. You will also have noted that we used a Sort operator to sort our results in descending order of total occurrences, i.e. which aspect:polarity pairs appeared most frequently in the entire review set, which will help in our visualization process.
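Splitting, filtering and sorting as described in this step can be sketched in a few lines of Python. The sample pairs and the helper name count_aspects are made up for illustration; only the “:” split pattern, the polarity filter and the descending sort come from the text.

```python
from collections import Counter

def count_aspects(pairs, polarity):
    """Split each "aspect:polarity" token on ":" and count aspects with the given polarity."""
    counts = Counter()
    for pair in pairs:
        aspect, _, pair_polarity = pair.partition(":")  # split on the first ":"
        if pair_polarity == polarity:                   # keep one polarity only
            counts[aspect] += 1
    # most_common() yields (aspect, count) tuples in descending order of count.
    return counts.most_common()

# Made-up aspect:polarity pairs collected across all reviews.
pairs = [
    "beds:positive", "cleanliness:negative", "beds:positive",
    "location:positive", "cleanliness:negative", "staff:negative",
    "cleanliness:negative",
]
positive = count_aspects(pairs, "positive")
negative = count_aspects(pairs, "negative")
print(positive)
print(negative)
```

Keeping positive and negative counts in separate lists mirrors the two filtered ExampleSets and makes it straightforward to chart each one on its own.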
Step 4. Correlation analysis
The final step before we visualize our results is running a correlation analysis between words used in our reviews and the positive, negative and neutral aspects. We want to see which words are most commonly used to express a certain sentiment (positive, negative or neutral) towards a certain aspect (e.g. beds).
Luckily in RapidMiner this is very easy to do using the Correlation Matrix operator. In order to use it however we first need to join the two ExampleSets that we created separately, so we’ll have the words and the aspect:polarity pairs in one dataset. To be able to do that, we need to assign numerical IDs to our results, which can be done with the Generate ID operators. Afterwards we’ll simply use the Join operator to merge these ExampleSets and feed the result to the Correlation Matrix operator.
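The role of the IDs in the join can be illustrated with a short Python sketch. The function names and sample rows are hypothetical, and the sketch assumes both ExampleSets keep the reviews in the same order, so that the same ID refers to the same review on both sides.

```python
def generate_ids(rows):
    """Attach a 1-based numeric id to each row, in order."""
    return {index: row for index, row in enumerate(rows, start=1)}

def inner_join(left, right):
    """Merge the attributes of rows that share an id."""
    return {key: {**left[key], **right[key]} for key in left if key in right}

# Made-up binary word vectors and aspect:polarity vectors for two reviews.
word_rows = [
    {"dirty": 1, "comfortable": 0},
    {"dirty": 0, "comfortable": 1},
]
aspect_rows = [
    {"cleanliness:negative": 1, "beds:positive": 0},
    {"cleanliness:negative": 0, "beds:positive": 1},
]
joined = inner_join(generate_ids(word_rows), generate_ids(aspect_rows))
print(joined[1])
```

After the join, every review has its word columns and its aspect:polarity columns side by side, which is the shape a correlation analysis needs.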
Your Correlation Matrix should resemble the one below:
The correlation coefficient (the values in the matrix) ranges from -1 to 1. Values close to 1 indicate a strong positive correlation, values close to -1 indicate a strong inverse correlation, and values close to 0 indicate little or no linear relationship.
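To show where the values in such a matrix come from, here is a minimal Python sketch that computes a correlation coefficient between two binary columns. It assumes the Pearson coefficient and uses made-up columns; the function name pearson is our own.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # a constant column has no defined correlation
    return covariance / math.sqrt(var_x * var_y)

# Made-up binary columns, one value per review.
dirty = [1, 0, 1, 0, 0]
comfortable = [0, 1, 0, 1, 1]
cleanliness_negative = [1, 0, 1, 0, 0]

print(pearson(dirty, cleanliness_negative))
print(pearson(comfortable, cleanliness_negative))
```

In this toy data “dirty” appears in exactly the reviews flagged cleanliness:negative, so its coefficient sits at the top of the range, while “comfortable” appears only in the other reviews and lands at the bottom.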
Using the matrix table you can filter and identify words extracted from reviews that correlate with a certain aspect:polarity attribute and vice-versa, as shown in the example below where words like dirty, blankets, complaint, dingy and cigarette correlate with the negative references to cleanliness.
Step 5. Basic Visualization
Doing basic visualizations in RapidMiner is easy using either the Charts function or the Advanced charts capabilities in your Results tabs. Below we’ve used some simple bar charts to visualize our findings.
Negative aspects mentioned
Positive aspects mentioned
Polarity of aspects mentioned
You can download the entire RapidMiner Process and try it for yourself – Download the process
If you’d like to read more about how you can collect reviews for analysis using RapidMiner, check out our tutorial on Scraping Rotten Tomatoes reviews with RapidMiner.
We’ve also found some useful customer review datasets which you can use if you’d like to build this process yourself using sample reviews.
- Review data sets for “Latent Aspect Rating Analysis” (http://sifaka.cs.uiuc.edu/~wang296/Data/index.html)
- OpinRank Dataset – Reviews from TripAdvisor and Edmunds
- Restaurant Reviews Dataset