Build a Text Mining workflow to scrape and analyze reviews from the web
Online review sites are the world’s repository of customer opinion – every day, hundreds of thousands of customers give publicly available feedback on their experiences with businesses. With customer opinion available on a scale like this, anyone can generate insights about their business, their competitors, and potential opportunities.
But to leverage these sites like this, you need to understand what is being talked about positively and negatively in the text of hundreds or thousands of reviews. Since analyzing that amount of reviews manually would be far too time consuming, most people don’t consider any kind of quantitative analysis beyond looking at the star ratings, which are too vague and can be frequently misleading.
So in this blog, we’re going to show you how to use Text Mining to quickly generate accurate insights from thousands of reviews. For this blog, we’re going to scrape and analyze restaurant reviews from TripAdvisor and show you how easy it is to build a robust sentiment analysis workflow without writing any code using import.io and the AYLIEN Text Analysis Add-on for Google Sheets.
We’ll break the process down into three easy-to-follow steps:
- We’ll show you how to use import.io to scrape reviews from TripAdvisor
- We’ll use the AYLIEN Text API Google Sheets Add-on to analyze the sentiment expressed in each review toward 13 aspects of the dining experience.
- We’ll show you the results of our sample analysis
As we mentioned, neither of the tools we’ll use require coding skills, and you can use both of them for free.
Why are star reviews not enough on their own?
Take a look at the difference between these three-star reviews (which are for the same branch of the same restaurant chain):
From looking at these reviews, you can spot two important things with the star ratings and the review texts:
- Even though the star rating is the same, one of the reviews is positive, the other is negative. This gap between the star rating and what the reviewer really thought is part of the reason Netflix recently ditched the star review system.
- The text review allows you to see why the review is positive or negative – the specific aspects that made their dining experience positive or negative.
So to get an accurate analysis of customer opinion from reviews, you need to read the text of every review. The problem here is that doing this at scale is extremely time consuming and pretty much impossible. But we can solve this problem using Text Analytics and Machine Learning.
How to use import.io to scrape reviews from TripAdvisor
In order to find out what people are saying about businesses, we first need to gather the reviews. For this blog, we decided to analyze customer reviews of Texas Roadhouse, ranked by Business Insider as America’s best restaurant chain.
We chose to compare reviews on their branch in Gatlinburg, Tennessee with the branch in Dubai – as this might let us see how customers in diverse regions are responding to the Texas Outhouse offering. Each of these branches had more than 1,000 reviews, which gives us a generous amount of data to analyze.
Usually, gathering data like this would involve writing code to scrape the review sites, but import.io makes this task a lot easier – they allow you to scrape sites by simply pointing and clicking at the data you want. You can sign up for a free trial here and see a handy introductory video here (but we’ll walk you through the process below).
Once you’ve picked which restaurant you want to analyze and you’ve signed up for a trial with import.io, open up the restaurant’s TripAdvisor page in import.io. To do this, just enter the URL in the New Extractor input box. If you point and click on the text of a review, import.io will scrape all of the reviews on the page and save them for you.
You’ve now scraped the reviews from a single page. But since you’ll probably want a lot more than the 10 reviews on each TripAdvisor page, we’ll show you how you scrape a few hundred in one go.
Scraping hundreds of reviews at once
You may notice that when you are browsing reviews of a restaurant on TripAdvisor, the page url changes every time you select the next 10 reviews – it adds “-or10” for the next ten results, “-or20-” for the following 10, and so on. You can see it in the URL right before the restaurant name.
In our Texas Roadhouse example, the URL goes from this:
Import.io allows us to scrape numerous webpages at once if we upload a list of URLs on a spreadsheet. So to gather 1,000 restaurant reviews, we need to upload a spreadsheet with 100 of these URLs, with the -”or10” increasing by 10 each time.
To make your life a little easier, we’ll share the simple, six-step workaround we used for this with you here:
Step 1: Select the URL of the second page of results containing reviews of the restaurant you want to analyze. In our case it’s https://www.tripadvisor.ie/Restaurant_Review-g295424-d2310358-Reviews-or10-Texas_Roadhouse-Dubai_Emirate_of_Dubai.html
Step 2: Open up a spreadsheet and fill the first three cells (A1, B1, and C1) with the URL, but only up to “-or10” – copy and paste the remainder of the URL to somewhere else for now (in our case we’ll cut “-Texas_Roadhouse-Dubai_Emirate_of_Dubai.html” and paste it to another cell).
Step 3: Edit the cells in B1 and C1 to end with “-or20” and “-or30”, respectively. Then select these cells, and extend the selected cells until you have 100 rows covered. Excel or Google Sheets will then follow the pattern you have set in the first three.
Step 4: since this is not the completed URL, you’ll need to add the end of the URL you have from step 2 to the end of the text in each cell. You can do this by selecting a cell in row A and typing “A1&[the rest of your URL],” and extending that cell’s format downwards again.
Step 5: copy and paste the values of this new column into column A, and save your spreadsheet. Your spreadsheet should now have one column with 100 rows.
Step 6: Open up import.io and create a new extractor, and open up settings. Click on Import URLs, and select the spreadsheet with your URLs and save them. Once you click Run URLs, import.io will start scraping the 1,000 reviews from the URLs you’ve given it. Once it’s done, download the results, and open the file in Google Sheets.
Analyzing the Sentiment of Reviews
So at this point, we’ve gathered 1,000 reviews of each Texas Roadhouse branch, with each review containing a customer’s feedback about their experience in the restaurant. In every review, customers express positive, negative, and neutral sentiment toward the various aspects of their experience.
AYLIEN’s Aspect-Based Sentiment Analysis feature detects the aspects mentioned in a piece of text, and then analyzes the sentiment shown toward each of these aspects. In this blog, we’re analyzing restaurants but you can also use this feature to analyze review of hotels, cars, and airlines. In the restaurants domain, the Aspect-Based Sentiment Analysis feature detects mentions of seven aspects.
Using our Text Analysis API is easy with the Google Sheets Add-on, which you can download for free here (the Add-on comes with 1,000 credits free so you can test it out). You can complete the analysis by following these three easy steps:
Step 1: Once you’ve downloaded the Add-on, it will be available in the Add-ons menu in your Google Sheets toolbar. Open it up by selecting it and clicking Start.
Step 2: Before you begin the Aspect-based Sentiment Analysis of your reviews, first select the option from the Analysis Type menu, then select all of the cells that contain your reviews.
Step 3: To begin the sentiment analysis, click Analyze. The Text API will then extract the aspects mentioned in each review one by one, and print them in three columns next to the review – Positive, Negative, and Neutral. These results will be returned to you at a rate of about three per second, so our 2,000 reviews should take around ten minutes to analyze.
Results of the Aspect-Based Sentiment Analysis
At this point, each line of your spreadsheet will contain a column of the reviews you gathered, a column of the aspects mentioned in a positive tone, one with the aspects mentioned in a negative tone, and one with aspects mentioned in a neutral tone. To get a quick look into our data, we put together the following visualizations by simply using the spreadsheet’s word counting function.
First off, let’s take a look at the most-mentioned aspects in all of the reviews we gathered. To do this, all you need to do is separate every aspect listed in Google Sheets into its own cell using a simple function, and then use a formula to count them.
To put each aspect mentioned into its own cell, we’ll use the Split text to columns function, in the Data toolbar. This function will move every word in a cell into a cell of its own by splitting the cell horizontally – that is, if a cell in column A has three words, the Split text function will move the second two words into the adjacent cells in columns B and C.
From the pie chart, we can see that the food and staff alone accounted for almost two thirds of the total mentions, and after that there’s a bit of a drop off. After these aspects, customers in these restaurants were concerned about was how busy the restaurant was and the value of the meal.
Knowing which aspects of the dining experience people were most likely to leave reviews about is useful, but we can go further and analyze the sentiment attached to each aspect. Let’s take a look at the sentiment attached to each aspect in each of the Texas Roadhouse branches.
To do this, use Google Sheets’ COUNTIF formula to count every time the Text API listed an aspect in the positive, negative, and neutral columns. Do this by creating a table with each aspect as rows and Positive, Negative, and Neutral as columns, and use the following formula: =COUNTIF(the range of cells that contain the aspects in each sentiment,”*aspect*”).
After you’ve entered the formula, fill it out correctly, like in the example below, where you can see the formula filled out to count the amount of times food is mentioned positively – =COUNTIF(B1:B988,“*food*”).
Once you’ve done this, fill in the results on a table like the one below, and then insert a chart from the Insert tab.
We chose a stacked bar chart, as it allows us to get a quick grasp of what aspects people were interested in and how they felt about each aspect. First off, take a look at the sentiment shown to each aspect by the reviewers of the Dubai branch. You can see that the reviews are very positive:
When we compare the reviews of the Dubai branch above with the Tennessee reviews, we can see immediately that the American branch received more positive reviews than its Dubai counterpart:
Interestingly, we can also see from the volume of the mentions of each aspect that customers in Dubai were more concerned with value than their American counterparts, where reviewers paid more attention to the restaurant staff (with most of this extra attention being negative).
These are just a few things that jumped out at us after a sample analysis of a couple of restaurants. If you want to get started leveraging TripAdvisor (or another review site) for your own research using the steps in this blog, sign up for a free trial with import.io here, and download our Google Sheets Add-on here (there’s no sign-up required for the Add-on and it comes with free credits so you can test it out).