Analyzing Text in RapidMiner – Part 1
One of the major challenges with mining the Web and Social Media for insights is trying to get all of your data into one place. To do this, you need to extract information from multiple sources in order to gain an accurate and holistic view.
Combining multiple data sources and analyzing their content can be a daunting task, but data mining frameworks such as RapidMiner and Weka make the process quick and straightforward.
In this blog post, we’re going to show you how to use AYLIEN’s Text Analysis API from within RapidMiner to analyze text gathered from sources on the web.
The Web Mining extension for RapidMiner provides access to internet sources like web pages, RSS feeds, and web services. In this tutorial, we’re going to use it to make HTTP requests to the Text Analysis API. In part 2 we will use it to scrape information from web pages such as Rotten Tomatoes.
Step 1: Install Web Mining for RapidMiner
- Open the RapidMiner Marketplace by selecting Help > Updates and Extensions (Marketplace)
- Search the Marketplace for “Web Mining” and install the extension
Step 2: Set up the API call
The Web Mining package provides you with an operator for invoking external web services. This operator is called “Enrich Data by Webservice” and can be found in the Operators panel under Web Mining > Services > Enrich Data by Webservice.
- Drag and drop an instance of the Webservice operator into your process
- Select the operator to access its configuration parameters
- Set the following values for the parameters:
url: “https://api.aylien.com/api/v1/sentiment?mode=tweet&text=<%text%>” or if you’re using Mashape: “https://aylien-text.p.mashape.com/sentiment?mode=tweet&text=<%text%>”
request method: POST
- If you’re using Mashape:
query type: XPath
Here we are calling the /sentiment endpoint of the Text Analysis API to find out whether a piece of text is positive, negative or neutral.
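To make the operator's configuration more concrete, here is a minimal Python sketch of what the url template and request method amount to outside RapidMiner. It only assembles the request and never sends it. The helper name `build_sentiment_request` is hypothetical, and any authentication headers your account requires are not shown.

```python
from urllib.request import Request

# URL template from the operator configuration. <%text%> is the placeholder
# that gets filled with the value of each row's "text" attribute.
URL_TEMPLATE = "https://api.aylien.com/api/v1/sentiment?mode=tweet&text=<%text%>"


def build_sentiment_request(encoded_text: str) -> Request:
    """Substitute already URL-encoded text into the template (hypothetical helper)."""
    url = URL_TEMPLATE.replace("<%text%>", encoded_text)
    # The operator is configured with request method POST.
    return Request(url, method="POST")


# Build, but do not send, a request for one piece of encoded text.
request = build_sentiment_request("I%20love%20puppies%21")
print(request.get_method(), request.full_url)
```

The text passed in is assumed to be URL-encoded already, which is what the Encode URLs operator takes care of in Step 3.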
Step 3: Set up the input text
Now that our API call is set up, we need to provide the operator with some input text.
- Install the Text Processing extension, the same way you installed the Web Mining extension in Step 1
- Add an instance of the Text Processing > Create Document operator
- Select the Create Document operator and add some text by clicking Edit Text
- Add the Text Processing > Documents to Data operator to convert the Document to an ExampleSet, and set the text attribute parameter to “text”
- Add the Web Mining > Utility > Encode URLs operator to URL-encode the text, and set the url attribute parameter to “text”
- Finally, connect the URL-encoded text input to the Enrich Data by Webservice operator created in Step 2
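If it helps to see the data flow of this step in code, the sketch below mimics it in plain Python: a list of dictionaries stands in for the ExampleSet produced by Documents to Data, and the "text" attribute is percent-encoded before it is handed to the web service call. The function name `encode_text_attribute` is hypothetical, and RapidMiner's Encode URLs operator may encode some characters differently (for example, spaces as "+"), so treat this as an illustration rather than a description of the operator's internals.

```python
from urllib.parse import quote

# Stand-in for the ExampleSet: one row per document, with a "text" attribute.
rows = [{"text": "I love puppies!"}]


def encode_text_attribute(rows, attribute="text"):
    """Return copies of the rows with the given attribute percent-encoded."""
    # safe="" makes quote() encode every reserved character, including "/".
    return [{**row, attribute: quote(row[attribute], safe="")} for row in rows]


encoded = encode_text_attribute(rows)
print(encoded[0]["text"])
```

Encoding matters here because the text is placed directly into the query string of the URL, where spaces and punctuation would otherwise break the request.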
Step 4: Run!
Now that we have everything set up, it’s time to run our process by clicking the Run button.
As you can see, “I love puppies!” was deemed to be positive and the result is now accessible in RapidMiner for further analysis and reporting. You could use one of the many other methods provided in the Text Processing package to generate any number of documents and analyze their sentiment in the same fashion. Also, by changing the url parameter in the API call you can access any other endpoint from the Text API (Concept Extraction, Classification, Summarization and so on).
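As a rough illustration of switching endpoints, the Python sketch below generates the string you would paste into the url parameter. The helper `url_template` is hypothetical. Only the sentiment path comes from this post; the `classify` path is an assumption used for illustration, so check the API documentation for the real endpoint names and the parameters each one expects.

```python
from urllib.parse import urlencode

API_BASE = "https://api.aylien.com/api/v1"


def url_template(endpoint: str, **fixed_params: str) -> str:
    """Build a url parameter value that ends with the <%text%> placeholder."""
    parts = []
    if fixed_params:
        # Fixed options such as mode=tweet are encoded once, up front.
        parts.append(urlencode(fixed_params))
    # The placeholder is left as-is so RapidMiner can fill it in per row.
    parts.append("text=<%text%>")
    return f"{API_BASE}/{endpoint}?{'&'.join(parts)}"


# Reproduces the url used in Step 2.
print(url_template("sentiment", mode="tweet"))
# Assumed endpoint path, for illustration only.
print(url_template("classify"))
```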
Next stop: analyzing movie reviews
In the second part of this series, we’re going to crawl Rotten Tomatoes with RapidMiner to extract movie reviews and analyze their sentiment to gain some interesting insights.