RapidMiner Wisdom recap: The value in analyzing unstructured data sources

We’ve just returned from RapidMiner Wisdom 2016 in New York City. This fantastic event, organised by our partners RapidMiner, saw data scientists, academics and the newly coined citizen data scientists gather to discuss everything related to RapidMiner and predictive analytics.

We saw some really great presentations on a wide range of topics. RapidMiner evangelists spoke about the different ways in which they’re using RapidMiner’s predictive capabilities to ingest and mine a huge range of data sources and create actionable business outcomes. The talks ranged from Process Mining at Siemens to Risk Analysis at PwC, and we were even introduced to a super cool AYLIEN-powered stock recommender system called Mr.INV, built by the impressive guys at NYU.

We were primarily at RapidMiner Wisdom to showcase our new Text Analysis Extension for RapidMiner, a text analytics plugin that brings the smarts of Natural Language Processing and Text Analysis to any RapidMiner process.

Perhaps we’re somewhat biased, but the key thing we took away from the talks was how crucial mining unstructured data, particularly text, was in each speaker’s data analytics process.

There’s no doubt that the benefits of mining unstructured datasets are coming to the fore in the data mining world. There are two reasons for this: the sheer amount of unstructured content being created today, and the realization that there is a wealth of information hidden in this user-generated content (social conversations, email, news articles, blog posts, etc.).

But mining this information can be quite the challenge…

“More content was uploaded yesterday than any one human could ever consume in their entire lifetime”

– Conde Nast, 2015

And it’s not just a problem on the web: unstructured data now accounts for nearly 90% of enterprise data.

So while the content we’re generating is rich in useful business and research insight, for the most part it’s unstructured and therefore extremely difficult to understand and mine, especially at the scale required. It’s easy for a human to read some content and pull out the key points, but it’s impossible for us humans to stay on top of the amount of content being created every second of every day. And while it’s getting easier, it’s still difficult for machines to make sense of this unstructured information in the same way they do with structured data.

Structured & Unstructured data

So what’s the difference between Unstructured and Structured data and why is it important to leverage unstructured sources in our Data strategies?

Unstructured data is primarily user generated. It’s not stored in a traditional table or database, it’s noisy and there is a hell of a lot of it. Structured data, by contrast, is easily referenceable, stored in a table or database, often numbers-heavy and easily ingested by machines.

Our presentation at Wisdom focused on the voice of the customer and the benefits of analyzing unstructured content on social media to mine public opinion towards a certain event or even a brand.

To showcase what could be done, we took a real-world event, in this case Web Summit 2015, collected tweets about the event and visualized our findings, all within RapidMiner.

Some of the highlights of the talk:

The Web Summit saw 42,000 attendees from 134 countries descend on Dublin. There were 1,000 speakers, including some pretty big names in the tech industry (Michael Dell, John Sculley, Ed Catmull, Benedict Evans). In total we collected 199,054 tweets over 3 days and 4 nights, analyzed them in RapidMiner with our Text Analysis extension and visualized our findings in Tableau.


We wanted to see how much chatter there was around the different speaker sessions. To do this we extracted the names of the speakers mentioned and graphed the volume over the 3 days. The graph below shows the volume of tweets with a mention of one of the speakers. You can clearly see spikes in volume when they hit the stage to speak. Tweets mentioning Paddy Cosgrave, Web Summit’s founder, stayed pretty constant throughout and, not too surprisingly, John Sculley had probably the biggest reaction of all the speakers extracted from the data set.
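For anyone who wants a feel for the counting step outside RapidMiner, here is a minimal Python sketch. It is not the process we ran for the talk: the sample tweets, the `SPEAKERS` list and the `mentions_per_hour` helper are all made up for illustration, and plain substring matching stands in for proper entity extraction.

```python
from collections import Counter
from datetime import datetime

# Hypothetical speaker list and made-up sample tweets, for illustration only.
SPEAKERS = ["Michael Dell", "Ed Catmull", "Paddy Cosgrave"]

tweets = [
    ("2015-11-03 10:05", "Michael Dell is on stage now #websummit"),
    ("2015-11-03 10:40", "Great talk from Michael Dell"),
    ("2015-11-03 11:15", "Paddy Cosgrave opening the day"),
    ("2015-11-04 14:20", "Ed Catmull on creativity at Pixar"),
]


def mentions_per_hour(tweets, speakers):
    """Count tweets mentioning each speaker, bucketed by hour."""
    counts = Counter()
    for timestamp, text in tweets:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.strptime(timestamp, "%Y-%m-%d %H:%M").replace(minute=0)
        lowered = text.lower()
        for speaker in speakers:
            # Assumption: a case-insensitive substring match is good enough here.
            if speaker.lower() in lowered:
                counts[(speaker, hour)] += 1
    return counts


volume = mentions_per_hour(tweets, SPEAKERS)
for (speaker, hour), n in sorted(volume.items()):
    print(f"{hour:%Y-%m-%d %H:00}  {speaker}: {n}")
```

Plotting these hourly counts per speaker gives the kind of volume-over-time chart described above, with a spike wherever a speaker’s mentions cluster in a single hour.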

Volume of tweets:

While the activity was quite constant over the three days, you can see three major spikes in the volume of tweets, one for each day. The drop in volume as each day progressed makes it pretty clear that people were enjoying themselves too much at the Night Summit, out drinking Guinness in the pubs of Dublin, to be tweeting. There was also a pretty evident dip in activity during lunch, which suggests everyone was too busy taking advantage of the networking opportunities the lunch break provided.

The other thing we did was analyze the polarity of the tweets we mined, i.e. whether they were positive or negative. We hoped to get a feel for people’s reactions to the event by mining and analyzing the voice of attendees through their expressions and activity on Twitter. Overall the sentiment around the event was quite positive; however, some negative trends crept in throughout the event.
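Once every tweet carries a polarity label, the overall picture is just a matter of counting. The sketch below assumes the labels already exist (as a sentiment classifier might return them); the labels and the `polarity_share` helper are hypothetical and only illustrate the aggregation, not the classification itself.

```python
from collections import Counter

# Hypothetical polarity labels, one per tweet, made up for illustration.
labels = ["positive", "positive", "neutral", "negative", "positive", "neutral"]


def polarity_share(labels):
    """Return the fraction of tweets carrying each polarity label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


share = polarity_share(labels)
for label, fraction in sorted(share.items()):
    print(f"{label}: {fraction:.0%}")
```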

Polarity of Tweets:

One of the key themes we noticed was a lot of negativity towards certain aspects of the event. Anyone who attended this year can probably guess which aspect people were most vocal and frustrated about. In 2014 it was the Wi-Fi, and in 2015 the food on offer at Web Summit was certainly an interesting talking point online.

Negative theme (Food):

In the graph above we’ve mapped the number of tweets mentioning the food over time and charted the steep decline in the polarity of those tweets as people began to get frustrated. You can also see the attendees’ reaction to Web Summit coming out, addressing the issue and taking it on the chin, with the spike back into the green zone.
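Tracking a single theme like this comes down to filtering tweets by keyword and averaging their polarity per time bucket. Here is a rough Python sketch of that idea. The records, the `KEYWORDS` tuple, the `SCORE` mapping and the `theme_polarity_by_hour` helper are all assumptions for illustration, not the process behind the graph.

```python
from collections import defaultdict

# Hypothetical records: (hour of day, polarity label, tweet text).
records = [
    (12, "negative", "The food queue is endless"),
    (12, "negative", "Paid how much for this lunch?"),
    (13, "negative", "Still no food in sight"),
    (13, "positive", "Loved the keynote"),
    (15, "positive", "Fair play for addressing the food issue"),
]

# Assumed numeric scores for each label and assumed keywords for the theme.
SCORE = {"positive": 1, "neutral": 0, "negative": -1}
KEYWORDS = ("food", "lunch")


def theme_polarity_by_hour(records, keywords):
    """Average polarity score per hour for tweets mentioning any keyword."""
    scores = defaultdict(list)
    for hour, label, text in records:
        if any(keyword in text.lower() for keyword in keywords):
            scores[hour].append(SCORE[label])
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(scores.items())}


trend = theme_polarity_by_hour(records, KEYWORDS)
for hour, average in trend.items():
    print(f"{hour}:00  average polarity {average:+.1f}")
```

An average below zero means the theme is trending negative in that hour, and a move back above zero is the kind of recovery the spike into the green zone shows.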

The key message we wanted to get across in our talk wasn’t just how easy it is to do these complex content analysis processes in RapidMiner but also the wealth of information that is hidden in unstructured content online that is too often overlooked in data mining strategies.

We’ll be talking more about this topic on the 16th of February in a joint webinar with RapidMiner where we’re going to analyze the public reaction to the Super Bowl 50 ad wars to try and determine which brand comes out on top. Sign up here.
