Text Analytics meets 2014 World Cup tweets – Part 1
The FIFA World Cup is without doubt the biggest sporting event in the world, with millions of fans and viewers around the globe using social media to share their thoughts and emotions about the games, teams and players, and creating massive amounts of content in the process.
That’s why at AYLIEN we decided to collect some of this data using Twitter’s Streaming API and analyze tweets related to the World Cup, looking for interesting insights and correlations.
We are going to explore how you can use text analysis techniques to dig into some of this data in a series of blog posts.
In Part 1 of the series, we’re going to get a high-level view of our data and look for some basic insights about the tournament.
Data and Tools
Data: datasets used in this blog post are as follows:
- tweets.csv: Around 30 million tweets (80 million including retweets, which are omitted) collected between June 6th and July 14th using the Twitter Streaming API, and filtered by some of the official World Cup hashtags (e.g. “#WorldCup” and “#Brazil2014”), as well as team code hashtags (e.g. “#ARG” and “#GER”) and Twitter usernames of teams and players. (Note: we’re assuming that Twitter samples the tweets uniformly, without any major side effect on their distribution.)
- matches.csv: Information about the 64 matches, such as match time and results, obtained using the World Cup json project.
- events.csv: Information about match events such as goals, substitutions and cards, obtained using the World Cup json project.
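The collection itself went through the Streaming API, but the same filtering rules can be approximated offline. Here is a minimal sketch, assuming a "text" column and treating any tweet that starts with "RT @" as a retweet. The column names, the sample rows and the short hashtag list are all our own stand-ins for illustration, not the actual schema or the full list of tracked terms.

```python
import csv
import io
import re

# Hypothetical subset of tracked hashtags; the full list is not given in the post.
TRACKED_HASHTAGS = {"#worldcup", "#brazil2014", "#arg", "#ger"}

HASHTAG_RE = re.compile(r"#\w+")


def is_retweet(text):
    """Treat tweets that start with 'RT @' as retweets (a common heuristic)."""
    return text.startswith("RT @")


def filter_tweets(rows):
    """Keep non-retweets that contain at least one tracked hashtag."""
    kept = []
    for row in rows:
        text = row["text"]
        if is_retweet(text):
            continue
        hashtags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        if hashtags & TRACKED_HASHTAGS:
            kept.append(row)
    return kept


# Tiny in-memory stand-in for tweets.csv; the column names are assumed.
SAMPLE_CSV = """id,text
1,What a goal! #WorldCup
2,RT @fan: What a goal! #WorldCup
3,Lunch time
4,Vamos #ARG
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
filtered = filter_tweets(rows)
print([row["id"] for row in filtered])
```

Matching on lowercased hashtags keeps “#WorldCup” and “#worldcup” in the same bucket, which seems like the safer default for user-typed tags.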
Let’s start our quest by taking a look at the matches and their events, such as goals, substitutions and red and yellow cards:
Things to note:
- The number of matches with 5 or more yellow cards tends to increase in later-stage games, possibly because more is at stake and these matches are more intense.
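One way to check the yellow card observation is to count cards per match and then compare stages. The sketch below uses made-up rows and assumed field names ("match_id", "stage", "type"), not the real layout of matches.csv and events.csv. It also reports a share per stage rather than a raw count, which is our own choice, since the group stage has far more matches than the knockout rounds.

```python
from collections import Counter, defaultdict

# Made-up rows standing in for matches.csv and events.csv; field names are assumed.
matches = [
    {"match_id": 1, "stage": "Group"},
    {"match_id": 2, "stage": "Group"},
    {"match_id": 3, "stage": "Round of 16"},
]
events = (
    [{"match_id": 1, "type": "yellow-card"}] * 2
    + [{"match_id": 1, "type": "goal"}]
    + [{"match_id": 2, "type": "yellow-card"}] * 5
    + [{"match_id": 3, "type": "yellow-card"}] * 6
)


def share_of_heavy_card_matches(matches, events, threshold=5):
    """Per stage, return the fraction of matches with >= threshold yellow cards."""
    yellows = Counter(e["match_id"] for e in events if e["type"] == "yellow-card")
    totals = defaultdict(int)
    heavy = defaultdict(int)
    for match in matches:
        totals[match["stage"]] += 1
        if yellows[match["match_id"]] >= threshold:
            heavy[match["stage"]] += 1
    return {stage: heavy[stage] / totals[stage] for stage in totals}


shares = share_of_heavy_card_matches(matches, events)
print(shares)
```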
Now let’s take a look at a breakdown of the most popular languages used in our tweets dataset:
Things to note:
- English, followed by Spanish and Portuguese, are the three most used languages in our tweets dataset.
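A language breakdown like this comes down to counting a language code per tweet. Here is a minimal sketch, assuming each tweet carries a "lang" field. The field name and the sample counts are illustrative only and do not reflect the real dataset.

```python
from collections import Counter

# Made-up sample; assumed: each tweet carries a language code in a "lang" field.
tweets = (
    [{"lang": "en"}] * 4
    + [{"lang": "es"}] * 3
    + [{"lang": "pt"}] * 2
    + [{"lang": "fr"}]
)


def language_breakdown(tweets, top_n=3):
    """Return the top_n languages as (code, count, share of all tweets)."""
    counts = Counter(t["lang"] for t in tweets)
    total = sum(counts.values())
    return [(lang, n, n / total) for lang, n in counts.most_common(top_n)]


top = language_breakdown(tweets)
for lang, n, share in top:
    print(f"{lang}: {n} tweets ({share:.0%})")
```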
Next we’ll have a look at the distribution of geo-tagged tweets over different countries around the globe, along with their languages:
Tweets and events
Plotting the total volume of tweets over time shows a repeating pattern of spikes appearing at match times and also at times when a major event has occurred (such as elimination of a team, qualification for the next round, or shocking results). Let’s have a look at a few examples:
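To make the idea of a spike concrete, here is a rough sketch that buckets tweet timestamps into hours and flags any hour whose count is well above the median. The ISO timestamp format, the threshold factor of 3 and the sample data are all assumptions for illustration. This is not the method used to produce the charts.

```python
from collections import Counter
from datetime import datetime
from statistics import median


def hourly_volume(timestamps):
    """Count tweets per hour; timestamps are ISO 8601 strings (assumed format)."""
    counts = Counter()
    for ts in timestamps:
        hour = datetime.fromisoformat(ts).replace(minute=0, second=0, microsecond=0)
        counts[hour] += 1
    return dict(sorted(counts.items()))


def find_spikes(volume, factor=3.0):
    """Flag hours whose count exceeds factor times the median hourly count."""
    baseline = median(volume.values())
    return [hour for hour, n in volume.items() if n > factor * baseline]


# Made-up timestamps: a quiet evening with a burst at 20:00.
timestamps = (
    ["2014-07-08T17:15:00", "2014-07-08T18:05:00", "2014-07-08T19:40:00"]
    + ["2014-07-08T20:%02d:00" % m for m in range(10)]
    + ["2014-07-08T21:30:00"]
)

volume = hourly_volume(timestamps)
spikes = find_spikes(volume)
print(spikes)
```

The median is used as the baseline because a single large burst pulls a mean upward much more than it pulls a median.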
1. Tweet volume by Language
In these examples, we’re going to see how the volume of tweets in a language is affected by the matches and critical events related to teams from countries where that language is spoken (also note the trend lines in black):
English. Teams: USA, England, Australia, Cameroon and Nigeria.
German. Teams: Germany and Switzerland.
French. Teams: France, Belgium, Algeria, Cameroon and Côte d’Ivoire.
Spanish. Teams: Spain, Argentina, Mexico, Uruguay, Chile, Costa Rica, Ecuador, Honduras and Colombia.
Portuguese. Teams: Brazil and Portugal.
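For readers who want to build similar per-language series, here is a small sketch that groups tweets by language and day, then smooths one series with a trailing moving average. The moving average is our stand-in for a trend line. The post does not say how the trend lines in the charts were computed, and the (day, language) pairs below are made up.

```python
from collections import defaultdict


def daily_volume_by_language(tweets):
    """Build {lang: {day: count}} from (day, lang) pairs."""
    series = defaultdict(lambda: defaultdict(int))
    for day, lang in tweets:
        series[lang][day] += 1
    return {lang: dict(days) for lang, days in series.items()}


def moving_average(values, window=3):
    """Trailing moving average; early points average over what is available."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up (day, language) pairs.
tweets = (
    [("06-12", "pt")] * 6
    + [("06-13", "pt")] * 3
    + [("06-14", "pt")] * 3
    + [("06-12", "de")] * 1
    + [("06-13", "de")] * 2
)

by_lang = daily_volume_by_language(tweets)
pt_days = sorted(by_lang["pt"])
pt_counts = [by_lang["pt"][day] for day in pt_days]
trend = moving_average(pt_counts)
print(pt_counts, trend)
```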
2. Tweet volume during matches
A similar pattern can be observed at a smaller scale during matches, with spikes appearing for each goal or major event. Let’s see an example from the Brazil – Germany match:
3. Tweet volumes for different teams
Finally, let’s take a look at how the volume of tweets that mention a team changes over time for the four teams that reached the semi-finals. For each team we are counting mentions of the team’s full name (e.g. “Germany”) as well as its team code hashtag (e.g. “#GER”):
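The counting rule described here (full name or team code hashtag) can be sketched with one regular expression per team. The case-insensitive, whole-word matching and the choice to count each tweet at most once per team are assumptions on our part, and the sample tweets are made up.

```python
import re
from collections import Counter

# Full name and team code hashtag for the four semi-finalists.
TEAMS = {
    "Germany": "#GER",
    "Brazil": "#BRA",
    "Argentina": "#ARG",
    "Netherlands": "#NED",
}


def build_pattern(name, hashtag):
    """Match the full name as a whole word, or the hashtag, case-insensitively."""
    return re.compile(
        r"\b%s\b|%s\b" % (re.escape(name), re.escape(hashtag)),
        re.IGNORECASE,
    )


PATTERNS = {name: build_pattern(name, tag) for name, tag in TEAMS.items()}


def mentions_per_day(tweets):
    """Count, per (day, team), the tweets that mention the team at least once."""
    counts = Counter()
    for day, text in tweets:
        for team, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[(day, team)] += 1
    return counts


# Made-up (day, text) pairs.
tweets = [
    ("07-08", "Germany are unstoppable #GER"),
    ("07-08", "Heartbreak for #BRA tonight"),
    ("07-08", "brazil 1-7 germany, unbelievable"),
    ("07-09", "Penalties! #ARG go through, #NED out"),
]
counts = mentions_per_day(tweets)
print(counts[("07-08", "Germany")], counts[("07-08", "Brazil")])
```

Note that the first sample tweet contains both “Germany” and “#GER” but is counted only once for Germany, so the numbers reflect tweets rather than individual mentions.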
Subscribe to our blog and stay tuned for part 2, where we use Text Analytics to dig deep into the tweets’ contents.
Update: here is the second part of the series.