This blog post is a simplified version of a more technical experiment we conducted, in which we collected, analyzed and visualized about 77,000 tweets during the Web Summit 2015. You may recall that last year we did something similar, analyzing the sentiment of tweets collected for our 2014 report on the Web Summit according to Twitter. This year we wanted to take a different approach and have some fun with some common and not so common models and algorithms used in the NLP space.
If you’re a bit more of an algo geek and would like to learn more about how we did it, make your way to our step by step guide on the same experiment. Here you can download the datasets, code and the libraries we used.
First off, we wanted to understand the tweets we collected to get a better handle on what people were talking about during the Web Summit. We decided it would be cool to then try to group similar tweets closer together and visualize them on a 2D map.
For a while we’ve been looking to put some classical and not so classical NLP methods to the test on social media data. We used the following methods: tf-idf scoring, K-means clustering, Latent Dirichlet Allocation (LDA), averaged word vectors (using GloVe word embeddings), Paragraph Vectors (using gensim’s doc2vec implementation) and finally Skip-Thought Vectors. With these we built a series of somewhat interesting visualizations, which we’ll run through in a way that hopefully the non-ML or NLP expert can understand.
We like pretty graphs here at AYLIEN, so in order to visualize the tweets on a 2D map we used a dimensionality reduction algorithm called t-SNE to give each tweet an (x, y) coordinate in a 2D space. This allows us to put the tweets on a scatter plot. You can read more about the process involved in the how-to guide.
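As a rough illustration of the t-SNE step, here is a minimal sketch using scikit-learn's TSNE. The random feature matrix is only a stand-in for real tweet vectors, and the perplexity, initialization and seed are assumptions picked for a toy dataset rather than the settings from our experiment.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in data: 60 "tweets", each described by a 20-dimensional feature vector.
# In practice these rows would be tf-idf scores, topic weights, embeddings, etc.
rng = np.random.RandomState(0)
features = rng.rand(60, 20)

# Reduce to 2 dimensions. perplexity must be smaller than the number of rows;
# the value used here is an arbitrary choice for a small example.
tsne = TSNE(n_components=2, perplexity=10, init="random", random_state=0)
coords = tsne.fit_transform(features)

# Each row of coords is an (x, y) position that can be drawn on a scatter plot.
print(coords.shape)
```

Every method described in this post ends with a step like this one: whatever vector we compute for a tweet, t-SNE turns it into a point on the map.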
Here’s what we used:
Tools and Libraries:
- Bokeh (for visualizations)
- scikit-learn (for clustering and dimensionality reduction)
- NLTK (tokenization and stopword removal)
- gensim (for paragraph vectors, i.e. doc2vec)
- Skip-Thought Vectors
- LDA for Python
Data and Models:
- GloVe vectors trained on tweets [download]
- About 77,000 tweets collected between Nov 2nd and Nov 6th 2015, that mention one of the official Web Summit hashtags, and/or handles of a few hand-picked notable speakers [download bzipped JSON]
- Skip-Thought model files, trained on the BookCorpus dataset [download]
Tf-idf
Tf-idf stands for term frequency-inverse document frequency. With tf-idf we’re trying to measure how important a word is to a particular tweet: a word scores highly when it appears often in that tweet but rarely across the rest of the collection. We took the tf-idf vectors for each tweet and fed them to t-SNE to place them in the 2D space.
Here’s what it looks like when displayed on a 2D scatter plot. Each blue dot represents a tweet; you can hover over a dot to read that tweet. For an explanation of how it’s done, see the step by step guide.
The tweets are separated out quite nicely, and have even formed some clusters or homogeneous ‘islands’. For example, there’s an island of tweets in the top left about “pub crawls”.
But there are two fundamental issues with this chart: there doesn’t seem to be a rigid notion of grouping, and when we investigated further we found that the separation of tweets is based more on keywords than on concepts or semantics.
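To make the scoring concrete, here is a small sketch that computes a textbook form of tf-idf by hand, using only the Python standard library. The three tweets are made up, whitespace splitting stands in for proper tokenization, and the formula (term frequency times log(N / document frequency)) is the plain variant; libraries such as scikit-learn use a slightly different, smoothed formula.

```python
import math
from collections import Counter

# Made-up example tweets, tokenized naively on whitespace.
tweets = [
    "great talk on the main stage",
    "pub crawl tonight after the summit",
    "great pub crawl last night",
]
docs = [t.split() for t in tweets]

# Document frequency: in how many tweets does each word appear?
df = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Return {word: tf-idf score} for one tokenized tweet."""
    counts = Counter(doc)
    return {
        word: (count / len(doc)) * math.log(len(docs) / df[word])
        for word, count in counts.items()
    }

scores = [tfidf(doc) for doc in docs]

# "stage" occurs in one tweet only, "the" in two, so "stage" scores higher.
print(scores[0]["stage"] > scores[0]["the"])
```

Stacking these per-tweet scores into one row per tweet and one column per word gives the matrix that is passed on to t-SNE.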
K-means
K-means is a popular clustering algorithm that partitions the data points into a predefined number of clusters (K), assigning each point to the cluster whose mean (centre) is closest. In this case we decided to create a manageable number of clusters (10) using MiniBatchKMeans from scikit-learn, and then we fed each tweet’s distance from every cluster centre to t-SNE to build a 2D plot.
To better understand what’s in each cluster, we looked at the top 10 features (words) for each of our 10 clusters. You can see them listed below:
Cluster 0: audipitch vote pitch unklerikoe support retweet websummithq best voting goes
Cluster 1: people talk smartwatch world need blocks way technology personal modular
Cluster 2: thanks great hosting server misconceptions blogging common site linux windows
Cluster 3: interested offer attend special maybe would startups trial sageone get
Cluster 4: ll dell michael we you love tomorrow it stage see
Cluster 5: day us meet come stand today visit dublin village green
Cluster 6: smile today see get live it like tech startup new
Cluster 7: great good day talk see dublin time looking today meet
Cluster 8: stage re we centre you marketing main come machine dublin
Cluster 9: dublin summit web ireland tech live day blog night startup
As you can see, some of the consolidation is quite meaningful: Cluster 0 seems to be dealing with startups and pitch competitions, Cluster 1 encompasses technology-related topics, Cluster 2 seems to be about server technologies, and Cluster 8 and Cluster 9 seem to be about the conference itself.
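The clustering steps in this section can be sketched roughly as follows with scikit-learn. The six tweets, the choice of 2 clusters and 3 top words, and the seed are all assumptions made to keep the example tiny; the experiment itself used 10 clusters and the top 10 words.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up example tweets.
tweets = [
    "vote for our pitch today",
    "please vote and support our pitch",
    "best pitch gets your vote",
    "come visit our stand today",
    "visit us at our stand in the village",
    "meet us at the stand",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(tweets)

# K is the number of clusters we ask for: 2 here, 10 in the experiment.
kmeans = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
labels = kmeans.fit_predict(tfidf)

# transform() gives, for every tweet, its distance to each cluster centre.
# This (n_tweets x K) matrix is the input that would be handed to t-SNE.
distances = kmeans.transform(tfidf)

# vocabulary_ maps word -> column index; invert it to read columns as words.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

top_terms = {}
for cluster_id, centre in enumerate(kmeans.cluster_centers_):
    # The highest-weighted columns of a centre are its most typical words.
    best = np.argsort(centre)[::-1][:3]
    top_terms[cluster_id] = [terms[i] for i in best]
    print("Cluster %d: %s" % (cluster_id, " ".join(top_terms[cluster_id])))
```

Because the cluster centres live in the same space as the tf-idf vectors, the top words of a centre are still keywords, which is consistent with the keyword-driven islands we saw in the chart.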
The separation has improved in this graph, but there are some overlaps between clusters and the islands are still formed around keywords and keyphrases.
Latent Dirichlet Allocation
Next we wanted to uncover some of the latent topics present in the tweets. To do this we used a well-known topic modeling algorithm called LDA, and specified that we wanted it to identify 15 topics.
We were then left with 15 “groups” of tweets that needed to be manually tagged to provide labels for each topic.
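Here is a hedged sketch of the topic modeling step. Note that it swaps in scikit-learn's LatentDirichletAllocation rather than the LDA for Python package listed above, and that the tweets, the 2 topics and the 4 top words are assumptions chosen for a toy example.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Made-up example tweets.
tweets = [
    "virtual reality drones and robots are the future",
    "the future of technology is virtual reality",
    "robots and drones on stage",
    "free party tonight in dublin",
    "night summit party tonight",
    "dublin night out after the party",
]

# LDA works on raw word counts rather than tf-idf scores.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(tweets)

# n_components is the number of topics: 2 here, 15 in the experiment.
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# One row per tweet, one column per topic; each row sums to 1.
doc_topics = lda.fit_transform(counts)

# vocabulary_ maps word -> column index; invert it to read columns as words.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

topic_words = []
for topic_id, weights in enumerate(lda.components_):
    best = np.argsort(weights)[::-1][:4]
    topic_words.append([terms[i] for i in best])
    print("Topic %d: %s" % (topic_id, " ".join(topic_words[topic_id])))
```

The model only returns lists of words like these, which is why the topics still have to be labeled by hand. The rows of doc_topics are the per-tweet vectors that can then be placed on the 2D map.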
The topics were broken down as follows:
Topic 0: year lisbon tinder paddy free interview ireland like
Topic 1: like smile sheep ve cool love know got
Topic 2: vr world technology reality talk future drones robots
Topic 3: summit dublin web day ready start ireland morning
Topic 4: come stand booth today visit say hi tomorrow
Topic 5: mobile dell michael platform apps talk come fintech
Topic 6: stage live centre ceo dublin periscope nov tuesday
Topic 7: audipitch pitch vote smartwatch blocks support best open
Topic 8: dublin nightsummit tonight smile free night day party
Topic 9: great looking day thanks forward today good amazing
Topic 10: meet let startup today love dublin week great
Topic 11: data stage future talk design google machine iot
Topic 12: content marketing social digital media facebook video new
Topic 13: people don make iew ebstaff want need like
Topic 14: tech startup startups great business ireland sagementorhours stage
Most of these topics were in some way coherent: Topics 0 and 3 represent the conference itself, Topic 2 seems to be about virtual reality and the future of technology, Topic 4 is about networking, Topic 7 is about the pitch contests and wearable technology, and Topic 8 deals with the ‘fun’ part of the conference (nights out, food, partying). The likes of Topics 1 and 5, on the other hand, were quite random.
We were quite pleased with this chart: there are entire islands dedicated to marketing or partying, without those words being explicitly mentioned. This was a solid step towards a concept- or topic-based representation of the tweets that wasn’t focused on keywords.
The chart also showed tweets with similar topics mapped closer together. Pay attention to how close business (grey) and marketing (dark pink) are, and how far innovation/future (orange) is away from fun and partying (purple).
Does that mean the two can’t go hand in hand? Is this graph telling us something? 😉
We also wanted to put some of the less classical, more topical and popular approaches to NLP to the test on the collected tweets.
Averaged GloVe Embeddings
One of those “hot right now” approaches is using word embeddings as a Deep Learning based approach to language processing.
In simple-ish terms, word embeddings such as GloVe are dense vectors, typically of between 25 and 300 dimensions, that capture a lot of information about a word as well as its surrounding context. To represent a whole tweet, we averaged the vectors of its words. Like LDA, this can be seen as another form of dimensionality reduction. It’s an approach that has been shown to be very effective in a wide array of NLP tasks.
These vectors have a set of characteristics that make them very special: vectors of similar words are closer to each other and, conversely, vectors of dissimilar words are far from each other. This means the vectors can be combined to find answers to queries such as “king is to man as queen is to ???” by forming simple equations such as:
v(king) – v(man) = v(queen) – v(woman)
For more on how we implemented the model check out the how-to guide.
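The two ideas above, vector arithmetic and averaging, can be sketched with NumPy. The 2-dimensional vectors below are hypothetical values invented for illustration (real GloVe vectors are loaded from a file and have many more dimensions), and returning a zero vector for a tweet with no known words is an assumption of this sketch.

```python
import numpy as np

# Hypothetical toy "embeddings": dimension 0 ~ royalty, dimension 1 ~ gender.
vectors = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, -1.0]),
    "pub":   np.array([-1.0, 0.0]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(target, exclude=()):
    """Word whose vector is most similar to target, skipping excluded words."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Rearranging v(king) - v(man) = v(queen) - v(woman) to solve for the queen side.
target = vectors["king"] - vectors["man"] + vectors["woman"]
guess = nearest(target, exclude=("king", "man", "woman"))
print(guess)  # queen

def tweet_vector(tweet):
    """Average the vectors of the words we know; unknown words are skipped."""
    known = [vectors[w] for w in tweet.lower().split() if w in vectors]
    if not known:
        return np.zeros(2)  # assumption: all-zero vector when nothing matches
    return np.mean(known, axis=0)

print(tweet_vector("The king and queen"))
```

Averaging throws away word order, but it gives every tweet a vector of the same length, which is all t-SNE needs.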
No clear separation is evident in this chart. However, there are areas that are clearly richer in one or two colors. So while the tweets aren’t separated nicely, the word vectors have worked their magic: neighboring tweets are a lot closer to each other semantically, without necessarily sharing the same keywords.
Paragraph Vector (doc2vec)
Paragraph Vector, or Doc2vec, is an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of text, such as sentences, paragraphs, and documents. It represents each document, or in this case each tweet, by a dense vector that’s trained to predict words in the document.
When we mapped the output of our Doc2vec model, we were a bit disappointed with the visualization. The separation and semantic similarity of neighboring tweets left a lot to be desired. We think it could be improved by plugging in word embeddings trained on an external corpus such as Wikipedia, but we haven’t tried that yet.
Skip-Thought Vectors
Another Deep Learning-based representational model that has become more popular recently is Skip-Thought Vectors, also called Sent2Vec. It’s described as an approach for unsupervised learning of a generic, distributed sentence encoder. In simple terms, an encoder-decoder model is trained on the continuity of text from books: it tries to reconstruct the sentences surrounding an encoded passage, so that sentences sharing semantic and syntactic properties are mapped to similar vectors.
Here we used the implementation and models made available by Ryan Kiros in the “skip-thoughts” GitHub repository to create Skip-Thought Vectors for our tweets.
For a more in-depth look, remember to check out our more technical how-to guide.
This visualization was somewhat of a success: there are some nice consolidations here, such as one where most tweets are about excitement and anticipation, but we couldn’t find a general pattern for topical similarity.
Maybe this is down to the difference between the two contexts: the model we used was trained on books while our documents are tweets, which may include words and phrases that don’t exist in the model’s vocabulary.
We tried several different methods to build a semantic map of our 77k tweets. Each method had its own weaknesses and advantages. The approach using LDA looks the most intuitive to us so far but we’re hoping to tinker a little more with the Deep Learning-based approaches to see how they can be improved.
Here are a few things we’re going to try next:
- Using a different topic model such as the Hierarchical Dirichlet Process (HDP)
- Trying the Word Mover’s Distance similarity metric, together with the GloVe vectors
- Improving our doc2vec model by intersecting it with word vectors obtained from an external, larger corpus (we just need to find one)
- Training a new Skip-Thought model on tweets
Stay tuned for the next instalment in the series, where we’re going to explore in more detail how the topics evolved over the course of the event and dive into the sentiment towards each topic.
If you’d like to try it out yourself, all the data, models and code we used are available for you here.