
It takes less than four minutes for a piece of news to spread across online, TV and radio. In the office or on the road, Streem connects you with news monitoring from each source, mention alerts, keyword and industry tracking, and real-time audience analytics – delivered live to your desktop or mobile device within a minute of publication or broadcast.

Track competitors, produce reports, take news intelligence wherever you go with Australia’s fast, flexible and trusted news intelligence platform.

## Background

Media monitoring and aggregation apps are changing the way news is discovered and consumed. Competition within this space is increasing immensely as each new app and service promises to deliver a more personalized and streamlined experience than those that have come before. Ultimately, the winners and losers in this battle for market share will be decided by those who best understand the content they are sharing, and use this knowledge to provide cutting edge personalization, in-depth analytics and reader satisfaction.

## The Challenges

Frustrated with both the accuracy and ROI they were seeing from an incumbent solution, Streem decided to evaluate their options in sourcing an alternative provider. They had three key points of consideration in evaluating and benchmarking various solutions: performance, cost and setup investment.

Streem’s customers require targeted, informed and flexible news alerts based on their individual interests. Therefore, what the team at Streem required was a fast, API-based service that allowed them to analyze large streams of content in as close to real-time as possible.

Dealing with vast amounts of content, Streem needed the ability to intelligently identify mentions of people, organizations, keywords and locations while categorizing content into standardized buckets. An automated workflow would allow them to scale their monitoring beyond human capabilities and deliver targeted news alerts as close to publication time as possible.

## The Solution

Using the AYLIEN Text Analysis API, Streem have built an automated content analysis workflow which sources, tags and categorizes content by extracting what matters using entity/concept extraction and categorization capabilities.

Key points of information are extracted from each individual piece of content and then analyzed using Natural Language Processing (NLP) and Machine Learning techniques, providing Streem with a more accurate solution, faster time to value and an overall greater return on investment.

“The accuracy of Aylien was higher than competing providers, and the integration process was much simpler.” -Elgar Welch, Streem

### Endpoints used

Streem are using our Entity and Concept Extraction endpoints to identify keywords, mentions of people, places and organizations along with any key information like monetary or percentage values in news articles and blogs, and our Classification endpoint to then categorize content into predefined buckets that suit their users’ tastes.

Let’s take a closer look at each endpoint and how Streem use them within their processes:

#### Entity Extraction

The Entity Extraction endpoint is used to extract named entities (people, organizations, products and locations) and values (URLs, emails, telephone numbers, currency amounts and percentages) mentioned in a body of text or web pages.
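To give a flavour of what value extraction involves (a purely illustrative sketch, not AYLIEN’s or Streem’s actual implementation), here is a toy regex-based extractor for two of the value types mentioned above:

```python
import re

# Toy patterns for two of the value types mentioned above.
# A production system handles far more formats; this is only a sketch.
PERCENT = re.compile(r"\d+(?:\.\d+)?\s?%")
CURRENCY = re.compile(r"[$€£]\s?\d+(?:,\d{3})*(?:\.\d+)?(?:\s?(?:million|billion))?")

def extract_values(text):
    """Return percentage and currency mentions found in `text`."""
    return {
        "percentages": PERCENT.findall(text),
        "currency": CURRENCY.findall(text),
    }

print(extract_values("Revenue grew 12.5% to $4.2 million last quarter."))
```

In practice, named entities (people, organizations, locations) call for statistical models rather than patterns, which is exactly where an API-based service earns its keep.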

Here’s an example from our live demo. We entered the URL for an article from Business Insider and received the following results:

As you can see from the results, mentioned entities are extracted and compiled. By extracting entities, Streem can easily understand what people, places, organizations, products, etc., are mentioned in the content they analyze, making it easy to provide relevant, targeted results to their users.

The Concept Extraction endpoint extracts named entities mentioned in a document, disambiguates and cross-links them to DBpedia and Linked Data entities, along with their semantic types (including DBpedia and schema.org types).

#### Classification

The Classification endpoint classifies, or categorizes, a piece of text according to your choice of taxonomy, either IPTC Subject Codes or IAB QAG.

We took this TechCrunch article on Tesla Motors, analyzed the URL and received the following classification results:

Note the two columns labelled Score and Confident?. By providing confidence scores, users can define their own parameters in terms of which confidence levels to accept, decline or review.
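As a sketch of how such scores might be used downstream (the thresholds and routing rule here are hypothetical, not Streem’s actual setup):

```python
def route_classification(label, confidence, accept_at=0.85, review_at=0.5):
    """Triage a classification result by its confidence score.

    The thresholds are illustrative; each team would tune their own.
    """
    if confidence >= accept_at:
        return ("accept", label)
    if confidence >= review_at:
        return ("review", label)
    return ("decline", label)

print(route_classification("technology", 0.92))  # high confidence
print(route_classification("finance", 0.60))     # needs a human look
```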

## The outcome

Streem now ingest and analyze tens of thousands of pieces of content on a daily basis in near real time. Their backend process, powered by the AYLIEN Text Analysis API, extracts key pieces of information on which their users can build tailored, flexible searches, alerts and informed monitoring capabilities around news events that matter to them.

Using AYLIEN’s state-of-the-art solutions, the team at Streem now have more time to invest in their own product offering, delivering the best news aggregation service possible to their users.

## Introduction

In recent years, the monitoring of social media and news content has become a major aspect of business intelligence strategies, and with good reason too. Analyzing the voice of the customer provides extensive and meaningful insights on how to interpret and learn from consumer behavior. With over 2.3 billion active social media users out there, there’s a wealth of information being generated every second of every day, across a variety of social channels.

Direct access to consumer opinion was traditionally only available in closed and controlled environments like surveys and feedback groups, but today it’s accessible everywhere on the web; on social media, in reviews, in blogs and even news outlets. Hence it has been dubbed the modern-day focus group.

Across social channels, a staggering 96% of people will talk about a brand without actually following its social media accounts. So while a company may be actively responding to direct messages and queries to its own channels, if they’re not paying attention to what is being said elsewhere, they’re missing out on a goldmine of useful and often freely available data and opportunity.

So why isn’t every company out there keeping track of every mention of their brand or products online? There’s simply too much information out there to keep track of manually. Not only does the average Internet user have 5.54 social media accounts, but the sheer volume of chatter and content generated among them is so vast that it would be impossible to even attempt keeping up.

When talking about a company or brand online, in some cases, the consumer will aim their message directly at the company Facebook page or Twitter handle, where it can be picked up and acted upon by a customer care rep. But what about the comments that aren’t written as a direct message?

Depending on the size of your industry, following, or customer base, there could be thousands, if not millions of similar messages scattered across the various social media channels and online review sites. It’s a mammoth challenge, but one that is being conquered and taken advantage of by savvy organizations out there. Enter Text Analysis and Natural Language Processing (NLP).

Recent advancements in Text Analysis and NLP are enabling companies to collect, mine and analyze user generated content and conversations, a level of insight and analysis at a scale that was previously not possible. If you’re not tapping into the wealth of data out there and monitoring each and every mention of your company, brand, product line or even competitors online, you’re missing out on a number of key business opportunities;

• Crisis prevention and damage limitation
• Research and product development
• Customer support and retention

### Crisis prevention and damage limitation

While social media is, for the most part, a public forum, many interactions between a customer and a company online will not be seen by the greater public. In many cases, direct messages to companies on social are handled swiftly and taken to private messaging, out of the public eye, where they can then be handled via email or phone call. However, it is also vital to track, compile and analyze each non-direct interaction and mention of your brand in order to spot any potentially dangerous trends that may be developing. You may, for example, begin to notice a sharp increase in the number of customers complaining about a specific aspect of your product.

What begins on social media as a customer complaint or grievance can very quickly snowball into something far more serious and wind up in mainstream news media, which is truly the last place you want to see your brand being portrayed in a negative light, as its reach and potential virality know no bounds.

Let’s look at Samsung’s recent exploding battery crisis. On August 24, a report of an exploding Samsung Note appeared on Chinese social network Baidu. While it received some attention, one-off stories like this are often attributed to be exactly that, a one-off.

One week later, however, a second and similar report emerged from Korea. These reports were suddenly no longer confined to social channels as mainstream media quickly picked up on a developing story surrounding one of the world’s leading tech companies.

While Samsung were left with no choice but to recall and cease production of the Note 7, this is a prime example of how a crisis can begin with a couple of posts on social media channels and ultimately end up as one of the biggest crises the company has ever had to face.

Although the problem lay in the production of the Note 7, what is interesting to observe is the period of 6-7 days after the initial report of an exploding phone in China. Looking at news sources from this period, there appeared to be no increase in negative publicity for Samsung. In fact, the number of stories about Samsung decreased in the days following the post on Baidu.

The volume of stories written about Samsung trebled almost overnight after reports of a second Note 7 explosion in Korea

As soon as that second explosion was reported in Korea, however, the number of stories being written about Samsung trebled almost overnight.

The successful launch of a product or campaign relies heavily on the initial consumer reaction. Early negative reviews can be difficult to recover from, but by monitoring consumer reactions you are giving yourself a golden opportunity to spot problems early, resolve them in a timely manner and prevent any initial negativity from snowballing.

You can quickly get a picture of what your customers are talking about, what keywords and topics appear most frequently in their commentary and whether the overall sentiment is positive or negative. This doesn’t stop with your own customers, however. You can learn just as much by monitoring mentions of your competitors and their products online.

### Product research & development

From that initial lightbulb moment to the day of product launch, many opinions will be voiced about the direction this process should take. While many will have their say and provide their input, decisions that are made based on solid research data will give the product its best chance of success, both on launch day and beyond. It can be crucial, particularly in the early stages of the process, to identify trends and perform audience segmentation to help define the scope and direction it will take.

Initial research focusing on the consumer need that is to be addressed with the new product or service can focus on a number of key areas, to help pinpoint that market niche. By monitoring the voice of the customer and their reactions to existing competitors, you can quickly develop an understanding about what they are doing well, and what they (or you!) could improve on. By monitoring their  comments at scale, it’s possible to spot certain product or service aspects that are pain-points for your potential customers and react to those business insights.

A great example of this strategy in action was when L’Oreal used social listening to track the challenges people faced when dyeing their hair, what kind of tools they were using and the color effects they desired. Not only did they uncover the trends consumers were following most, they also gained a solid understanding of the issues potential customers faced – which they could solve with help from the company’s R&D department. The resulting launch of their Feria Wild Ombre product proved to be hugely successful and helped L’Oreal widen their market as it appealed to consumers who had previously not been hair-dye users.

L’Oreal’s targeted social listening campaign proved highly successful with the launch of their Feria Wild Ombre range

Monitoring for product development doesn’t stop on launch day, however. Consumer reactions and opinions going forward are equally as important as they were pre-launch. You may be their hero today, but things can very quickly take a turn for the worse, so it is important to continuously track these opinions, learn from them, and ensure your product evolves accordingly.

### Customer support and retention

People love sharing on social media. Whether we’ve just bought a shiny new car, adopted a pet or passed an exam, the chances are that many of us will share our joyous news online. However, should our new car suddenly break down, our resulting online complaints are likely to be seen by twice as many people as our initial positive posting. It’s a harsh reality that companies with an online presence simply have to accept. However, how they chose to monitor and manage such instances can be the crucial differentiator between keeping a customer or losing them to a competitor.

Image: helpscout.net

Clearly, it’s essential to keep on top of negative mentions online and provide a quick solution. We say quick because social media complainers aren’t willing to wait for 1-2 business days to get a reply. In fact, 53% of people expect a brand to respond to their Tweet in less than an hour. If you’re not listening to your customers, your competitor soon will.

It’s not all about fighting fires and resolving customer complaints on social media, however. A report from the Institute of Customer Service showed that 39% of consumers surveyed actively provide feedback to organizations online, while 31% make pre-sales enquiries. These are positive actions that companies can not only profit from, but also analyze in the same way they would negative actions. By looking at and analyzing every angle, a 360 view of consumer perception can be obtained, which enables a company to spot trends and establish their strengths and weaknesses.

## Conclusion

Bringing it all together, we hope that we’ve provided you with some food for thought in relation to how important social media and news monitoring can be to the (initial and ongoing) success of an organization. From idea generation, to tracking your competitors and pleasing/retaining your customers, it can help you to make sense of large amounts of unstructured data online and uncover insights and trends that can boost decision making, influence the evolution of product development and minimize the risk of damaging press emerging.

Word embeddings learned in an unsupervised manner have seen tremendous success in numerous NLP tasks in recent years. So much so that in many NLP architectures, they are close to fully replacing more traditional distributional representations such as LSA features and Brown clusters.

You just have to look at last year’s EMNLP and ACL conferences, both of which had a very strong focus on word embeddings, and a recent post in Communications of the ACM in which word embeddings are hailed as the catalyst for NLP’s breakout. But are they worthy of the hype?

This post is a synopsis of two blogs written by AYLIEN Research Scientist, Sebastian Ruder. You can view Sebastian’s original posts, and more, on Machine Learning, NLP and Deep Learning on his blog.

In this overview we aim to give an in-depth understanding of word embeddings and their effectiveness. We’ll touch on where they originated, we’ll compare popular word embedding models and the challenges associated with them, and we’ll try to answer some common questions and debunk some misconceptions.

We will then look to demystify word embeddings by relating them to the literature in distributional semantics and highlighting the factors that actually account for the success of word embedding models.

## A brief history of word embeddings

Vector space models have been used in distributional semantics since the 1990s. Since then, we have seen the development of a number of models used for estimating continuous representations of words, Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA) being two such examples.

The term word embeddings was originally coined by Bengio et al. in 2003, who trained them in a neural language model together with the model’s parameters. However, Collobert and Weston were arguably the first to demonstrate the power of pre-trained word embeddings in their 2008 paper A unified architecture for natural language processing, in which they establish word embeddings as a highly effective tool when used in downstream tasks, while also announcing a neural network architecture that many of today’s approaches were built upon. It was Mikolov et al. (2013), however, who really brought word embeddings to the fore through the creation of word2vec, a toolkit enabling the training and use of pre-trained embeddings. A year later, Pennington et al. introduced us to GloVe, a competitive set of pre-trained embeddings, signalling that word embeddings had reached the mainstream.

Word embeddings are considered to be among a small number of successful applications of unsupervised learning at present. The fact that they do not require pricey annotation is probably their main benefit. Rather, they can be derived from already available unannotated corpora.

## Word embedding models

Naturally, every feed-forward neural network that takes words from a vocabulary as input and embeds them as vectors into a lower-dimensional space, which it then fine-tunes through back-propagation, necessarily yields word embeddings as the weights of the first layer, which is usually referred to as the Embedding Layer.

Figure 1: A neural language model (Bengio et al., 2006)

The key difference between a network like this and a method like word2vec is its computational complexity, which explains why it wasn’t until 2013 that word embeddings became so prominent in the NLP space. The recent rapid expansion and increased affordability of computational power has certainly aided their emergence.

The training objectives for GloVe and word2vec are another difference, with both geared towards producing word embeddings that encode general semantic relationships and can provide benefit in many downstream tasks. Regular neural networks, in comparison, generally produce task-specific embeddings with limitations in relation to their use elsewhere.

In comparing models, we will assume the following notational standards: We assume a training corpus containing a sequence of $T$ training words $w_1, w_2, w_3, \cdots, w_T$ that belong to a vocabulary $V$ whose size is $|V|$. Our models generally consider a context of $n$ words. We associate every word with an input embedding $v_w$ (the eponymous word embedding in the Embedding Layer) with $d$ dimensions and an output embedding $v'_w$ (another word representation whose role will soon become clearer). We finally optimize an objective function $J_\theta$ with regard to our model parameters $\theta$ and our model outputs some score $f_\theta(x)$ for every input $x$.

### Classic neural language model

The classic neural language model proposed by Bengio et al. [1] in 2003 consists of a one-hidden layer feed-forward neural network that predicts the next word in a sequence as in Figure 2.

Figure 2: Classic neural language model (Bengio et al., 2003)

Their model maximizes what we’ve described above as the prototypical neural language model objective (For simplicity, the regularization term has been omitted):

$J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space f(w_t , w_{t-1} , \cdots , w_{t-n+1})$.

$f(w_t , w_{t-1} , \cdots , w_{t-n+1})$ is the output of the model, i.e. the probability $p(w_t \: | \: w_{t-1} , \cdots , w_{t-n+1})$ as computed by the softmax, where $n$ is the number of previous words fed into the model.

Bengio et al. were among the first to introduce what has come to be known as a word embedding, a real-valued word feature vector in $\mathbb{R}^d$. The foundations of their model can still be found in today’s neural language and word embedding models. They are:

1. Embedding Layer: This layer generates word embeddings by multiplying an index vector with a word embedding matrix;

2. Intermediate Layer(s): One or more layers that produce an intermediate representation of the input, e.g. a fully-connected layer that applies a non-linearity to the concatenation of word embeddings of $n$ previous words;

3. Softmax Layer: The final layer that produces a probability distribution over words in $V$.
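To make the three layers concrete, here is a minimal toy forward pass in numpy (random weights and made-up sizes, purely illustrative; not Bengio et al.’s actual configuration):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, n, h = 10, 4, 3, 8   # vocab size, embedding dim, context words, hidden units

E = rng.normal(size=(V, d))         # 1. embedding matrix (the Embedding Layer)
W_h = rng.normal(size=(n * d, h))   # 2. intermediate (hidden) layer weights
W_o = rng.normal(size=(h, V))       # 3. output weights feeding the softmax

def predict_next(context_ids):
    """Forward pass of a Bengio-style neural LM: embed, concat, tanh, softmax."""
    x = E[context_ids].reshape(-1)  # 1. look up and concatenate n word embeddings
    hid = np.tanh(x @ W_h)          # 2. non-linearity over the concatenation
    logits = hid @ W_o              # 3. one score per word in V
    p = np.exp(logits - logits.max())
    return p / p.sum()              # softmax: probability distribution over V

probs = predict_next([1, 5, 7])
print(probs.sum())  # a valid probability distribution sums to 1
```

Note how step 3 touches every word in $V$, which is exactly the softmax bottleneck discussed below.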

Two observations about this architecture are worth making in relation to current state-of-the-art models:

– The first is that the intermediate layer (2.) can be replaced with an LSTM, as is done in state-of-the-art neural language models [6], [7].

– They also identify the final softmax layer (more precisely: the normalization term) as the network’s main bottleneck, as the cost of computing the softmax is proportional to the number of words in $V$, which is typically on the order of hundreds of thousands or millions.

Discovering methods that alleviate the computational cost related to computing the softmax over a large vocabulary [9] is therefore one of the main challenges in both neural language and word embedding models.

### C&W model

After Bengio et al.’s initial efforts in neural language models, research in word embeddings stalled as computational power and algorithms were not yet at a level that enabled the training of a large vocabulary.

In 2008, Collobert and Weston [4] (thus C&W) demonstrated that word embeddings trained on an adequately large dataset carry syntactic and semantic meaning and improve performance on downstream tasks. In their 2011 paper, they further expand on this [8].

In order to avoid computing the expensive softmax, their solution is to employ an alternative objective function: rather than the cross-entropy criterion of Bengio et al., which maximizes the probability of the next word given the previous words, Collobert and Weston train a network to output a higher score $f_\theta$ for a correct word sequence (a probable word sequence in Bengio’s model) than for an incorrect one. For this purpose, they use a pairwise ranking criterion, which looks like this:

$J_\theta\ = \sum\limits_{x \in X} \sum\limits_{w \in V} \text{max} \lbrace 0, 1 - f_\theta(x) + f_\theta(x^{(w)}) \rbrace$.

They sample correct windows $x$ containing $n$ words from the set of all possible windows $X$ in their corpus. For each window $x$, they then produce a corrupted, incorrect version $x^{(w)}$ by replacing $x$’s centre word with another word $w$ from $V$. Their objective now maximises the distance between the scores output by the model for the correct and the incorrect window with a margin of $1$. Their model architecture, depicted in Figure 3 without the ranking objective, is analogous to Bengio et al.’s model.
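The ranking criterion itself is simple to state in code. A sketch, with plain numbers standing in for the network’s scores $f_\theta$:

```python
def ranking_loss(score_correct, score_corrupted, margin=1.0):
    """C&W pairwise ranking loss: push the correct window's score
    above the corrupted window's score by at least `margin`."""
    return max(0.0, margin - score_correct + score_corrupted)

# Correct window already scores >= 1 higher: no loss, nothing to learn.
print(ranking_loss(2.5, 0.5))
# Corrupted window scores too high: positive loss drives learning.
print(ranking_loss(0.3, 0.8))
```

Because no probability over $V$ is ever computed, the expensive softmax normalization disappears entirely.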

Figure 3: The C&W model without ranking objective (Collobert et al., 2011)

The resulting language model produces embeddings that already possess many of the relations word embeddings have become known for, e.g. countries are clustered close together and syntactically similar words occupy similar locations in the vector space. While their ranking objective eliminates the complexity of the softmax, they keep the intermediate fully-connected hidden layer (2.) of Bengio et al. around (the HardTanh layer in Figure 3), which constitutes another source of expensive computation. Partially due to this, their full model trains for seven weeks in total with $|V| = 130000$.

### Word2Vec

Word2Vec is arguably the most popular of the word embedding models. Because word embeddings are a key element of deep learning models for NLP, it is generally assumed to belong to the same group. However, word2vec is technically not considered a component of deep learning, the reasoning being that its architecture is neither deep nor does it use non-linearities (in contrast to Bengio’s model and the C&W model).

Mikolov et al. [2] propose two architectures for learning word embeddings that, compared with previous models, are computationally less expensive.

Here are two key benefits that these architectures have over Bengio’s and the C&W model:

– They forgo the costly hidden layer.

– They allow the language model to take additional context into account.

The success of their models cannot be attributed to these differences alone; importantly, it also comes from specific training strategies, both of which we will now look at:

#### Continuous bag-of-words (CBOW)

Unlike a language model that can only base its predictions on past words, as it is assessed based on its ability to predict each next word in the corpus, a model that only aims to produce accurate word embeddings is not subject to such restriction. Mikolov et al. therefore use both the $n$ words before and after the target word $w_t$ to predict it as shown in Figure 4. This is known as a continuous bag of words (CBOW), owing to the fact that it uses continuous representations whose order is of no importance.

Figure 4: Continuous bag-of-words (Mikolov et al., 2013)

The objective of CBOW is only marginally different from the language model objective:

$J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \text{log} \space p(w_t \: | \: w_{t-n} , \cdots , w_{t-1}, w_{t+1}, \cdots , w_{t+n})$.

Rather than being fed $n$ previous words, the model receives a window of $n$ words around the target word $w_t$ at each time step $t$.

#### Skip-gram

While CBOW can be seen as a precognitive language model, skip-gram turns the language model objective on its head: rather than using the surrounding words to predict the centre word as with CBOW, skip-gram uses the centre word to predict the surrounding words as can be seen in Figure 5.

Figure 5: Skip-gram (Mikolov et al., 2013)

The skip-gram objective thus sums the log probabilities of the surrounding $n$ words to the left and to the right of the target word $w_t$ to produce the following objective:

$J_\theta = \frac{1}{T}\sum\limits_{t=1}^T\ \sum\limits_{-n \leq j \leq n, j \neq 0} \text{log} \space p(w_{t+j} \: | \: w_t)$.
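As a sketch, here is how skip-gram turns a token sequence into (centre word, context word) training pairs (fixed window, no subsampling):

```python
def skipgram_pairs(tokens, n=2):
    """Generate (centre, context) pairs for a window of n words each side."""
    pairs = []
    for t, centre in enumerate(tokens):
        for j in range(-n, n + 1):
            if j == 0 or not (0 <= t + j < len(tokens)):
                continue  # skip the centre word itself and corpus boundaries
            pairs.append((centre, tokens[t + j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat"], n=1))
```

Each pair contributes one $\text{log} \space p(w_{t+j} \mid w_t)$ term to the objective above.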

## GloVe

In contrast to word2vec, GloVe [5] seeks to make explicit what word2vec does implicitly: Encoding meaning as vector offsets in an embedding space — seemingly only a serendipitous by-product of word2vec — is the specified goal of GloVe.

Figure 6: Vector relations captured by GloVe (Stanford)

To be specific, the creators of GloVe illustrate that the ratio of the co-occurrence probabilities of two words (rather than their co-occurrence probabilities themselves) is what contains information and so look to encode this information as vector differences.

For this to be accomplished, they propose a weighted least squares objective $J$ that directly aims to reduce the difference between the dot product of the vectors of two words and the logarithm of their number of co-occurrences:

$J = \sum\limits_{i, j=1}^V f(X_{ij}) \: (w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \text{log} \: X_{ij})^2$

where $w_i$ and $b_i$ are the word vector and bias respectively of word $i$, $\tilde{w}_j$ and $\tilde{b}_j$ are the context word vector and bias respectively of word $j$, $X_{ij}$ is the number of times word $i$ occurs in the context of word $j$, and $f$ is a weighting function that assigns relatively lower weight to rare and frequent co-occurrences.

As co-occurrence counts can be directly encoded in a word-context co-occurrence matrix, GloVe takes such a matrix rather than the entire corpus as input.
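To make the objective concrete, here is a toy computation of $J$ over a small co-occurrence matrix (random vectors; the weighting function follows the capped power form used by GloVe, with assumed $x_{max} = 100$ and $\alpha = 0.75$):

```python
import numpy as np

def glove_loss(X, W, W_tilde, b, b_tilde, x_max=100.0, alpha=0.75):
    """Weighted least squares GloVe objective over observed co-occurrences."""
    J = 0.0
    for i, j in zip(*np.nonzero(X)):              # only pairs that co-occur
        f = min((X[i, j] / x_max) ** alpha, 1.0)  # down-weight rare pairs, cap frequent ones
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        J += f * diff ** 2
    return J

rng = np.random.default_rng(0)
V, d = 4, 3
X = np.array([[0, 2, 1, 0],
              [2, 0, 3, 1],
              [1, 3, 0, 0],
              [0, 1, 0, 0]], dtype=float)  # toy co-occurrence counts
W, W_t = rng.normal(size=(V, d)), rng.normal(size=(V, d))
b, b_t = rng.normal(size=V), rng.normal(size=V)
print(glove_loss(X, W, W_t, b, b_t))
```

Training then amounts to minimizing $J$ with respect to the vectors and biases, e.g. with AdaGrad.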

## Word embeddings vs. distributional semantics models

Word embedding models such as word2vec and GloVe gained such popularity as they appeared to regularly and substantially outperform traditional Distributional Semantic Models (DSMs). Many attributed this to the neural architecture of word2vec, or the fact that it predicts words, which seemed to have a natural edge over solely relying on co-occurrence counts.

DSMs can be seen as count models as they “count” co-occurrences among words by operating on co-occurrence matrices. Neural word embedding models, in contrast, can be viewed as predict models, as they try to predict surrounding words.

In 2014, Baroni et al. [11] demonstrated that, in nearly all tasks, predict models consistently outperform count models, and therefore provided us with a comprehensive verification for the supposed superiority of word embedding models. Is this the end? No.

With GloVe, we have already seen that the differences are not as obvious: While GloVe is considered a predict model by Levy et al. (2015) [10], it is clearly factorizing a word-context co-occurrence matrix, which brings it close to traditional methods such as PCA and LSA. Even more, Levy et al. [12] demonstrate that word2vec implicitly factorizes a word-context PMI matrix.

While on the surface DSMs and word embedding models use varying algorithms to learn word representations – the former count, the latter predict – both types of model fundamentally act on the same underlying statistics of the data, i.e. the co-occurrence counts between words.

And so the question that we will focus on for the remainder of this post still remains:

Why do word embedding models still outperform DSMs that use very similar information?

## Comparison models

To establish the elements that contribute to the success of neural word embedding models, and to illustrate how they can be transferred to traditional methods, we will compare the following models:

#### Positive Pointwise Mutual Information (PPMI)

PMI is a typical measure for the strength of association between two words. It is defined as the log ratio between the joint probability of two words $w$ and $c$ and the product of their marginal probabilities: $PMI(w,c) = \text{log} \: \frac{P(w,c)}{P(w)\:P(c)}$. As $PMI(w,c) = \text{log} \: 0 = - \infty$ for pairs $(w,c)$ that were never observed, PMI is in practice often replaced with positive PMI (PPMI), which replaces negative values with $0$, yielding $PPMI(w,c) = \text{max}(PMI(w,c),0)$.
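Here is a small sketch of PPMI computed from a toy word-context co-occurrence count matrix:

```python
import numpy as np

def ppmi(C):
    """Positive PMI from a co-occurrence count matrix C (words x contexts)."""
    total = C.sum()
    p_wc = C / total                       # joint probabilities P(w, c)
    p_w = p_wc.sum(axis=1, keepdims=True)  # marginals P(w)
    p_c = p_wc.sum(axis=0, keepdims=True)  # marginals P(c)
    with np.errstate(divide="ignore"):     # log 0 -> -inf for unseen pairs
        pmi = np.log(p_wc / (p_w * p_c))
    return np.maximum(pmi, 0.0)            # clip negatives (and -inf) to 0

C = np.array([[10, 0, 2],
              [ 0, 8, 1],
              [ 3, 1, 5]], dtype=float)
print(ppmi(C))
```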

#### Singular Value Decomposition (SVD)

SVD is among the more popular methods for dimensionality reduction and originally entered NLP via latent semantic analysis (LSA). SVD factorizes the word-context co-occurrence matrix into the product of three matrices $U \cdot \Sigma \cdot V^T$ where $U$ and $V$ are orthonormal matrices (i.e. square matrices whose rows and columns are orthogonal unit vectors) and $\Sigma$ is a diagonal matrix of singular values in decreasing order. In practice, SVD is often used to factorize the matrix produced by PPMI. Generally, only the top $d$ elements of $\Sigma$ are kept, yielding $W^{SVD} = U_d \cdot \Sigma_d$ and $C^{SVD} = V_d$, which are commonly used as the word and context representations respectively.
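A minimal numpy sketch of this truncation (the input matrix is a stand-in for a PPMI matrix):

```python
import numpy as np

def svd_embeddings(M, d=2):
    """Factorize M = U Σ V^T and keep the top-d components as embeddings."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)  # S is in decreasing order
    W = U[:, :d] * S[:d]  # word representations    W^SVD = U_d · Σ_d
    C = Vt[:d].T          # context representations C^SVD = V_d
    return W, C

M = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.5],
              [1.0, 0.5, 0.0]])  # stand-in for a PPMI matrix
W, C = svd_embeddings(M, d=2)
print(W.shape, C.shape)  # each word/context now lives in 2 dimensions
```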

#### Skip-gram with Negative Sampling (SGNS)

Aka word2vec, as shown above.

#### Global Vectors (GloVe)

As shown earlier in this post.

## Hyperparameters

We will focus on the following hyper-parameters:

### Pre-processing

Word2vec suggests three methods of pre-processing a corpus, each of which can be applied to DSMs with ease.

#### Dynamic context window

Normally in DSMs, the context window is unweighted and of a fixed size. Both SGNS and GloVe, however, use a scheme that assigns more weight to closer words, as closer words are generally considered to be more important to a word’s meaning. Additionally, in SGNS, the window size is not fixed: the actual window size is dynamic and sampled uniformly between $1$ and the maximum window size during training.
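A small sketch of the dynamic window scheme, sampling a window size uniformly between 1 and the maximum for each target word (the function and variable names are our own):

```python
import random

def dynamic_window_pairs(tokens, max_window=5, seed=42):
    """Generate (word, context) pairs, sampling the window size uniformly
    between 1 and max_window for each target word, as SGNS does."""
    rng = random.Random(seed)
    pairs = []
    for i, word in enumerate(tokens):
        window = rng.randint(1, max_window)  # dynamic window size
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((word, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs

pairs = dynamic_window_pairs("the quick brown fox jumps".split(), max_window=2)
```

Since every sampled window is at least $1$, directly adjacent words are always paired; more distant words are only paired when a larger window happens to be drawn.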

#### Subsampling frequent words

SGNS dilutes very frequent words by randomly removing words whose frequency $f$ is higher than some threshold $t$ with a probability $p = 1 - \sqrt{\frac{t}{f}}$. As this subsampling is done before actually creating the windows, the context windows used by SGNS in practice are larger than indicated by the context window size.
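The subsampling rule can be sketched as follows; the `freqs` mapping of relative word frequencies and the threshold value are illustrative:

```python
import math
import random

def subsample(tokens, freqs, t=1e-5, seed=0):
    """Drop each token with probability p = 1 - sqrt(t / f),
    where f is the token's relative corpus frequency."""
    rng = random.Random(seed)
    kept = []
    for w in tokens:
        p_drop = max(0.0, 1.0 - math.sqrt(t / freqs[w]))
        if rng.random() >= p_drop:
            kept.append(w)
    return kept

# A very frequent word is mostly removed; a rare one is always kept.
frequent = subsample(["the"] * 1000, {"the": 0.05})
rare = subsample(["aardvark"] * 5, {"aardvark": 1e-6})
```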

#### Deleting rare words

During the pre-processing of SGNS, rare words are also deleted before creating the context windows, which increases the actual size of the context windows further. The actual performance impact of this is insignificant, however, according to Levy et al. (2015).

### Association metric

PMI is widely regarded as a useful metric for measuring the association between two words. Since Levy and Goldberg (2014) have shown that SGNS implicitly factorizes a PMI matrix, two variations stemming from this formulation can be introduced to regular PMI.

#### Shifted PMI

In SGNS, the greater the number of negative samples $k$, the more data is used and the better the estimation of the parameters should be. $k$ affects the shift of the PMI matrix that is implicitly factorized by word2vec, i.e. $k$ shifts the PMI values by $\text{log} \: k$.

If we transfer this to regular PMI, we obtain Shifted PPMI (SPPMI): $SPPMI(w,c) = \text{max}(PMI(w,c) - \text{log} \: k, 0)$.
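Given a precomputed PMI matrix, SPPMI is a one-line transformation; a quick sketch with illustrative values:

```python
import numpy as np

def sppmi(pmi, k=5):
    """Shifted positive PMI: subtract log k, then clip at zero."""
    return np.maximum(pmi - np.log(k), 0.0)

pmi = np.array([[2.0, 0.5],
                [1.8, -0.3]])
shifted = sppmi(pmi, k=5)  # only entries above log 5 ≈ 1.61 survive
```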

#### Context distribution smoothing

In SGNS, the negative samples are sampled according to a _smoothed_ unigram distribution, i.e. a unigram distribution raised to the power of $\alpha$, which is empirically set to $\frac{3}{4}$. This leads to frequent words being sampled relatively less often than their frequency would indicate.

We can transfer this to PMI by equally raising the frequency of the context words $f(c)$ to the power of $\alpha$:

$PMI(w, c) = \text{log} \frac{p(w,c)}{p(w)p_\alpha(c)}$ where $p_\alpha(c) = \frac{f(c)^\alpha}{\sum_c f(c)^\alpha}$ and $f(x)$ is the frequency of word $x$.
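This variant can be sketched by mirroring the plain PMI computation, but raising the context counts to the power $\alpha$ before normalising (the count matrix below is illustrative):

```python
import numpy as np

def smoothed_pmi(counts, alpha=0.75):
    """PMI with the context distribution raised to the power alpha."""
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    f_c = counts.sum(axis=0)
    p_c_alpha = f_c ** alpha / (f_c ** alpha).sum()  # smoothed P_alpha(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c_alpha[None, :]))
    pmi[~np.isfinite(pmi)] = 0.0
    return pmi

counts = np.array([[3.0, 1.0],
                   [1.0, 3.0]])
smoothed = smoothed_pmi(counts)  # alpha = 0.75, as used by SGNS
```

Note that setting $\alpha = 1$ recovers plain PMI, since the smoothed distribution then reduces to the ordinary context frequencies.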

### Post-processing

Just like in pre-processing, three methods can be used to modify the word vectors produced by an algorithm.

#### Adding context vectors

The authors of GloVe recommend adding word vectors and context vectors to create the final output vectors, e.g. $\vec{v}_{\text{cat}} = \vec{w}_{\text{cat}} + \vec{c}_{\text{cat}}$. This adds first-order similarity terms, i.e. $w \cdot v$. This method cannot be applied to PMI, however, as the vectors produced by PMI are sparse.

#### Eigenvalue weighting

SVD produces the following matrices: $W^{SVD} = U_d \cdot \Sigma_d$ and $C^{SVD} = V_d$. These matrices, however, have different properties: $C^{SVD}$ is orthonormal, while $W^{SVD}$ is not.

SGNS, in contrast, is more symmetric. We can thus weight the eigenvalue matrix $\Sigma_d$ with an additional tunable parameter $p$ to yield the following:

$W^{SVD} = U_d \cdot \Sigma_d^p$.
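A brief sketch of eigenvalue weighting on top of a truncated SVD; the stand-in matrix is random, and $p = 0.5$ is one commonly tuned value:

```python
import numpy as np

rng = np.random.default_rng(1)
M = np.maximum(rng.standard_normal((30, 20)), 0.0)  # stand-in PPMI matrix

U, S, Vt = np.linalg.svd(M, full_matrices=False)
d, p = 10, 0.5
W = U[:, :d] * S[:d] ** p   # W = U_d · Σ_d^p; p = 1 recovers the plain SVD vectors
```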

#### Vector normalisation

Finally, we can also normalise all vectors to unit length.
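For completeness, a sketch of row-wise unit normalisation (zero vectors are left untouched to avoid division by zero):

```python
import numpy as np

def normalize_rows(W):
    """Scale each word vector to unit L2 length."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / np.where(norms == 0, 1.0, norms)

W_unit = normalize_rows(np.array([[3.0, 4.0],
                                  [0.0, 0.0]]))
```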

## Results

Levy et al. (2015) train all models on a dump of the English Wikipedia and evaluate them on commonly used word similarity and analogy datasets. You can read more about the experimental setup and training details in their paper. We summarise the most important results and takeaways below.

### Takeaways

Levy et al. find that SVD — and not one of the word embedding algorithms — performs best on similarity tasks, while SGNS performs best on analogy datasets. They furthermore shed light on the importance of hyperparameters compared to other choices:

1. Hyperparameters vs. algorithms:
Hyperparameter settings are often more important than algorithm choice.
No single algorithm consistently outperforms the other methods.
2. Hyperparameters vs. more data:
Training on a larger corpus helps for some tasks.
In 3 out of 6 cases, tuning hyperparameters is more beneficial.

### Debunking prior claims

Equipped with these insights, we can now debunk some generally held claims:

1. Are embeddings superior to distributional methods?
With the right hyperparameters, no approach has a consistent advantage over another.
2. Is GloVe superior to SGNS?
SGNS outperforms GloVe on all comparison tasks of Levy et al. This should nevertheless be taken with a grain of salt, as GloVe might perform better on other tasks.
3. Is CBOW a good word2vec configuration?
CBOW does not outperform SGNS on any task.

## Recommendations

DON’T use shifted PPMI with SVD.

DON’T use SVD “correctly”, i.e. without eigenvalue weighting (performance drops 15 points compared to eigenvalue weighting with $p = 0.5$).

DO use PPMI and SVD with short contexts (window size of $2$).

DO use many negative samples with SGNS.

DO always use context distribution smoothing (raise unigram distribution to the power of $\alpha = 0.75$) for all methods.

DO use SGNS as a baseline (robust, fast and cheap to train).

DO try adding context vectors in SGNS and GloVe.

## Conclusion

These results run counter to the general consensus that word embeddings are superior to traditional methods. Rather, they indicate that it typically makes little difference whether word embeddings or distributional methods are used. What really matters is that you tune your hyperparameters and apply the appropriate pre-processing and post-processing steps.

Recent studies from Jurafsky’s group [13], [14] reflect these findings and show that SVD, rather than SGNS, is often the preferred choice when accurate word representations are important.

We hope this overview of word embeddings has helped to highlight some fantastic research that sheds light on the relationship between traditional distributional semantic models and in-vogue embedding models.

## References

[1]: Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A Neural Probabilistic Language Model. The Journal of Machine Learning Research, 3, 1137–1155. http://doi.org/10.1162/153244303322533223

[2]: Mikolov, T., Corrado, G., Chen, K., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. Proceedings of the International Conference on Learning Representations (ICLR 2013), 1–12.

[3]: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. NIPS, 1–9.

[4]: Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing. Proceedings of the 25th International Conference on Machine Learning – ICML ’08, 20(1), 160–167. http://doi.org/10.1145/1390156.1390177

[5]: Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global Vectors for Word Representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, 1532–1543. http://doi.org/10.3115/v1/D14-1162

[6]: Kim, Y., Jernite, Y., Sontag, D., & Rush, A. M. (2016). Character-Aware Neural Language Models. AAAI. Retrieved from http://arxiv.org/abs/1508.06615

[7]: Jozefowicz, R., Vinyals, O., Schuster, M., Shazeer, N., & Wu, Y. (2016). Exploring the Limits of Language Modeling. Retrieved from http://arxiv.org/abs/1602.02410

[8]: Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research, 12 (Aug), 2493–2537. Retrieved from http://arxiv.org/abs/1103.0398

[9]: Chen, W., Grangier, D., & Auli, M. (2015). Strategies for Training Large Vocabulary Neural Language Models, 12. Retrieved from http://arxiv.org/abs/1512.04906

[10]: Levy, O., Goldberg, Y., & Dagan, I. (2015). Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3, 211–225. Retrieved from https://tacl2013.cs.columbia.edu/ojs/index.php/tacl/article/view/570

[11]: Baroni, M., Dinu, G., & Kruszewski, G. (2014). Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. ACL, 238–247. http://doi.org/10.3115/v1/P14-1023

[12]: Levy, O., & Goldberg, Y. (2014). Neural Word Embedding as Implicit Matrix Factorization. Advances in Neural Information Processing Systems (NIPS), 2177–2185. Retrieved from http://papers.nips.cc/paper/5477-neural-word-embedding-as-implicit-matrix-factorization

[13]: Hamilton, W. L., Clark, K., Leskovec, J., & Jurafsky, D. (2016). Inducing Domain-Specific Sentiment Lexicons from Unlabeled Corpora. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Retrieved from http://arxiv.org/abs/1606.02820

[14]: Hamilton, W. L., Leskovec, J., & Jurafsky, D. (2016). Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. arXiv Preprint arXiv:1605.09096.

## Intro

Risk analysis is a fundamental aspect of any business strategy. It involves identifying and assessing events and occurrences that could negatively affect an organization. The goal of risk analysis is to allow organizations to uncover and examine any risks that may be a factor to their operation as a whole, or to individual products, campaigns, projects and future plans.

## How does media monitoring help with risk analysis?

Media monitoring is a major aspect of a modern risk management and analysis strategy. The sheer amount of content created online through blogs, news outlets and social media channels provides organizations with an easily accessible and publicly available source of public opinion and reactions to worldwide events as they unfold.

Put simply, media monitoring enables you to keep track of the general reaction to specific events, whether controlled or not, that might impact an organization. There is a wealth of freely available information out there, but it can be challenging to filter through the noise to find what really matters.

## What are the challenges associated with media monitoring?

The main challenge that organizations face when trying to monitor content on the web is the sheer and ever-expanding volume of content being uploaded to the web each and every day. Each company will have specific reasons for mining and monitoring this content and each will have a unique variety of aspects that they are interested in looking at.

What’s required is a solution that allows for super specific and flexible search options, allowing for the sourcing of highly relevant content, as it’s published and thus providing intelligent insights, trends and actionable data from this sourced content.

## How is media monitoring being used?

The ability to collect, index and analyze content at scale provides an efficient way to harness publicly available content and unearth key business insights and trends that could potentially have adverse effects on a business.

Accuracy and timeliness of information are crucial to this process. With this in mind, we’ll show you some examples of how media monitoring is being used in risk analysis and how our News API can help you keep your finger on the pulse and stay aware of important developments as they happen:

### 1. Monitoring public opinion and identifying threats

Public opinion towards an organization or their employees, products and brands can often be the making (or breaking) of them. Reputations of organizations among the public can often deteriorate over time and it can be crucial to try and spot such trends before they become a serious issue. To achieve this, the continuous monitoring of specific media searches is required.

Our News API supports Boolean Search, which allows you to build simple search queries using standard boolean operators. This means that you can build either general or more targeted queries based on your interests and requirements. As an example, let’s search for the following:

Articles that are in English, contain Samsung in the title, have a negative sentiment and were published between 60 days ago and now.
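For illustration, such a search could be expressed as a set of query parameters along the following lines. Note that the parameter names below are assumptions for demonstration purposes, not the documented News API schema; consult the official documentation for the exact field names:

```python
# Hypothetical query parameters for the search described above.
# These names are illustrative, not the documented News API schema.
params = {
    "title": "Samsung",                      # Samsung must appear in the title
    "language": ["en"],                      # English-language articles only
    "sentiment.title.polarity": "negative",  # negative sentiment
    "published_at.start": "NOW-60DAYS",      # from 60 days ago...
    "published_at.end": "NOW",               # ...until now
}
```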

Why negative sentiment? If I’m assessing risk around a certain company, I’ll certainly want to know what they have been up to in recent times, and in particular, any bad press they have been subject to. As we know, bad press usually equates to a negative public perception.

News API results are returned in JSON format, making it easy for you to use the data as you please. Here are a few examples of visualizations generated from the above search:

#### Sentiment Polarity

The red line in the graph below represents the levels of negative polarity towards Samsung over the past 60 days.

It’s clear from this chart that Samsung received some pretty bad press in the month of September, judging by the sharp increase in negative polarity, versus August. Now, of course, we want to know why and whether the root cause of this negativity is going to be a concern or potential risk.

Let’s generate a word cloud from our search results to uncover the most commonly used terms from the stories published during this time period.

It doesn’t take long to spot some potential causes for concern with battery, batteries, recall, fire, safety and problem all evident.

Samsung’s recent battery issues were well documented in the media, but what may have slipped under the radar for many was the involvement of the Federal Aviation Administration in this event, as you can see in our word cloud above. By modifying our original search to include Federal Aviation Administration, we can dive deeper into their involvement:

This is a prime example of how targeted searches using the News API can help unearth unforeseen threats or concerns by monitoring public opinion around the entities and events that matter most to your organization.

### 2. Monitoring competitor and industry activity

Monitoring and analyzing competitor activity can equip you with a wealth of information and provide hints of strategic movements that can provide you with a competitive advantage in your quest for market dominance.

Naturally, competitor activity generates a potential threat to the success of your organization. Just look at Apple and Samsung, for example, where it seems that each action either company takes is carefully scrutinized, analyzed and compared to the other. Samsung were certainly quick to react to Apple’s ‘bendy’ iPhone 6!

While it was hard to miss stories about Bendgate in the news, not all stories receive such mainstream attention, and they could easily be missed if you’re not looking at the relevant channels. By monitoring for mentions of specific organizations, brands, products, people and so on, you can be alerted as soon as a matching article is published. Not only does this make it easy to keep track of your direct competition, it can also help you keep abreast of general industry goings-on and any murmurs of potential new competitors or industry concerns.

### 3. Crisis management

With so many factors and variables at play and infinite external influences, no industry is immune to a potential crisis. How individual organizations react to such crises can ultimately decide whether they survive to see the end of it or not. Let’s take a look at one industry in particular that has been coming under recent scrutiny for including unsustainable or environmentally-unfriendly ingredients in many of its products – the cosmetics industry.

One such ingredient is palm oil, a substance that has been linked to major issues such as deforestation, habitat degradation, climate change and animal cruelty in the countries where it is produced. As a cosmetics manufacturer who uses palm oil, as many do, the intensifying spotlight on this substance is bound to be of considerable concern.

By monitoring mentions of palm oil in the news, these manufacturers can keep up to date with the latest developments, as they happen, putting them in a strong position to react as soon as required. Below is one such example of a story that was returned while monitoring mentions of ‘Palm Oil’ in African media:

Further analysis can show trends in the likes of social media shares or article length breakdown, either of which could signify a growing emphasis on Palm Oil among the public and media. Looking again at the image above, you can see the number of social shares for this particular story just beneath the title.

### 4. Trend analysis

With access to a world of news content and intelligent insights comes the opportunity for countless analyses and comparisons of trends. As an example, let’s search by category to see if there are any noticeable differences or trends emerging from news stories in two separate countries.

The category we’ll look at is Electric Cars and the two source countries being analyzed are the UK and Australia. Below we have visual representations of the sentiment levels returned for each search, from the past 30 days.

As you can see, the vast majority of stories have been written in a neutral manner. What we’re interested in, however, is the significant difference in the levels of negative sentiment between the two countries around our chosen category.

Our results show that the Australian media are perhaps not too keen on the idea of Electric Cars, or perhaps there has been some negative publicity around the topic in recent times. On further inspection, we found that the uptake of electric cars has been extremely low in Australia compared to other countries, with manufacturers citing a lack of government assistance for this.

While this may seem like a straightforward comparison, when applied at scale it is this level of analysis that enables risk assessors to spot trends and ultimately improve their decision-making process. By analyzing multiple metrics side by side, interesting trends can emerge. Looking at the comparison below, again between the UK and Australia, it is evident that even in the past two months, the volume of stories relating to electric cars is increasing in Australia, but general interest still lags considerably behind the UK.

Business owners and project managers understand who and where potential threats can come from, and therefore have a very defined variety of entities and elements that need to be monitored. Projects that are based in, or focused on, different geographic locations will often pose their own unique threats and challenges. A multi-region project, for example, will require multiple risk assessments as part of the overall risk analysis process.

## Conclusion

With each project comes a new set of challenges and potential threats. The more an organization can learn about these threats the greater chance they have of reducing the level of risk involved in making certain decisions or strategic moves. Media monitoring provides risk assessors with a wealth of publicly available information, from which intelligent insights, trends and analyses can be drawn.

However specific or niche your own search requirements are, with 24/7 worldwide news monitoring backed up by cutting edge Machine Learning and Natural Language Processing technology, our News API can help you with your risk analysis needs.

Ready to get started? Try the News API free for 14 days and with our Getting Started with the News API guides below, you’ll be up and running in no time.

Getting Started with the News API Part 1: Search

Getting Started with the News API Part 2: Insights

At Aylien we are using recent advances in Artificial Intelligence to try to understand natural language. Part of what we do is building products such as our Text Analysis API and News API to help people extract meaning and insight from text. We are also a research lab, conducting research that we believe will make valuable contributions to the field of Artificial Intelligence, as well as driving further product development (see this post about a recent publication on aspect-based sentiment analysis by one of our research scientists for example).

We are excited to announce that we are currently accepting applications from researchers from academia who would like to collaborate on joint research projects, as part of the Science Foundation Ireland Industry Fellowship programme. The main aim of these projects is to conduct and publish novel research in shared areas of interest, but we are also happy to work with researchers who are interested in collaborating on building products. For researchers, we feel that this is a great opportunity to work in industry with a team of talented scientists and engineers, and with the resources and infrastructure to support your work.

We are particularly interested in collaborating on work in the following areas (but are open to other suggestions):

• Representation Learning
• Domain Adaptation and Transfer Learning
• Sentiment Analysis
• Dialogue Systems
• Entity and Relation Extraction
• Topic Modeling
• Document Classification
• Taxonomy Inference
• Document Summarization
• Machine Translation

## Details and requirements

This work is funded by the SFI Industry Fellowship programme, and so to be eligible you must have a PhD and be currently hired as a Postdoctoral Researcher, a Research Fellow or a Lecturer by a Higher Education Institution in Ireland.

The initial duration of the project is between 12 and 24 months (12 months full-time, but you can spread it over up to 24 months if you like), but we hope to continue the collaboration afterwards. The total amount of funding available is up to €100,000.

The final application deadline is December 2nd 2016, with a tentative start date at Aylien in March 2017. We will work with you on the application, so if you are interested then you should get in touch with us as soon as possible.

You must have:

• A PhD degree in a related field (science, engineering, or mathematics), and be currently hired as a Postdoctoral Researcher, a Research Fellow or a Lecturer by a higher education institution in Ireland.
• Strong knowledge of at least one programming language, and general software engineering skills (experience with version control systems, debugging, testing, etc.).
• The ability to grasp new concepts quickly.

Preferably you should also have:

• A strong knowledge of Artificial Intelligence and its subfields.
• A strong Machine Learning background.
• Experience with the scientific Python stack: NumPy, SciPy, scikit-learn, TensorFlow, Theano, Keras, etc.
• Experience with Deep Learning.
• Good understanding of linguistics and language.
• Experience with non-English NLP.

## Research at Aylien

We are a team of research scientists (PhDs and PhD students) and research engineers, working in an open and transparent way with no bureaucracy. If you join us, you will get to work with a seasoned engineering team that will allow you to take successful algorithms from prototype into production. You will also have the opportunity to work on many technically challenging problems, publish at conferences, and contribute to the development of state-of-the-art Deep Learning and natural language understanding applications.

We work primarily on Deep Learning with TensorFlow and the rest of the scientific Python stack (NumPy, SciPy, scikit-learn, etc.).