# A Hierarchical Model of Reviews for Aspect-based Sentiment Analysis

Sentiment analysis is widely used to gauge public opinion towards products, to analyze customer satisfaction, and to detect trends. With the proliferation of customer reviews, more fine-grained aspect-based sentiment analysis (ABSA) has gained in popularity, as it allows aspects of a product or service to be examined in more detail. To this end, we have launched an ABSA service a while ago and demonstrated how the service can be used to gain insights into the strengths and weaknesses of a product.

For performing sentiment analysis on customer reviews (as well as with many other text classification tasks), we face the problem that there are many different categories of reviews such as books, electronics, restaurants, etc. (You only need to have a look at the Departments tab on Amazon to get a feeling for the diversity of these categories.) In Machine Learning and Natural Language Processing, we refer to these different categories as domains; every domain has their unique characteristics.

In practice, this means that a model that is trained on one domain will do worse on another domain depending on how dissimilar the two domains are. For instance, a model trained on the restaurants domain will do a lot worse on the books domain than on the comparatively more similar hotels domain. For aspect-based sentiment analysis, this problem is amplified as not only the domains but also the aspects belonging to those domains differ.

In addition, as the world becomes more globalized, customer reviews need to be analyzed in other languages besides English. We thus require large amounts of training data for a large number of language-domain pairs, which is infeasible in practice as annotation — particularly the meticulous annotation required for ABSA — is expensive.

There are two directions that complement each other, which we can take to address this deficit:

1. We can create models that allow us to transfer existing knowledge and adapt trained models to new domains without incurring a large performance hit. This area is called domain adaptation and we will talk about this in future blog posts.
2. We can create models that are more generic and that are able to generalize well even when trained on relatively few data by leveraging information inherent in the training data. In the rest of this blog post, we will talk about the approach we took in our EMNLP 2016 paper towards this goal.

Even though Deep Learning-based models constitute the state-of-the-art in many NLP tasks, they traditionally do well only with large amounts of data. Finding ways to help them generalize with only few data samples is thus an important research problem in its own right. Taken to the extreme, we would like to emulate the way humans learn from only few examples, also known as One-Shot Learning.

Reviews — just with any coherent text — have an underlying structure. In the discourse structure of the review, sentences are connected via different rhetorical relations. Intuitively, knowledge about the relations and the sentiment of surrounding sentences should inform the sentiment of the current sentence. If a reviewer of a restaurant has shown a positive sentiment towards the quality of the food, it is likely that his opinion will not change drastically over the course of the review. Additionally, overwhelmingly positive or negative sentences in the review help to disambiguate sentences whose sentiment is equivocal.

Existing Deep Learning models for sentiment analysis act only on the sentence level; while they are able to consider intra-sentence relations, they fail to capture inter-sentence relations that rely on discourse structure and provide valuable clues for sentiment prediction.

We propose a hierarchical bidirectional long short-term memory (H-LSTM) that is able to leverage both intra– and inter-sentence relations. Because our model only relies on sentences and their structure within a review, it is fully language-independent.

## Model

Figure 1: The hierarchical bidirectional LSTM (H-LSTM) for aspect-based sentiment analysis. Word embeddings are fed into a sentence-level bidirectional LSTM. Final states of forward and backward LSTM are concatenated together with the aspect embedding and fed into a bidirectional review-level LSTM. At every time step, the output of the forward and backward LSTM is concatenated and fed into a final layer, which outputs a probability distribution over sentiments.

You can view the architecture of our model in the image above. The model consists of the following components:

## LSTM

We use a Long Short-Term Memory (LSTM), which adds input, output, and forget gates to a recurrent cell, which allow it to model long-range dependencies that are essential for capturing sentiment.

For the $$t$$th word in a sentence, the LSTM takes as input the word embedding $$x_t$$, the previous output $$h_{t-1}$$ and cell state $$c_{t-1}$$ and computes the next output $$h_t$$ and cell state $$c_t$$. Both $$h$$ and $$c$$ are initialized with zeros.

## Bidirectional LSTM

Both on the review and on the sentence level, sentiment is dependent not only on preceding but also successive words and sentences. A Bidirectional LSTM (Bi-LSTM) allows us to look ahead by employing a forward LSTM, which processes the sequence in chronological order, and a backward LSTM, which processes the sequence in reverse order. The output $$h_t$$ at a given time step is then the concatenation of the corresponding states of the forward and backward LSTM.

## Hierarchical Bidirectional LSTM

Stacking a Bi-LSTM on the review level on top of sentence-level Bi-LSTMs yields the hierarchical bidirectional LSTM (H-LSTM) in Figure 1.

The sentence-level forward and backward LSTMs receive the sentence starting with the first and last word embedding $$x\_{1}$$ and $$x\_l$$ respectively. The final output $$h\_l$$ of both LSTMs is then concatenated with the aspect vector $$a$$ and fed as input into the review-level forward and backward LSTMs. The outputs of both LSTMs are concatenated and fed into a final softmax layer, which outputs a probability distribution over sentiments for each sentence.

## Evaluation

The most popular benchmark for ABSA is the SemEval Aspect-based Sentiment Analysis task. We evaluate on the most recent edition of this task, the SemEval-2016 ABSA task. To demonstrate the language and domain independence of our model, we evaluate on datasets in five domains (restaurants, hotels, laptops, phones, cameras) and eight languages (English, Spanish, French, Russian, Dutch, Turkish, Arabic, Chinese) from the competition.

We compare our model using random (H-LSTM) and pre-trained word embeddings (HP-LSTM)  against the best model of the SemEval-2016 Aspect-based Sentiment Analysis task for each domain-language pair (Best) as well as against the two best single models of the competition: IIT-TUDA (Kumar et al., 2016), which uses large sentiment lexicons for every language, and XRCE (Brun et al., 2016), which uses a parser augmented with hand-crafted, domain-specific rules.

Language Domain Best XRCE IIT-TUDA CNN LSTM H-LSTM HP-LSTM
English Restaurants 88.1 88.1 86.7 82.1 81.4 83.0 85.3
Spanish Restaurants 83.6 83.6 79.6 75.7 79.5 81.8
French Restaurants 78.8 78.8 72.2 73.2 69.8 73.6 75.4
Russian Restaurants 77.9 73.6 75.1 73.9 78.1 77.4
Dutch Restaurants 77.8 77.0 75.0 73.6 82.2 84.8
Turkish Restaurants 84.3 84.3 74.2 73.6 76.7 79.2
Arabic Hotels 82.7 81.7 82.7 80.5 82.8 82.9
English Laptops 82.8 82.8 78.4 76.0 77.4 80.1
Dutch Phones 83.3 82.6 83.3 81.8 81.3 83.6
Chinese Cameras 80.5 78.2 77.6 78.6 78.8
Chinese Phones 73.3 72.4 70.3 74.1 73.3

Table 1: Results of our system with randomly initialized word embeddings (H-LSTM) and with pre-trained embeddings (HP-LSTM) for ABSA for each language and domain in comparison to the best system for each pair (Best), the best two single systems (XRCE, IIT-TUDA), a sentence-level CNN (CNN), and our sentence-level LSTM (LSTM).

As you can see in the table above, our hierarchical model achieves results superior to the sentence-level CNN and the sentence-level Bi-LSTM baselines for almost all domain-language pairs by taking the structure of the review into account.

In addition, our model shows results competitive with the best single models of the competition, while requiring no expensive hand-crafted features or external resources, thereby demonstrating its language and domain independence. Overall, our model compares favorably to the state-of-the-art, particularly for low-resource languages, where few hand-engineered features are available. It outperforms the state-of-the-art on four and five datasets using randomly initialized and pre-trained embeddings respectively. For more details, refer to our paper.

Let's Talk