# Sentiment Analysis

We will build a machine learning model to detect sentiment (i.e. detect if a sentence is positive or negative). This will be done on movie reviews, using the [IMDb dataset](http://ai.stanford.edu/~amaas/data/sentiment/).

Based on notebooks by Ben Trevett (https://github.com/bentrevett/pytorch-sentiment-analysis) distributed under the MIT Licence.

## Preparing Data

One of the main concepts of TorchText is the `Field`. A `Field` defines how your data should be processed. In our sentiment classification task the data consists of both the raw string of the review and the sentiment, either "pos" or "neg".

The parameters of a `Field` specify how the data should be processed. 

We use the `TEXT` field to define how the review should be processed, and the `LABEL` field to process the sentiment. 

Our `TEXT` field has `tokenize='spacy'` as an argument. This defines that the "tokenization" (the act of splitting the string into discrete "tokens") should be done using the [spaCy](https://spacy.io) tokenizer. If no `tokenize` argument is passed, the default is simply splitting the string on spaces.
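To make the difference concrete, here is a small sketch comparing whitespace splitting with a punctuation-aware tokenizer. The regular expression is only a stand-in chosen for illustration; it is not spaCy's actual rule set, which handles many more cases.

```python
import re

sentence = "This film isn't great, but I love it!"

# Default behaviour when no `tokenize` argument is given: split on whitespace.
# Punctuation stays glued to the neighbouring word ("great," and "it!").
whitespace_tokens = sentence.split()

# Hypothetical stand-in for a smarter tokenizer: runs of word characters
# (apostrophes included) and single punctuation marks become separate tokens.
regex_tokens = re.findall(r"[\w']+|[^\w\s]", sentence)

print(whitespace_tokens)
print(regex_tokens)
```

With whitespace splitting, "great," and "great" would end up as two different vocabulary entries, which is one reason a proper tokenizer is usually preferred.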

`LABEL` is defined by a `LabelField`, a special subclass of the `Field` class specifically used for handling labels. We will explain the `dtype` argument later.

For more on `Fields`, go [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py).

We also set the random seeds for reproducibility. 

In [None]:
import torch
from torchtext import data

SEED = 1234

torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)

Another handy feature of TorchText is that it has support for common datasets used in natural language processing (NLP). 

The following code automatically downloads the IMDb dataset and splits it into the canonical train/test splits as `torchtext.datasets` objects. It processes the data using the `Fields` we have previously defined. 

In [None]:
from torchtext import datasets

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

We can see how many examples are in each split by checking their length.

In [None]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

We can also check an example.

In [None]:
print(vars(train_data.examples[0]))

The IMDb dataset only has train/test splits, so we need to create a validation set. We can do this with the `.split()` method. 

By default this splits 70/30. By passing a `split_ratio` argument we can change the ratio of the split, e.g. a `split_ratio` of 0.8 would mean 80% of the examples make up the training set and 20% make up the validation set. 

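We also seed Python's random number generator and pass its state to the `random_state` argument, ensuring that we get the same train/validation split each time.

In [None]:
import random

# random.seed() returns None, so seed first and pass the generator state explicitly
random.seed(SEED)
train_data, valid_data = train_data.split(random_state=random.getstate())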
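The idea behind a seeded split can be sketched in plain Python. This is not TorchText's implementation; `split_examples` and its parameters are hypothetical names used only to show how a fixed seed and a `split_ratio` together determine the two parts.

```python
import random

def split_examples(examples, split_ratio=0.7, seed=1234):
    """Shuffle a copy of `examples` and cut it into (train, valid) parts."""
    rng = random.Random(seed)      # local generator, so the split is repeatable
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = int(round(split_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# 25,000 stand-in "examples" (just integers here), split 70/30
train_part, valid_part = split_examples(range(25000))
print(len(train_part), len(valid_part))
```

Calling the function again with the same seed returns exactly the same two lists, which is what makes experiments comparable between runs.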

Again, we'll view how many examples are in each split.

In [None]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Next, we have to build a _vocabulary_. This is effectively a lookup table where every unique word in your data set has a corresponding _index_ (an integer).

We do this as our machine learning model cannot operate on strings, only numbers. Each _index_ is used to construct a _one-hot_ vector for each word. A one-hot vector is a vector where all of the elements are 0, except one, which is 1, and dimensionality is the total number of unique words in your vocabulary, commonly denoted by $V$.

![](https://i.imgur.com/0o5Gdar.png)
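As a minimal illustration of the index-to-one-hot step, the sketch below uses a made-up five-word vocabulary (so $V = 5$); the words and their order are assumptions for the example only.

```python
# Toy vocabulary: the position of each word in the list is its index.
vocab = ["<unk>", "<pad>", "film", "great", "love"]
stoi = {word: index for index, word in enumerate(vocab)}

def one_hot(word):
    """Return a list of length V with a single 1 at the word's index."""
    vector = [0] * len(vocab)
    vector[stoi[word]] = 1
    return vector

print(stoi["great"])     # index of the word
print(one_hot("great"))  # the corresponding one-hot vector
```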

The number of unique words in our training set is over 100,000, which means that our one-hot vectors will have over 100,000 dimensions! This will make training slow, and the model may not fit onto your GPU (if you're using one). 

There are two ways to effectively cut down our vocabulary: we can either take only the top $n$ most common words or ignore words that appear fewer than $m$ times. We'll do the former, only keeping the top 25,000 words.

What do we do with words that appear in examples but we have cut from the vocabulary? We replace them with a special _unknown_ or `<unk>` token. For example, if the sentence was "This film is great and I love it" but the word "love" was not in the vocabulary, it would become "This film is great and I `<unk>` it".
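The sketch below mimics this on a three-sentence toy corpus using only the standard library. It is not how TorchText builds its vocabulary internally; the corpus, the `max_size` of 7 and the helper names are assumptions for illustration.

```python
from collections import Counter

corpus = [
    "this film is great and i like it".split(),
    "this film is great and i hate it".split(),
    "i love this film".split(),
]

# Count every token and keep only the `max_size` most frequent ones.
counts = Counter(token for sentence in corpus for token in sentence)
max_size = 7
itos = ["<unk>"] + [word for word, _ in counts.most_common(max_size)]
stoi = {word: index for index, word in enumerate(itos)}

def numericalize(tokens):
    """Map tokens to indices; anything outside the vocabulary becomes <unk>."""
    return [stoi.get(token, stoi["<unk>"]) for token in tokens]

indices = numericalize("this film is great and i love it".split())
decoded = " ".join(itos[i] for i in indices)
print(decoded)
```

Because "love" appears only once in the toy corpus, it falls outside the top 7 words and comes back as `<unk>` after the round trip.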

The following builds the vocabulary, only keeping the most common `max_size` tokens.

In [None]:
TEXT.build_vocab(train_data, max_size=25000)
LABEL.build_vocab(train_data)

Why do we only build the vocabulary on the training set? When testing any machine learning system you do not want to look at the test set in any way. We do not include the validation set as we want it to reflect the test set as much as possible.

In [None]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

Why is the vocab size 25002 and not 25000? One of the additional tokens is the `<unk>` token and the other is a `<pad>` token.

When we feed sentences into our model, we feed a _batch_ of them at a time, i.e. more than one at a time, and all sentences in the batch need to be the same size. Thus, to ensure each sentence in the batch is the same size, any sentence shorter than the longest one in the batch is padded with `<pad>` tokens.

![](https://i.imgur.com/TZRJAX4.png)
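Padding itself is simple enough to sketch in a few lines. The helper `pad_batch` is a hypothetical name; TorchText does the equivalent for us when it builds a batch.

```python
def pad_batch(batch, pad_token="<pad>"):
    """Append pad tokens so every sentence is as long as the longest one."""
    longest = max(len(sentence) for sentence in batch)
    return [sentence + [pad_token] * (longest - len(sentence)) for sentence in batch]

batch = [
    ["i", "hate", "this", "film"],
    ["this", "film", "sucks"],
    ["great"],
]

padded = pad_batch(batch)
for sentence in padded:
    print(sentence)
```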

We can also view the most common words in the vocabulary. 

In [None]:
print(TEXT.vocab.freqs.most_common(20))

We can also see the vocabulary directly using either the `stoi` (**s**tring **to** **i**nt) or `itos` (**i**nt **to**  **s**tring) attribute.

In [None]:
print(TEXT.vocab.itos[:10])

We can also check the labels, ensuring 0 is for negative and 1 is for positive.

In [None]:
print(LABEL.vocab.stoi)

The final step of preparing the data is creating the iterators. We iterate over these in the training/evaluation loop, and they return a batch of examples (indexed and converted into tensors) at each iteration.

We'll use a `BucketIterator` which is a special type of iterator that will return a batch of examples where each example is of a similar length, minimizing the amount of padding per example.

We also want to place the tensors returned by the iterator on the GPU (if you're using one). PyTorch handles this using `torch.device`; we then pass this device to the iterator.

In [None]:
BATCH_SIZE = 64

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size=BATCH_SIZE,
    device=device)
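Why grouping sentences of similar length reduces padding can be shown with a little arithmetic. This sketch only counts pad tokens for two batching orders; it is not the actual `BucketIterator` algorithm, and the sentence lengths are made up.

```python
def padding_needed(lengths, batch_size):
    """Total number of pad tokens when consecutive sentences are batched together."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

# Sentence lengths in their original (arbitrary) order
lengths = [3, 50, 4, 48, 5, 52]

arbitrary_order = padding_needed(lengths, batch_size=3)
similar_lengths = padding_needed(sorted(lengths), batch_size=3)

print(arbitrary_order, similar_lengths)
```

In this toy case, batching similar lengths together cuts the padding from 144 tokens to 9.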

## Getting ready for training models

In [None]:
import torch
import torch.nn as nn

The criterion function calculates the loss, but we have to write our own function to calculate the accuracy. 

This function first feeds the predictions through a sigmoid layer, squashing the values between 0 and 1. We then round them to the nearest integer. This rounds any value greater than 0.5 to 1 (a positive sentiment) and the rest to 0 (a negative sentiment).

We then calculate how many rounded predictions equal the actual labels and average it across the batch.

In [None]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """

    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum()/len(correct)
    return acc
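A worked example with four made-up predictions shows the arithmetic. It mirrors `binary_accuracy` in plain Python so the numbers can be checked by hand; the logits and labels are invented for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

logits = [2.0, -1.0, 0.3, -0.2]   # raw model outputs
labels = [1.0, 0.0, 0.0, 0.0]     # true sentiments

# sigmoid gives roughly 0.88, 0.27, 0.57, 0.45, which round to 1, 0, 1, 0
rounded = [float(round(sigmoid(z))) for z in logits]

# three of the four rounded predictions match the labels
accuracy = sum(p == y for p, y in zip(rounded, labels)) / len(labels)
print(rounded, accuracy)
```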

The `train` function iterates over all examples, one batch at a time. 

`model.train()` is used to put the model in "training mode", which turns on _dropout_ and _batch normalization_. Although we aren't using them in this model, it's good practice to include it.

For each batch, we first zero the gradients. Each parameter in a model has a `grad` attribute which stores the gradient of the loss with respect to that parameter, computed by `loss.backward()`. PyTorch does not automatically remove (or "zero") the gradients from the previous backward pass, so they must be manually zeroed.

We then feed the batch of sentences, `batch.text`, into the model. Note, you do not need to do `model.forward(batch.text)`, simply calling the model works. The `squeeze` is needed as the predictions are initially size _**[batch size, 1]**_, and we need to remove the dimension of size 1 as PyTorch expects the predictions input to a loss function to simply be of size _**[batch size]**_.

The loss and accuracy are then calculated using our predictions and the labels, `batch.label`. 

We calculate the gradient of each parameter with `loss.backward()`, and then update the parameters using the gradients and optimizer algorithm with `optimizer.step()`.

The loss and accuracy are accumulated across the epoch. The `.item()` method is used to extract a scalar from a tensor which only contains a single value.

Finally, we return the loss and accuracy, averaged across the epoch. The `len` of an iterator is the number of batches in the iterator.

You may recall when initializing the `LABEL` field, we set `dtype=torch.float`. This is because TorchText sets tensors to be `LongTensor`s by default, however our criterion expects both inputs to be `FloatTensor`s. As we have manually set the `dtype` to be `torch.float`, this is automatically done for us. The alternative method of doing this would be to do the conversion inside the `train` function by passing `batch.label.float()` instead of `batch.label` to the criterion. 

In [None]:
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        
        optimizer.zero_grad()
                
        predictions = model(batch.text).squeeze(1)
        
        loss = criterion(predictions, batch.label)
        
        acc = binary_accuracy(predictions, batch.label)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)
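The reason `train` zeroes the gradients at the start of every batch can be seen on a single scalar parameter. This is a minimal sketch with a toy function $w^2$ rather than a real model.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)

(w * w).backward()            # d(w^2)/dw = 2w = 4
first = w.grad.item()

(w * w).backward()            # without zeroing, the new gradient is added: 4 + 4
accumulated = w.grad.item()

w.grad.zero_()                # clear the stored gradient, the job optimizer.zero_grad() does for every parameter
(w * w).backward()
after_zeroing = w.grad.item()

print(first, accumulated, after_zeroing)
```

Without the zeroing step, each batch's update would be computed from the sum of all gradients seen so far rather than from the current batch alone.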

`evaluate` is similar to `train`, with a few modifications as you don't want to update the parameters when evaluating.

`model.eval()` puts the model in "evaluation mode", which turns off _dropout_ and _batch normalization_. Again, we are not using them in this model, but it is good practice to include it.

No gradients are calculated on PyTorch operations inside the `with torch.no_grad()` block. This causes less memory to be used and speeds up computation.

The rest of the function is the same as `train`, with the removal of `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()`, as we do not update the model's parameters when evaluating.

In [None]:
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:

            predictions = model(batch.text).squeeze(1)
            
            loss = criterion(predictions, batch.label)
            
            acc = binary_accuracy(predictions, batch.label)

            epoch_loss += loss.item()
            epoch_acc += acc.item()
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

The training loop defines our loss function and the optimizer. In PyTorch the loss function is commonly called a criterion. 

The loss function here is _binary cross entropy with logits_. 

The prediction for each sentence is an unbounded real number. As our labels are either 0 or 1, we want to restrict the prediction to lie between 0 and 1, which we do using the _sigmoid_ (logistic) function. 

We then calculate the loss on this bounded scalar using binary cross entropy. 

The `BCEWithLogitsLoss` criterion carries out both the sigmoid and the binary cross entropy steps.
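The two steps and their fused form can be written out in plain Python for a single prediction. This is a sketch of the arithmetic only, not PyTorch's implementation; the function names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y):
    """Binary cross entropy for a probability p and a label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def bce_with_logits(z, y):
    """Same quantity computed directly from the logit z.

    Algebraically equal to bce(sigmoid(z), y), rearranged so that no
    logarithm of a value close to zero is ever taken.
    """
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

for z, y in [(2.0, 1.0), (2.0, 0.0), (-0.5, 1.0), (0.0, 0.0)]:
    print(z, y, bce(sigmoid(z), y), bce_with_logits(z, y))
```

Working directly on the logit is the usual motivation for a fused criterion: it gives the same loss while staying numerically stable for very large or very small predictions.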

We then train the model through multiple epochs, an epoch being a complete pass through all examples in the split.

In [None]:
import torch.optim as optim

def training_loop(model, n_epochs):
    criterion = nn.BCEWithLogitsLoss()
    criterion = criterion.to(device)
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    
    for epoch in range(n_epochs):

        train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
        valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)

        print(f'| Epoch: {epoch+1:02} | Train Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}% | Val. Loss: {valid_loss:.3f} | Val. Acc: {valid_acc*100:.2f}% |')
        
    print()
    print("Training finished, testing the model:")
    test_loss, test_acc = evaluate(model, test_iterator, criterion)

    print(f'| Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}% |')

## Mean-pooled word embeddings

In [None]:
class PoolEmbeddings(nn.Module):
    def __init__(self, input_dim, embedding_dim):
        super().__init__()
        
        # TODO: define an embedding layer
        # TODO: define a fully connected layer with a single scalar output
        
    def forward(self, x):
        # TODO: apply the embeddings on the input
        # TODO: mean-pool the embeddings
        # TODO: apply the fully connected layer on the embeddings
            
        pass
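If you get stuck, one possible way to fill in the TODOs is sketched below. It is not the only solution; the class name `MeanPoolSketch` is hypothetical, and it assumes the input has shape [sentence length, batch size], which is what the iterators above produce by default. Note that this simple mean also averages over `<pad>` positions.

```python
import torch
import torch.nn as nn

class MeanPoolSketch(nn.Module):
    def __init__(self, input_dim, embedding_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.fc = nn.Linear(embedding_dim, 1)

    def forward(self, x):
        # x: [sentence length, batch size] token indices
        embedded = self.embedding(x)      # [sentence length, batch size, embedding dim]
        pooled = embedded.mean(dim=0)     # [batch size, embedding dim]
        return self.fc(pooled)            # [batch size, 1]

# Quick shape check on random indices: 4 sentences of 7 tokens, vocabulary of 50
dummy_batch = torch.randint(0, 50, (7, 4))
sketch = MeanPoolSketch(input_dim=50, embedding_dim=16)
output = sketch(dummy_batch)
print(output.shape)
```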

In [None]:
INPUT_DIM = len(TEXT.vocab)

model_1 = PoolEmbeddings(INPUT_DIM, 100)
model_1 = model_1.to(device)

In [None]:
training_loop(model_1, 20)

## 1-D Convolution with Max-Pooling

One of the first successful deep learning models for sentence classification was a convolutional network with max-pooling. The model was introduced in 2014 in a paper called [Convolutional Neural Networks for Sentence Classification](https://www.aclweb.org/anthology/D14-1181).

The model has the following structure:

- word embeddings are first processed with 1D convolutions with window sizes 3, 4 and 5, each with 100 filters
- the filter outputs are max-pooled over time
- a classifier is applied on the max-pooled states

In [None]:
class Convolutional(nn.Module):
    def __init__(self, input_dim, embedding_dim, kernel_sizes, filters, output_dim):
        super().__init__()
        
        # TODO: create all necessary layers
        
        # hint: if you want to use a list of modules, you need to use `nn.ModuleList`
        
    def forward(self, x):

        # TODO: apply the layers on data x
        
        # hint: max-pooling is just applying maximum over the correct axis
        
        pass
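One possible shape for the solution is sketched below. The class name `ConvSketch` and the ReLU after each convolution are assumptions rather than part of the exercise text, and the input is again assumed to be [sentence length, batch size]. Sentences must be at least as long as the largest window.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSketch(nn.Module):
    def __init__(self, input_dim, embedding_dim, kernel_sizes, filters, output_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        # One 1D convolution per window size, registered through nn.ModuleList
        self.convs = nn.ModuleList(
            [nn.Conv1d(embedding_dim, filters, kernel_size=k) for k in kernel_sizes]
        )
        self.fc = nn.Linear(len(kernel_sizes) * filters, output_dim)

    def forward(self, x):
        # Conv1d expects [batch size, channels, length]
        embedded = self.embedding(x).permute(1, 2, 0)
        # Max over the time axis leaves one value per filter
        pooled = [F.relu(conv(embedded)).max(dim=2)[0] for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))

# Shape check: 4 sentences of 12 tokens, vocabulary of 50
dummy_batch = torch.randint(0, 50, (12, 4))
sketch = ConvSketch(50, 16, [3, 4, 5], 8, 1)
output = sketch(dummy_batch)
print(output.shape)
```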

In [None]:
model_2 = Convolutional(INPUT_DIM, 100, [3,4,5], 100, 1)
model_2 = model_2.to(device)

In [None]:
training_loop(model_2, 10)

## LSTM

Apply an LSTM layer over the embeddings and use the final state of the network for classification. Optionally, you can try a bidirectional network, layer normalization and a deeper network with residual connections.

In [1]:
import torch.nn as nn

class Recurrent(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim):
        super().__init__()
        
        # TODO: create all necessary layers
        
    def forward(self, x):

        # TODO: apply the layers on data x
        
        pass
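A minimal unidirectional version might look like the sketch below. The class name `LSTMSketch` is hypothetical, and the input is assumed to be [sentence length, batch size]. Because padded positions are fed through the LSTM too, the final state of shorter sentences is computed after their `<pad>` tokens; packing the sequences would avoid this but is left out to keep the sketch short.

```python
import torch
import torch.nn as nn

class LSTMSketch(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(input_dim, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x):
        embedded = self.embedding(x)            # [sentence length, batch size, embedding dim]
        _, (hidden, _) = self.lstm(embedded)    # hidden: [1, batch size, hidden dim]
        return self.fc(hidden[-1])              # [batch size, 1]

# Shape check: 4 sentences of 9 tokens, vocabulary of 50
dummy_batch = torch.randint(0, 50, (9, 4))
sketch = LSTMSketch(input_dim=50, embedding_dim=16, hidden_dim=32)
output = sketch(dummy_batch)
print(output.shape)
```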

In [None]:
model_3 = Recurrent(INPUT_DIM, 100, 100)
model_3 = model_3.to(device)

In [None]:
training_loop(model_3, 10)

## Pre-trained word embeddings

1. Download pre-trained GloVe embeddings
2. Prepare weight matrix for our vocabulary

In [None]:
TEXT.vocab.load_vectors("glove.6B.100d")
pretrained_embeddings = TEXT.vocab.vectors
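Conceptually, the second step lines up one pre-trained vector per vocabulary entry, in index order. The sketch below shows the idea in plain Python with two-dimensional toy vectors; `build_weight_matrix` is a hypothetical helper, and giving words without a pre-trained vector an all-zero row is an assumption of this sketch.

```python
def build_weight_matrix(itos, pretrained, dim):
    """Return one row per vocabulary entry, in index order."""
    # Words without a pre-trained vector get an all-zero row (assumption).
    return [list(pretrained.get(word, [0.0] * dim)) for word in itos]

itos = ["<unk>", "<pad>", "film", "great"]
pretrained = {"film": [0.1, 0.2], "great": [0.3, 0.4], "unused": [0.5, 0.6]}

weights = build_weight_matrix(itos, pretrained, dim=2)
for word, row in zip(itos, weights):
    print(word, row)
```

Row `i` of the matrix then belongs to the word with index `i`, which is exactly the layout an embedding layer expects for its weights.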

Hint: You can just inherit from the `PoolEmbeddings` class and only modify the constructor to assign the embedding weights to its embedding layer.

You can try the same also with CNN and LSTM models.

In [None]:
class PretrainedEmbeddingsMeanPool(PoolEmbeddings):
    def __init__(self, pretrained_embeddings):
        vocab_size, embeddings_dim = pretrained_embeddings.shape
        super().__init__(vocab_size, embeddings_dim)
        
        self.embedding.weight.data.copy_(pretrained_embeddings)
        self.embedding.weight.requires_grad = False

In [None]:
model_4 = PretrainedEmbeddingsMeanPool(pretrained_embeddings)
model_4 = model_4.to(device)

In [None]:
training_loop(model_4, 10)

## Pre-trained contextual embeddings

In [None]:
from allennlp.modules.elmo import Elmo, batch_to_ids

ELMO_OPTIONS = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
ELMO_WEIGHTS = "https://s3-us-west-2.amazonaws.com/allennlp/models/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

In [None]:
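class PoolElmo(nn.Module):
    def __init__(self, elmo_options, elmo_weights):
        super().__init__()
        # 2 output representations of 1024 dimensions each, concatenated below
        self.elmo = Elmo(elmo_options, elmo_weights, 2, dropout=0)
        self.fc = nn.Linear(2048, 1)
        
    def forward(self, x):
        # x holds vocabulary indices [sentence length, batch size], but ELMo works on
        # characters, so map each sentence back to its tokens (padding dropped)
        pad_idx = TEXT.vocab.stoi[TEXT.pad_token]
        sentences = [[TEXT.vocab.itos[i] for i in sentence if i != pad_idx]
                     for sentence in x.t().tolist()]
        elmo_output = self.elmo(batch_to_ids(sentences).to(x.device))
        # [batch size, sentence length, 2048]
        ctx_embeddings = torch.cat(elmo_output['elmo_representations'], dim=-1)
        # mean-pool over the real (unmasked) tokens only
        mask = elmo_output['mask'].unsqueeze(-1).float()
        output = (ctx_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
        return self.fc(output)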

In [None]:
model_5 = PoolElmo(ELMO_OPTIONS, ELMO_WEIGHTS)
model_5 = model_5.to(device)

In [None]:
training_loop(model_5, 10)