cnn module

CNN model for Sentiment Classification.

This script implements a Convolutional Neural Network for binary sentiment classification. To classify the data, the model uses an Embedding Layer that converts each word into a numeric vector.

Convolutions are sliding window functions applied to a matrix to extract local features. The sliding window is called a kernel, filter, or feature detector. By representing each word with a fixed-length vector of numbers and stacking the vectors of a sentence on top of each other, we get a matrix that can be processed like an image.

See Also

https://torchtext.readthedocs.io/en/latest/index.html

References

The deep learning framework used to develop this module is PyTorch [1].

[1]

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. "PyTorch: An Imperative Style, High-Performance Deep Learning Library". In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (eds.), Advances in Neural Information Processing Systems 32, pp. 8024-8035. Curran Associates, Inc., 2019.

class cnn.CNN(vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx)[source]

Bases: Module

Convolutional Neural Network model with Pretrained Embeddings.

Attributes:
vocab_size: int

Size of the dictionary of embeddings.

embedding_dim: int

The size of each embedding vector.

n_filters: int

Number of channels produced by the convolution.

filter_sizes: list

A list of integers giving the kernel sizes (the number of consecutive words each filter covers) of the convolution layers.

output_dim: int

The size of the output fully connected layer.

dropout: float

The probability of an element being zeroed.

pad_idx: int

The numerical identifier mapped to the string token used as padding.

Methods

conv_and_pool(x, conv)

Applies a 2D convolution followed by 1D max pooling.

forward(x)

Defines the computation performed by the CNN model at every call.

static conv_and_pool(x, conv)[source]

Applies a 2D convolution followed by 1D max pooling.

The method applies a 2D convolution over the input and then passes the convolved output through a Rectified Linear Unit (ReLU). The result is a tensor of size [32 x 64 x Y x 100] where:

  • 32 is the batch size

  • 64 is the number of filters

  • Y is the sequence length which is equal to the sentence length

  • 100 is the size of the second dimension of the kernel of a convolutional layer

This temporary result is then squeezed to yield a tensor of size [32 x 64 x Y]. In the last step, the method applies a 1D max pooling over the squeezed tensor with a sliding window of size Y. The result is then squeezed again to produce a tensor of size [32 x 64].

Parameters:
x: torch.tensor

This is a tensor of type float that operates as input for each convolution layer.

conv: torch.nn.Module

Applies a 2D convolution over a given input.

Returns:
torch.tensor

The 2D tensor to be used in the linear layer.
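
The shape bookkeeping described above can be sketched as follows. This is a minimal illustration, not the module's actual source. It assumes that `conv` is a `torch.nn.Conv2d` whose kernel width equals the embedding dimension, so that the last output dimension has size 1 and can be squeezed. The batch size, filter count, sentence length, and embedding size are illustrative values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_and_pool(x, conv):
    """Convolve, apply ReLU, then max-pool over the sequence dimension."""
    # Assumption: the kernel spans the full embedding width, so the
    # convolved tensor is [batch, n_filters, Y, 1] and dim 3 can be squeezed.
    x = F.relu(conv(x)).squeeze(3)
    # Pool over the whole remaining length Y -> [batch, n_filters]
    return F.max_pool1d(x, x.size(2)).squeeze(2)


# Illustrative sizes (assumptions): batch 32, 64 filters, embedding dim 100.
embedded = torch.randn(32, 1, 20, 100)  # [batch, channel, seq_len, emb_dim]
conv = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=(3, 100))
pooled = conv_and_pool(embedded, conv)
print(pooled.shape)  # torch.Size([32, 64])
```

Because the pooling window covers the entire sequence dimension, each filter contributes exactly one value per sentence, regardless of sentence length.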

forward(x)[source]

Defines the computation performed by the CNN model at every call.

This method forwards the given input to every single model layer.

Parameters:
x: torch.tensor

This is a tensor of type int that operates as input for the defined model.

Returns:
torch.tensor

The tensor containing the predictions made by the model.
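
As a rough sketch of how the constructor arguments could fit together in a forward pass, the following hypothetical `SketchCNN` class embeds the token ids, runs one convolution per entry of `filter_sizes`, concatenates the pooled features, and applies dropout and a linear layer. The layer layout is an assumption based on the attribute descriptions above and is not taken from the module's source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchCNN(nn.Module):
    """Hypothetical stand-in for cnn.CNN; the layer layout is an assumption."""

    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes,
                 output_dim, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim,
                                      padding_idx=pad_idx)
        # One convolution per kernel height; each kernel spans the embedding.
        self.convs = nn.ModuleList([
            nn.Conv2d(1, n_filters, kernel_size=(fs, embedding_dim))
            for fs in filter_sizes
        ])
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)

    @staticmethod
    def conv_and_pool(x, conv):
        x = F.relu(conv(x)).squeeze(3)
        return F.max_pool1d(x, x.size(2)).squeeze(2)

    def forward(self, x):
        # x: [batch, seq_len] tensor of integer token ids
        embedded = self.embedding(x).unsqueeze(1)  # [batch, 1, seq_len, emb]
        pooled = [self.conv_and_pool(embedded, conv) for conv in self.convs]
        return self.fc(self.dropout(torch.cat(pooled, dim=1)))


model = SketchCNN(vocab_size=1000, embedding_dim=100, n_filters=64,
                  filter_sizes=[3, 4, 5], output_dim=1, dropout=0.5, pad_idx=1)
tokens = torch.randint(0, 1000, (32, 20))
logits = model(tokens)
print(logits.shape)  # torch.Size([32, 1])
```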

training: bool

cnn.binary_accuracy(preds, y)[source]

Computes prediction accuracy.

This method estimates the model's accuracy over binary targets. Each prediction is rounded so that it can be compared to the true label.

Parameters:
preds: torch.tensor

These are the predictions returned by the model for an input batch.

y: torch.tensor

This is the ground truth tensor for the same input batch.

Returns:
torch.tensor

The model's accuracy ratio for the given predictions, returned as a tensor that holds a single float.
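
A minimal sketch of this computation is shown below. It assumes that `preds` holds raw logits, so a sigmoid is applied before rounding. If the model already outputs probabilities, the sigmoid step would be dropped.

```python
import torch


def binary_accuracy(preds, y):
    """Fraction of rounded predictions that match the binary labels."""
    # Assumption: preds are raw logits, so squash them into [0, 1] first.
    rounded = torch.round(torch.sigmoid(preds))
    correct = (rounded == y).float()
    return correct.sum() / len(correct)


preds = torch.tensor([2.0, -1.0, 0.5, -3.0])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])
acc = binary_accuracy(preds, labels)
print(acc.item())  # 0.75
```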

cnn.compute_vocab_size()[source]

Returns the vocabulary size.

This method computes the vocabulary size based on the subsampling flag defined in the main thread. If subsampling is activated, the method sets the vocabulary size to the subsampled estimate of the vocabulary size. Otherwise, it sets the vocabulary size to the total count of unique tokens.

cnn.count_parameters()[source]

Counts model trainable parameters.

Returns:
int

The number of trainable parameters.
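
Counting trainable parameters in PyTorch usually comes down to summing the element counts of all parameters that require gradients. The sketch below passes the model as an explicit argument for illustration, whereas the documented function takes no arguments and presumably reads the model from module scope.

```python
import torch.nn as nn


def count_parameters(model):
    """Sum the element counts of all parameters that require gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


layer = nn.Linear(10, 2)  # 10 * 2 weights + 2 biases
print(count_parameters(layer))  # 22
```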

cnn.dataset_preprocessor(df, column, filepath)[source]

Preprocesses the text in the given dataset.

This method calls the predefined NLP preprocessor to filter out any non-alphanumeric character found in the given dataset. The method then saves the result to the given filepath, which may be relative. An example filepath is provided:

./filtered_dataset.csv

Parameters:
df: pandas.DataFrame

This is the given dataset.

column: str

This is the column with the user reviews.

filepath: str

This is the filepath where the preprocessed dataset is saved.

cnn.epoch_time()[source]

Computes epoch duration.

This method is called at the start and at the end of each epoch. It then uses the two checkpoints to compute the epoch’s duration.

Returns:
int

The number of whole minutes (rounded down) in the running epoch’s duration.

int

The remaining seconds in the running epoch’s duration.
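
The minutes and seconds split can be sketched as below. The explicit `start_time` and `end_time` arguments are assumptions made for illustration, since the documented function takes no arguments and presumably reads its checkpoints from module scope.

```python
def epoch_time(start_time, end_time):
    """Split an elapsed duration in seconds into whole minutes and seconds."""
    elapsed = end_time - start_time
    elapsed_mins = int(elapsed / 60)                 # rounded down
    elapsed_secs = int(elapsed - elapsed_mins * 60)  # remainder
    return elapsed_mins, elapsed_secs


print(epoch_time(0.0, 125.4))  # (2, 5)
```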

cnn.evaluate(iterator)[source]

Evaluates a model.

This method is called to either validate or test the defined CNN model. Gradient calculation is disabled during evaluation. The method shows progress to the user using progressbar.

Parameters:
iterator: torchtext.data.Iterator

An iterator to load batches of evaluation data from the given dataset.

Returns:
float

The epoch evaluation loss.

float

The epoch evaluation accuracy.
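
An evaluation loop with gradient calculation disabled might look like the sketch below. The explicit `model`, `batches`, and `criterion` arguments, the use of `BCEWithLogitsLoss`, and the toy linear model are all assumptions for illustration. The documented function takes only a torchtext iterator, and the progress bar is omitted here.

```python
import torch
import torch.nn as nn


def evaluate(model, batches, criterion):
    """Average loss and accuracy over batches without tracking gradients."""
    model.eval()  # switch off dropout for evaluation
    total_loss, total_acc, n_batches = 0.0, 0.0, 0
    with torch.no_grad():
        for inputs, labels in batches:
            logits = model(inputs).squeeze(1)
            loss = criterion(logits, labels)
            preds = torch.round(torch.sigmoid(logits))
            total_loss += loss.item()
            total_acc += (preds == labels).float().mean().item()
            n_batches += 1
    return total_loss / n_batches, total_acc / n_batches


# Toy stand-ins (assumptions): a linear model and two random batches.
toy_model = nn.Linear(4, 1)
toy_batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,)).float())
               for _ in range(2)]
loss, acc = evaluate(toy_model, toy_batches, nn.BCEWithLogitsLoss())
print(f"loss={loss:.3f} acc={acc:.3f}")
```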

cnn.filter_prediction(prediction, critic)[source]

Provides feedback over a custom prediction.

This method takes the model's prediction for a custom review (critic) and decides whether it is positive or negative. Finally, it prints the corresponding message.

Parameters:
prediction: float

The probability of the review being negative.

critic: str

The word sequence used as the review.
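
The decision step can be sketched as a simple threshold on the predicted probability. The helper name `sentiment_label`, the 0.5 threshold, and the message wording are hypothetical and are not taken from the module.

```python
def sentiment_label(prediction, threshold=0.5):
    """Map the probability of a negative review to a label (hypothetical helper)."""
    # Assumption: probabilities at or above the threshold count as negative.
    return "negative" if prediction >= threshold else "positive"


def filter_prediction(prediction, critic):
    """Print a short message describing the predicted sentiment."""
    label = sentiment_label(prediction)
    print(f'"{critic}" was classified as {label} (p_negative={prediction:.2f})')


filter_prediction(0.91, "A dull and predictable plot")
filter_prediction(0.08, "A wonderful film")
```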

cnn.get_max_length(df)[source]

Computes maximum number of tokens given a dataframe.

This method computes the maximum length found in the text column of the given dataframe. The column of the dataframe that holds the text is Summary. This function is useful if one decides to use padding for tokenization.

Parameters:
df: pandas.DataFrame

This is the dataset of the model.

Returns:
int

The maximum number of tokens found in a dataframe column.
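
One way to compute this with pandas is sketched below. It assumes whitespace tokenization and exposes the column name as a hypothetical `column` argument defaulting to Summary. The module's own tokenizer may count tokens differently.

```python
import pandas as pd


def get_max_length(df, column="Summary"):
    """Return the largest whitespace-token count in a text column."""
    return int(df[column].str.split().str.len().max())


reviews = pd.DataFrame(
    {"Summary": ["great movie", "not worth the time at all"]}
)
print(get_max_length(reviews))  # 6
```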

cnn.manual_testing()[source]

Calls the model on custom reviews.

This method defines some movie reviews that are used to test the model with custom data.

cnn.nlp_preprocessor(text)[source]

Defines an NLP preprocessor.

This method takes some text and filters it by deleting any non-alphanumeric character found. This is a standard preprocessing routine in machine learning models for NLP, and it often improves the model’s performance.

Parameters:
text: str

This is the string to preprocess.

Returns:
str

The preprocessed (filtered) string.
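
A regular-expression version of such a filter is sketched below. As an assumption, it replaces each run of non-alphanumeric characters with a single space instead of deleting it outright, so that word boundaries survive. The exact pattern used by the module is not shown in this documentation.

```python
import re


def nlp_preprocessor(text):
    """Replace runs of non-alphanumeric characters with a single space."""
    cleaned = re.sub(r"[^a-zA-Z0-9]+", " ", text)
    return cleaned.strip()


print(nlp_preprocessor("Great movie!!! 10/10, would watch again."))
# Great movie 10 10 would watch again
```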

cnn.plot_loss_and_accuracy()[source]

Plots model’s fitting results.

This method takes the lists containing the training and validation losses and plots them together. This is useful for detecting an over-fitted or under-fitted model. The method saves the plot in the project directory.

cnn.predict_sentiment(sentence, min_len=5)[source]

Classifies a custom review.

This method converts a sentence into numerical tokens. The tokens are then passed to a trained model, which predicts the sentiment of the review.

Parameters:
sentence: str

The custom critic to be classified.

min_len: int (optional)

The minimum number of tokens required for the given sentence.

Returns:
float

The probability of the critic being negative.
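
The role of `min_len` can be illustrated with a small padding helper. The reasoning is that a convolution kernel cannot slide over a sentence that has fewer tokens than the kernel covers, so short inputs are typically padded first. The helper name `pad_tokens` and the `<pad>` token string are hypothetical.

```python
def pad_tokens(tokens, min_len=5, pad_token="<pad>"):
    """Right-pad a token list so it has at least min_len entries."""
    missing = max(0, min_len - len(tokens))
    return tokens + [pad_token] * missing


print(pad_tokens("loved it".split()))
# ['loved', 'it', '<pad>', '<pad>', '<pad>']
```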

cnn.train(iterator)[source]

Fits a model.

This method is used to fit the defined CNN model. The method provides progress context for the user using progressbar.

Parameters:
iterator: torchtext.data.Iterator

An iterator to load batches of training data from the given dataset.

Returns:
float

The epoch training loss.

float

The epoch training accuracy.
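
A single training epoch generally follows the zero-grad, forward, backward, step pattern sketched below. The explicit `model`, `batches`, `optimizer`, and `criterion` arguments, the SGD optimizer, and the toy linear model are assumptions for illustration. The documented function takes only a torchtext iterator, and the progress bar is omitted here.

```python
import torch
import torch.nn as nn


def train(model, batches, optimizer, criterion):
    """Run one epoch of gradient updates; return mean loss and accuracy."""
    model.train()  # enable dropout while fitting
    total_loss, total_acc, n_batches = 0.0, 0.0, 0
    for inputs, labels in batches:
        optimizer.zero_grad()          # clear gradients from the last step
        logits = model(inputs).squeeze(1)
        loss = criterion(logits, labels)
        loss.backward()                # backpropagate
        optimizer.step()               # update the weights
        preds = torch.round(torch.sigmoid(logits))
        total_loss += loss.item()
        total_acc += (preds == labels).float().mean().item()
        n_batches += 1
    return total_loss / n_batches, total_acc / n_batches


# Toy stand-ins (assumptions): a linear model and two random batches.
toy_model = nn.Linear(4, 1)
toy_batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,)).float())
               for _ in range(2)]
optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)
loss, acc = train(toy_model, toy_batches, optimizer, nn.BCEWithLogitsLoss())
print(f"loss={loss:.3f} acc={acc:.3f}")
```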

cnn.train_validate_test_split(df, seed, train_percent=0.7, validate_percent=0.1)[source]

Splits the given dataset.

This method splits a dataframe into:
  • A dataframe used to train the model

  • A dataframe used to validate the model

  • A dataframe used to test the model

The indices of the given dataset are shuffled.

Parameters:
df: pandas.DataFrame

This is the given dataset.

seed: int

This is the seed used for the NumPy shuffler.

train_percent: float (optional)

This is the dataset split ratio to get the sample data to fit the model.

validate_percent: float (optional)

This is the dataset split ratio to get the sample data to validate the model.
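
A shuffled three-way split along these lines can be sketched as follows. It assumes a seeded `numpy.random.RandomState` as the shuffler and assigns whatever remains after the train and validation cuts to the test set. The module's exact implementation is not shown in this documentation.

```python
import numpy as np
import pandas as pd


def train_validate_test_split(df, seed, train_percent=0.7,
                              validate_percent=0.1):
    """Shuffle row indices with a seeded generator and cut them in three."""
    rng = np.random.RandomState(seed)
    shuffled = rng.permutation(df.index)
    train_end = int(train_percent * len(df))
    validate_end = train_end + int(validate_percent * len(df))
    train = df.loc[shuffled[:train_end]]
    validate = df.loc[shuffled[train_end:validate_end]]
    test = df.loc[shuffled[validate_end:]]  # the remainder
    return train, validate, test


data = pd.DataFrame({"Summary": [f"review {i}" for i in range(10)]})
train_df, validate_df, test_df = train_validate_test_split(data, seed=42)
print(len(train_df), len(validate_df), len(test_df))  # 7 1 2
```

With the default ratios, 70% of the rows go to training, 10% to validation, and the remaining 20% to testing.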