cnn module

CNN model for Sentiment Classification.

This script implements a Convolutional Neural Network (CNN) for binary sentiment classification. To classify the data, the model uses an embedding layer to convert each word into a numeric vector.

A convolution is a sliding-window function applied to a matrix. The sliding window is called a kernel, filter, or feature detector. By representing each word as a fixed-length vector of numbers and stacking the word vectors on top of each other, we get a matrix that can be processed like a single-channel image.
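The stacking of word vectors into an image-like matrix can be sketched as follows. The sizes and token indices below are hypothetical values chosen only for illustration; they are not taken from the module.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, chosen only for illustration.
vocab_size, embedding_dim, pad_idx = 1000, 100, 0

embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)

# One sentence of 7 token indices; the last two positions are padding.
tokens = torch.tensor([[12, 45, 7, 301, 9, 0, 0]])

# Each index becomes one row of length embedding_dim: [batch, seq_len, emb_dim].
matrix = embedding(tokens)

# Adding a channel dimension lets a 2D convolution treat the matrix
# like a single-channel image: [batch, 1, seq_len, emb_dim].
image_like = matrix.unsqueeze(1)
print(matrix.shape, image_like.shape)
```

The rows that correspond to the padding index are initialized to zeros by `nn.Embedding` when `padding_idx` is set.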

References

The deep learning framework used to develop this module is PyTorch [1].

[1]

A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. "PyTorch: An Imperative Style, High-Performance Deep Learning Library." In Advances in Neural Information Processing Systems 32, edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, pp. 8024-8035. Curran Associates, Inc., 2019.

class cnn.CNN(vocab_size, embedding_dim, n_filters, filter_sizes, output_dim, dropout, pad_idx)

Bases: Module

Convolutional Neural Network model with Pretrained Embeddings.

Attributes:

vocab_size: int: Size of the dictionary of embeddings.
embedding_dim: int: The size of each embedding vector.
n_filters: int: Number of channels produced by the convolution.
filter_sizes: list: A list of integers giving the kernel sizes, that is, the number of consecutive words each convolution window spans.
output_dim: int: The size of the output fully connected layer.
dropout: float: The probability of an element to be zeroed.
pad_idx: int: The numerical identifier mapped to the string token used as padding.

Methods

conv_and_pool(x, conv)	Applies a 2D convolution followed by 1D max pooling.
forward(x)	Defines the computation performed by the CNN model at every call.

static conv_and_pool(x, conv)

Applies a 2D convolution followed by 1D max pooling.

The method applies a 2D convolution over the input and passes the convolved output through a Rectified Linear Unit (ReLU). The result is a tensor of size [32 x 64 x Y x 100] where:

32 is the batch size

64 is the number of filters

Y is the sequence length, which is equal to the sentence length

100 is the size of the second dimension of the kernel of a convolutional layer

This temporary result is then squeezed to yield a tensor of size [32 x 64 x Y]. In the last step, the method applies a 1D max pooling over the squeezed tensor with a sliding window of size equal to Y. The result is then squeezed again to produce a tensor of size [32 x 64].

Parameters:

x: torch.tensor: This is a tensor of type float that operates as input for each convolution layer.
conv: torch.nn.Module: Applies a 2D convolution over a given input.

Returns:

torch.tensor: The 2D tensor to be used in the linear layer.
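The convolve, squeeze, and pool steps can be traced with a minimal sketch. It is not the module's source: the function name `conv_and_pool_sketch` and all sizes are hypothetical, and it assumes the kernel spans the full embedding width, so the last dimension of the convolved output has size 1 and the length dimension is `seq_len - fs + 1`. The module's actual layer configuration may produce different intermediate sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_and_pool_sketch(x, conv):
    """Hypothetical stand-in for conv_and_pool; x is [batch, 1, seq_len, emb_dim]."""
    x = F.relu(conv(x))              # [batch, n_filters, L, 1] under the assumption above
    x = x.squeeze(3)                 # [batch, n_filters, L]
    x = F.max_pool1d(x, x.size(2))   # pool over the whole length: [batch, n_filters, 1]
    return x.squeeze(2)              # [batch, n_filters]


# Hypothetical sizes for illustration.
batch, seq_len, emb_dim, n_filters, fs = 32, 20, 100, 64, 3
conv = nn.Conv2d(in_channels=1, out_channels=n_filters, kernel_size=(fs, emb_dim))
x = torch.randn(batch, 1, seq_len, emb_dim)

out = conv_and_pool_sketch(x, conv)
print(out.shape)
```

Pooling with a window equal to the full length keeps only the strongest response of each filter, which is why the output no longer depends on the sentence length.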

forward(x)

Defines the computation performed by the CNN model at every call.

This method forwards the given input to every single model layer.

Parameters:

x: torch.tensor: This is a tensor of type int that operates as input for the defined model.

Returns:

torch.tensor: The tensor containing the predictions made by the model.
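For orientation, a forward pass through a model with the constructor arguments listed above might look like the sketch below. The class `SketchCNN` and its layer names are hypothetical; the layout (embedding, one convolution per filter size, max pooling, dropout, linear layer) is an assumption based on the attribute descriptions, not a copy of the module's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchCNN(nn.Module):
    """Hypothetical model; the real cnn.CNN may be organized differently."""

    def __init__(self, vocab_size, embedding_dim, n_filters, filter_sizes,
                 output_dim, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)
        # One convolution per kernel size, each spanning the full embedding width.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, n_filters, (fs, embedding_dim)) for fs in filter_sizes]
        )
        self.fc = nn.Linear(len(filter_sizes) * n_filters, output_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        emb = self.embedding(x).unsqueeze(1)          # [batch, 1, seq_len, emb_dim]
        pooled = []
        for conv in self.convs:
            c = F.relu(conv(emb)).squeeze(3)          # [batch, n_filters, L]
            pooled.append(F.max_pool1d(c, c.size(2)).squeeze(2))
        cat = self.dropout(torch.cat(pooled, dim=1))  # [batch, n_filters * len(filter_sizes)]
        return self.fc(cat)                           # [batch, output_dim]


model = SketchCNN(1000, 100, 64, [3, 4, 5], 1, 0.5, 0)
x = torch.randint(0, 1000, (32, 20))  # integer token indices
print(model(x).shape)
```

The input must be an integer tensor because the embedding layer looks up rows by index.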

training: bool

cnn.binary_accuracy(preds, y)

Computes prediction accuracy.

This method estimates the model’s accuracy over binary targets. Each prediction is rounded to the nearest integer and compared with the true label.

Parameters:

preds: torch.tensor: These are the predictions returned by the model for an input batch.
y: torch.tensor: This is the ground truth tensor for the same input batch.

Returns:

torch.tensor: The model accuracy ratio for the given predictions. The tensor is a single float element container.
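A minimal sketch of the rounding and comparison is shown below. It assumes that `preds` holds raw logits and therefore applies a sigmoid before rounding; if the module's predictions are already probabilities, the sigmoid would not be needed. The name `binary_accuracy_sketch` is hypothetical.

```python
import torch


def binary_accuracy_sketch(preds, y):
    # Assumption: preds are raw logits, so map them into [0, 1] first.
    rounded = torch.round(torch.sigmoid(preds))
    correct = (rounded == y).float()
    # Ratio of correct predictions as a single-element float tensor.
    return correct.sum() / len(correct)


preds = torch.tensor([2.0, -1.0, 0.5, -3.0])
y = torch.tensor([1.0, 0.0, 0.0, 0.0])
acc = binary_accuracy_sketch(preds, y)
print(acc)
```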

cnn.compute_vocab_size()

Returns the vocabulary size.

This method computes vocabulary size. It uses the subsampling flag defined in the main thread. If subsampling is activated, then the method sets the vocabulary size equal to the subsampled estimation of the vocabulary size. Otherwise, it sets the vocabulary size equal to the total unique token count.

cnn.count_parameters()

Counts model trainable parameters.

Returns:

int: The number of trainable parameters.
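Counting trainable parameters in PyTorch is commonly done by summing the element counts of all parameters that require gradients. The sketch below takes the model as an argument for self-containment, which is an assumption; the documented function takes no arguments and presumably reads a model defined elsewhere.

```python
import torch.nn as nn


def count_parameters_sketch(model):
    # Only parameters with requires_grad=True are updated during training.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


layer = nn.Linear(10, 2)           # 10 * 2 weights + 2 biases
n_all = count_parameters_sketch(layer)

layer.bias.requires_grad_(False)   # freeze the bias
n_frozen_bias = count_parameters_sketch(layer)
print(n_all, n_frozen_bias)
```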

cnn.dataset_preprocessor(df, column, filepath)

Preprocesses the text in the given dataset.

This method calls the predefined NLP preprocessor to filter out any non-alphanumeric character found in the given dataset. It then saves the result to the given filepath, which can also be relative. An example filepath is provided:

./filtered_dataset.csv

Parameters:

df: pandas.DataFrame: This is the given dataset.
column: str: This is the column that contains the user reviews.
filepath: str: This is the filepath where the preprocessed dataset is saved.

cnn.epoch_time()

Computes epoch duration.

This method is called at the start and at the end of each epoch. It then uses the recorded timestamps to compute the epoch’s duration.

Returns:

int: The whole minutes (rounded down) of the running epoch’s duration.
int: The remaining seconds of the running epoch’s duration.
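The split of a duration into whole minutes and remaining seconds can be sketched as follows. Passing the two timestamps as arguments is an assumption made for self-containment; the documented function takes no arguments.

```python
def epoch_time_sketch(start_time, end_time):
    # Timestamps are in seconds, e.g. values returned by time.time().
    elapsed = end_time - start_time
    mins = int(elapsed / 60)            # whole minutes, rounded down
    secs = int(elapsed - mins * 60)     # remaining whole seconds
    return mins, secs


mins, secs = epoch_time_sketch(0.0, 125.4)
print(f"Epoch time: {mins}m {secs}s")
```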

cnn.evaluate(iterator)

Evaluates a model.

This method is called to either validate or test the defined CNN model. Gradient calculation is disabled during evaluation. The method provides progress context for the user using progressbar.

Parameters:

iterator: torchtext.data.Iterator: An iterator to load batches of evaluation data from the given dataset.

Returns:

float: The epoch evaluation loss
float: The epoch evaluation accuracy
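An evaluation loop with gradient calculation disabled might look like the sketch below. Several things here are assumptions: the model, batches, and loss function are passed as arguments, batches are plain `(x, y)` tuples rather than a torchtext iterator, predictions are treated as logits, loss and accuracy are averaged per example, and the progress bar is omitted.

```python
import torch
import torch.nn as nn


def evaluate_sketch(model, batches, criterion):
    model.eval()  # switch off dropout and similar training-only behavior
    total_loss, total_correct, total_count = 0.0, 0, 0
    with torch.no_grad():  # no gradients are tracked inside this block
        for x, y in batches:
            logits = model(x).squeeze(1)
            total_loss += criterion(logits, y).item() * y.size(0)
            preds = torch.round(torch.sigmoid(logits))
            total_correct += (preds == y).sum().item()
            total_count += y.size(0)
    return total_loss / total_count, total_correct / total_count


# Hypothetical stand-ins for the real model and data.
model = nn.Linear(4, 1)
batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,)).float()) for _ in range(3)]
loss, acc = evaluate_sketch(model, batches, nn.BCEWithLogitsLoss())
print(loss, acc)
```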

cnn.filter_prediction(prediction, critic)

Provides feedback on a custom prediction.

This method takes the model’s prediction for a custom review and decides whether the review was positive or negative. Finally, it prints the corresponding message.

Parameters:

prediction: float: The probability of the review being negative.
critic: str: The word sequence used as the review.
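The decision step can be sketched with a simple threshold. The 0.5 threshold, the message format, and the returned string are assumptions made for illustration; the documented function only prints a message.

```python
def filter_prediction_sketch(prediction, critic, threshold=0.5):
    # prediction is the probability that the review is negative.
    label = "NEGATIVE" if prediction >= threshold else "POSITIVE"
    message = f"{label}: {critic}"
    print(message)
    return message  # returned only so the sketch is easy to check


filter_prediction_sketch(0.91, "The plot made no sense")
filter_prediction_sketch(0.08, "A wonderful film")
```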

cnn.get_max_length(df)

Computes the maximum number of tokens in a dataframe.

This method computes the maximum length found in the text column of the given dataframe. The text column of the dataframe is named Summary. This function is useful if one decides to use padding for tokenization.

Parameters:

df: pandas.DataFrame: This is the dataset of the model.

Returns:

int: The maximum number of tokens found in a dataframe column.
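A minimal sketch of this computation with pandas is shown below. It assumes that tokens are whitespace-separated words; the module's own tokenizer may count tokens differently. The `column` argument is added for illustration only.

```python
import pandas as pd


def get_max_length_sketch(df, column="Summary"):
    # Assumption: a token is any whitespace-separated word.
    token_counts = df[column].astype(str).str.split().str.len()
    return int(token_counts.max())


df = pd.DataFrame(
    {"Summary": ["great movie", "not what I expected at all", "meh"]}
)
max_len = get_max_length_sketch(df)
print(max_len)
```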

cnn.manual_testing()

Calls the model on custom reviews.

This method defines some movie reviews to test the model with custom data.

cnn.nlp_preprocessor(text)

Defines an NLP preprocessor.

This method takes some text and filters it by deleting any non-alphanumeric character found. This is a standard preprocessing routine in machine learning models for NLP, and it typically improves the model’s performance.

Parameters:

text: str: This is the string to preprocess.

Returns:

str: The preprocessed (filtered) string.
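Filtering of this kind is often written with a regular expression. The sketch below assumes that whitespace is kept so that words stay separable, and it also collapses repeated spaces; neither detail is confirmed by the description above.

```python
import re


def nlp_preprocessor_sketch(text):
    # Assumption: keep letters, digits, and whitespace; drop everything else.
    filtered = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Collapse runs of whitespace left behind by removed characters.
    return re.sub(r"\s+", " ", filtered).strip()


cleaned = nlp_preprocessor_sketch("Great movie!!! 10/10, would watch again :)")
print(cleaned)
```

Note that removing punctuation also merges tokens such as "10/10" into "1010", which may or may not be desirable for a given dataset.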

cnn.plot_loss_and_accuracy()

Plots the model’s fitting results.

This method takes the lists containing the training and the validation losses and plots them together. This is useful for detecting an over-fitted or under-fitted model. The method saves the plot in the project directory.

cnn.predict_sentiment(sentence, min_len=5)

Classifies a custom review.

This method converts a sentence into numeric tokens. The tokens are then given to a trained model, which predicts the sentiment of the review.

Parameters:

sentence: str: The custom review to be classified.
min_len: int (optional): The minimum number of tokens in the given sentence.

Returns:

float: The probability of the review being negative.
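The conversion of a sentence into numeric tokens with a minimum length can be sketched as follows. A minimum length is typically needed because a convolution kernel cannot be applied to a sequence shorter than the kernel. The tokenizer (lowercase and split on whitespace), the special tokens, and the `vocab` dictionary are all hypothetical.

```python
def to_indices_sketch(sentence, vocab, min_len=5,
                      pad_token="<pad>", unk_token="<unk>"):
    # Assumption: tokens are lowercase whitespace-separated words.
    tokens = sentence.lower().split()
    # Pad short sentences up to min_len tokens.
    if len(tokens) < min_len:
        tokens += [pad_token] * (min_len - len(tokens))
    # Unknown words fall back to the index of the unknown token.
    return [vocab.get(token, vocab[unk_token]) for token in tokens]


# Hypothetical vocabulary for illustration.
vocab = {"<unk>": 0, "<pad>": 1, "great": 2, "movie": 3}
indices = to_indices_sketch("Great movie", vocab)
print(indices)
```

The resulting list would then be turned into an integer tensor with a batch dimension before being passed to the model.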

cnn.train(iterator)

Fits a model.

This method is used to fit the defined CNN model. The method provides progress context for the user using progressbar.

Parameters:

iterator: torchtext.data.Iterator: An iterator to load batches of training data from the given dataset.

Returns:

float: The epoch training loss
float: The epoch training accuracy

cnn.train_validate_test_split(df, seed, train_percent=0.7, validate_percent=0.1)

Splits the given dataset.

This method splits a dataframe into:

A dataframe used to train the model
A dataframe used to validate the model
A dataframe used to test the model

The indices of the given dataset are shuffled before it is split.

Parameters:

df: pandas.DataFrame: This is the given dataset.
seed: int: This is the seed used for the NumPy shuffler.
train_percent: float (optional): This is the dataset split ratio to get the sample data to fit the model.
validate_percent: float (optional): This is the dataset split ratio to get the sample data to validate the model.
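A common way to implement such a split is to shuffle the index with a seeded NumPy generator and slice it by the two ratios. The sketch below follows that pattern under the assumption that the remaining rows form the test set; the name `split_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def split_sketch(df, seed, train_percent=0.7, validate_percent=0.1):
    # Shuffle the index reproducibly with the given seed.
    rng = np.random.RandomState(seed)
    perm = rng.permutation(df.index)
    train_end = int(train_percent * len(perm))
    validate_end = train_end + int(validate_percent * len(perm))
    train = df.loc[perm[:train_end]]
    validate = df.loc[perm[train_end:validate_end]]
    test = df.loc[perm[validate_end:]]  # assumption: the rest is the test set
    return train, validate, test


df = pd.DataFrame({"Summary": [f"review {i}" for i in range(100)]})
train, validate, test = split_sketch(df, seed=42)
print(len(train), len(validate), len(test))
```

Using the same seed reproduces the same three subsets, which keeps experiments comparable across runs.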
