Binary Text Classifier For USA Elections Tweets 2020

Hi guys!

In this post, I want to share my project to create a Binary Text Classifier for Tweets using a DeepLearning technique with Tensorflow. The aim of this work is to classify tweets related to the USA elections 2020 to predict if a given tweet is in favor of Trump or Biden.

Important: this project has a GitHub repository and you can download the Jupyter Notebook used as well as the preprocessed datasets and the trained models.

Important (II): I have also made a web application to use the trained model with Flask and it is available in this GitHub repository. Follow the instructions of the README file to use it.

Important (III): all the imports are in the GitHub repository.

This post isn't intended to be a complete tutorial of Tensorflow, but it explains in detail the purpose of each function so that you can fully understand the project and how the model has been designed.

WEB APPLICATION VIDEO

Get data for training the model.
Preprocessing.

Preprocessing. Step 1: paths and constants.
Preprocessing. Step 2: standardization.
Preprocessing. Step 3: filter English tweets.
Preprocessing. Step 4: apply preprocess.

Create an input pipeline.

Why have I done some preprocessing before building an input pipeline?
Building the input pipeline.

Building the model.

Text vectorization layer.
Configure datasets for performance.
Avoiding overfitting.
Model configuration.
Loss function and optimizer.
Training the model.
Deploy the model.

Conclusions. Is the model good enough?

GET DATA FOR TRAINING THE MODEL

First of all, data. We are going to use a Machine Learning model based on Supervised Learning, so we need data to train our model. We can find lots of datasets related not only to this particular problem but also to many others in Kaggle.

Kaggle is the world's largest data science community. There you can find datasets, tutorials, data science competitions, and forums, and it's free to use.

The dataset is in this link. When you enter, click on the download button and extract the zip file.

It contains two CSV files: one with tweets in favor of Trump and the other in favor of Biden.

There is also information about the posted date, the number of likes, retweets, and more, but we only need the text of the tweet.

PREPROCESSING

The preprocessing consists of:

Get only the text of the tweet and ignore other attributes.
Standardize the text, which includes:

Lower case.
Remove line breaks.
Remove URLs.
Remove emojis.
Remove punctuation.

Filter English tweets (tweets in other languages are not going to be considered).

At the end of the preprocessing, we want three datasets: training, validation, and test. Each dataset is formed by two CSV files: one for Trump and one for Biden.

Important:

We are going to treat the datasets as if they were huge, which means that we can't load the full dataset in memory (although, actually, we could).

The reason is to show how we could build an input pipeline in Tensorflow for our model. In fact, I also decided to create six CSV files to simulate extracting the information from different sources.

PREPROCESSING. STEP 1: PATHS AND CONSTANTS

We need the path of the two CSV files downloaded and we have to create six new paths to save the output of the preprocessing.

For that purpose, we have created a function called train_test_val_paths which creates the train, test, and validation path of a given path.

For example, for the path './input/hashtag_donaldtrump.csv', the output will be:

'./input/train_standarized_hashtag_donaldtrump.csv'
'./input/test_standarized_hashtag_donaldtrump.csv'
'./input/val_standarized_hashtag_donaldtrump.csv'

Code:

def train_test_val_paths(path):
    # Get directory and filename
    dirname = os.path.dirname(path)
    basename = os.path.basename(path)
    
    # Create path for each dataset
    train_path = os.path.join(dirname, 'train_standarized_' + basename)
    test_path = os.path.join(dirname, 'test_standarized_' + basename)
    val_path = os.path.join(dirname, 'val_standarized_' + basename)
    
    return train_path, test_path, val_path

The paths and the constants for the preprocessing will be:

Code:

BATCH_SIZE= 1000
TEST_SIZE = 0.15 
VAL_SIZE= 0.15

trump_path = './input/hashtag_donaldtrump.csv'
biden_path = './input/hashtag_joebiden.csv'

train_trump, test_trump, val_trump = train_test_val_paths(trump_path)
train_biden, test_biden, val_biden = train_test_val_paths(biden_path)

PREPROCESSING. STEP 2: STANDARDIZATION

Remember that the standardization function must be able to:

Lower case the text.
Remove line breaks.
Remove URLs.
Remove emojis.
Remove punctuation

For this purpose, Tensorflow has some functions that let you replace words that match with a regular expression or pattern. This is especially useful for URLs, emojis, and punctuation.

The regular expression for URLs is really simple, and the set of punctuation characters is already available in Python's string module (string.punctuation).

However, for the emojis, we have to install the emoji package in our environment using pip install emoji. After that, we can get the pattern that will match any emoji in the text by calling emoji.get_emoji_regexp().pattern. Note that get_emoji_regexp() was removed in version 2.0 of the emoji package, so this call requires an older release.

Code:

def custom_standardization(text):
    # Lower case
    lower_text = tf.strings.lower(text)

    # Remove line breaks
    lower_text = tf.strings.regex_replace(input=lower_text, 
                                          pattern='\n', 
                                          rewrite=' ')

    # Remove URLs (raw string so that \S is not treated as a Python escape)
    free_url_text = tf.strings.regex_replace(input=lower_text, 
                                             pattern=r"http\S+", 
                                             rewrite=' ')

    # Remove emojis
    emoji_pattern = emoji.get_emoji_regexp().pattern
    free_emoji_text = tf.strings.regex_replace(input=free_url_text, 
                                               pattern='[%s]' % re.escape(emoji_pattern),
                                               rewrite=' ')

    # Remove punctuation
    punctuation_pattern = string.punctuation
    free_punctuation_text =  tf.strings.regex_replace(free_emoji_text,
                                                      '[%s]' % re.escape(punctuation_pattern),
                                                      ' ')
    return free_punctuation_text.numpy().decode()

There are some important aspects to highlight:

This function uses Tensorflow built-in functions, which means that they are optimized to work with batches.
The input of the function must be tensors with text and only text.
The pattern in the emoji and punctuation steps is wrapped in '[%s]' to build a character class: the regular expression then matches any single character of the set, and re.escape escapes the characters that would otherwise have a special meaning. For more information, check the official documentation of re, the library used for regular expressions.
The output will be a Python string, because after standardizing the text we are going to apply a python function to filter English tweets. So, we have to:

Get the NumPy value of the tensor.
Decode the content to get strings, as Tensorflow stores text as bytes.
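
To see what the '[%s]' wrapper does, here is a minimal sketch that uses only Python's re and string modules instead of Tensorflow. The helper name remove_punctuation and the sample sentence are assumptions made for illustration, and tf.strings.regex_replace relies on a different regex engine (RE2), so treat it as an approximation of the step above.

```python
import re
import string

# Build a character class that matches ANY single punctuation character.
# re.escape protects characters such as ] or \ that would otherwise have
# a special meaning inside the brackets.
PUNCTUATION_CLASS = '[%s]' % re.escape(string.punctuation)


def remove_punctuation(text):
    """Replace every punctuation character with a space (hypothetical helper)."""
    return re.sub(PUNCTUATION_CLASS, ' ', text)


print(remove_punctuation('vote now!!! #election2020'))
```

Without the brackets, the escaped string would only match the whole sequence of punctuation characters in that exact order, which almost never appears in a tweet.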

PREPROCESSING. STEP 3: FILTER ENGLISH TWEETS

In the dataset, there are tweets written in other languages, which would be noisy data for our model. We are going to use the Natural Language ToolKit library (nltk) for this task.

Code:

def is_english(text):
    languages_ratios = {}
    tokens = wordpunct_tokenize(text)
    words = [word.lower() for word in tokens]

    # Compute per language included in nltk number of unique stopwords appearing in analyzed text
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        words_set = set(words)
        common_elements = words_set.intersection(stopwords_set)

        languages_ratios[language] = len(common_elements) # language "score"
    
    most_rated_language = max(languages_ratios, key=languages_ratios.get)
      
    return most_rated_language == 'english'

The original function is from a user of Kaggle that applied it in one of their notebooks. The notebook is in this link.

This function receives a string and returns True if the string is a sentence written in English. It's important to do this step after standardizing because emojis and URLs can confuse this function, since they are also counted as words.
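
The idea behind the function can be sketched without nltk. The stopword lists below are tiny and made up for the example (nltk ships much larger lists for many more languages), and the names language_scores and guess_language are hypothetical.

```python
# Tiny, hypothetical stopword lists; nltk provides real ones per language.
STOPWORDS = {
    'english': {'the', 'is', 'and', 'to', 'of', 'in', 'for'},
    'spanish': {'el', 'la', 'es', 'y', 'de', 'en', 'para'},
}


def language_scores(text):
    """Count how many distinct stopwords of each language appear in the text."""
    words = set(text.lower().split())
    return {lang: len(words & stops) for lang, stops in STOPWORDS.items()}


def guess_language(text):
    """Return the language with the highest stopword overlap."""
    scores = language_scores(text)
    return max(scores, key=scores.get)


print(language_scores('the election is in november'))
print(guess_language('la eleccion es en noviembre'))
```

One limitation of this kind of scoring: a text without any stopword scores zero for every language, and max then simply returns the first language in the dictionary.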

PREPROCESSING. STEP 4: APPLY PREPROCESS

As I mentioned previously, we are going to consider that our datasets are huge and we can't load the full dataset in memory. We are going to use chunks, also called batches.

A chunk is a small portion of the dataset that can be loaded in memory.

We are going to load chunks, apply the preprocessing to each one, and save each of them in a randomly chosen file: the train, validation, or test CSV file.

Code:

def get_tweets_split(path):
    
    train_path, test_path, val_path = train_test_val_paths(path)
    
    # Create an empty csv file to append chunks
    header_df = pd.DataFrame(columns=['tweet'])
    header_df.to_csv(train_path, index=False)
    header_df.to_csv(test_path, index=False)
    header_df.to_csv(val_path, index=False)
    
    # Define the probabilities to select each path
    TRAIN_SIZE = 1 - TEST_SIZE - VAL_SIZE
    probabilities = [TRAIN_SIZE, TEST_SIZE, VAL_SIZE]
    paths = [train_path, test_path, val_path]
    
    # Create the DataFrame Reader
    df = pd.read_csv(path, 
                     lineterminator='\n', 
                     chunksize=BATCH_SIZE, 
                     usecols=['tweet'])

    # Split the dataset
    for chunk in df:
        # Standardize tweets
        chunk['tweet'] = chunk['tweet'].apply(custom_standardization)
        
        # Remove non-English tweets
        chunk = chunk[chunk['tweet'].map(is_english)]
        
        # Select an output path (a new name avoids shadowing the path argument)
        out_path = np.random.choice(paths, p=probabilities)
        
        # Save
        chunk.to_csv(out_path, index=False, mode='a', header=False)

This function processes one CSV file and we have two, so we have to call it twice:

Code:

get_tweets_split(path=trump_path)
get_tweets_split(path=biden_path)
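
Because every chunk goes as a whole to one randomly chosen file, the final split only approximates the 70/15/15 proportions. The sketch below simulates that choice with Python's random module instead of np.random.choice. The function name assign_chunks, the seed, and the number of chunks are assumptions made for the example.

```python
import random
from collections import Counter

TEST_SIZE = 0.15
VAL_SIZE = 0.15
TRAIN_SIZE = 1 - TEST_SIZE - VAL_SIZE


def assign_chunks(n_chunks, seed=0):
    """Pick a destination split for each chunk and count the results."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    splits = ['train', 'test', 'val']
    weights = [TRAIN_SIZE, TEST_SIZE, VAL_SIZE]
    return Counter(rng.choices(splits, weights=weights, k=n_chunks))


n_chunks = 1000  # made-up number of chunks
counts = assign_chunks(n_chunks)
for split, count in counts.items():
    print(split, count / n_chunks)
```

With many chunks the observed fractions get close to the target sizes. With only a few chunks they can be quite far off.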

CREATE THE INPUT PIPELINE

WHY HAVE I DONE SOME PREPROCESSING BEFORE BUILDING AN INPUT PIPELINE?

In Machine Learning, it is very common to have huge amounts of data that must be preprocessed and fed into a model. That's why Tensorflow provides lots of functions to build an input pipeline.

An input pipeline is an abstract interface that describes how the input must be processed.

For example, in our case, we can define an input pipeline that loads batches (remember that batches and chunks are portions of the dataset) of tweets, apply some functions to these batches like standardizing and filtering by language, and then feed the model with that preprocessed batch.

For this project, we have done some preprocessing before creating the input pipeline. We have already standardized the tweets and filtered them by language.

The reason is that the function that filters by language is a python function that is not optimized to work with batches. This means that it processes tweets one by one, and that considerably slows down the training of the model.

Besides, the preprocessing would not only run during the first epoch of the training. Every later epoch would have to repeat it, so the performance would stay really bad.

In fact, after measuring the effect of doing the preprocessing first, I found that the model trained 100 times faster (although the preprocessing itself took two hours on my computer).

So, is it useful to build an input pipeline? The answer is definitely yes. We have been able to do this portion of the preprocessing first because the dataset is small. Imagine that, instead, we were working with 100 TB or more. It would be a bad choice to save that amount of data again.

In real cases, we would have to implement an optimized function that works with batches using Tensorflow operations to filter by language, but this is not covered in this post.

We still consider that the datasets are huge and we can't load the full dataset in memory. This is just an exception to avoid training for a week on my computer 😅.

BUILDING AN INPUT PIPELINE

To build the input pipeline we are going to use the tf.data.Dataset class. A tf.data.Dataset is evaluated lazily, which means that we can create Datasets and apply operations to them, but the actual computation is not done until it is needed.

For example, we can indicate that we want to load a CSV file in batches of 100 rows, then add a new column to the batch, combine each batch with another batch, shuffle them, and more.

But these operations of loading, concatenating, combining, and shuffling are not going to be done until you actually need the data, for example, to print it on a screen or to train a Machine Learning model.

So, we can define all the operations that are needed before feeding the model with the data, and then, when we want to actually train, these operations will be executed. That's an input pipeline.
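
As an analogy only (this is plain Python, not tf.data), generators behave in a similar way: defining the steps does no work, and the rows are read only when something consumes them. The function names and sample rows are made up for the example.

```python
log = []  # records when rows are actually read


def read_rows(rows):
    """Yield rows one by one, logging each read."""
    for row in rows:
        log.append('read ' + row)
        yield row


def lowercase(rows):
    """Lazily lower-case each row."""
    for row in rows:
        yield row.lower()


# Defining the pipeline does not read anything yet
pipeline = lowercase(read_rows(['Vote TODAY', 'Election NIGHT']))
log_before_consuming = list(log)

# The work happens only when the data is consumed
result = list(pipeline)

print(log_before_consuming)
print(result)
print(log)
```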

First, we have to create a tf.data.Dataset from a CSV file. Tensorflow has a specific function for that called make_csv_dataset.

Code:

def create_dataset(path):
    
    # Create the dataset
    ds = tf.data.experimental.make_csv_dataset(path, 
                                              batch_size=BATCH_SIZE, 
                                              select_columns=['tweet'],
                                              num_epochs=1)
    # Get the data in the tweet column
    ds = ds.map(lambda x: x['tweet'])

    return ds

Important: the parameter num_epochs controls the number of repetitions of our dataset. For example, if we have a CSV file with 10 rows and we set the num_epochs to 2, the resultant tf.data.Dataset will have 20 rows. If we don't specify this parameter, the resultant tf.data.Dataset will have an infinite number of repetitions of the original CSV file and we would never finish one epoch during training.

Second, we have to add the class attribute that differentiates tweets between Trump and Biden. Class 0 will be Trump and class 1 Biden.

Remember: we are working with batches and we have to add one label for each tweet.

Code:

def add_class_column(dataset, class_value):
    return dataset.map(lambda x: (x, tf.repeat(class_value, tf.size(x))))

Finally, we are going to create a new tf.data.Dataset combining tweets from Trump and Biden.

For that, Tensorflow has a function called sample_from_datasets, which takes elements from the given datasets randomly and combines them into a single tf.data.Dataset.

Code:

def input_pipeline(trump_path, biden_path, class_atributte=True):
    
    trump_ds = create_dataset(trump_path)
    biden_ds = create_dataset(biden_path)
    
    # Add class atributte
    if class_atributte:
        trump_ds = add_class_column(trump_ds, 0)
        biden_ds = add_class_column(biden_ds, 1)
    
    datasets = [
        trump_ds.unbatch(),
        biden_ds.unbatch()
    ]
    
    # Equally merge datasets
    trump_biden_ds = tf.data.experimental.sample_from_datasets(datasets=datasets, 
                                                               weights=[0.5, 0.5])
        
    # Batch the dataset
    trump_biden_ds = trump_biden_ds.batch(BATCH_SIZE)
    
    return trump_biden_ds

There are some important aspects to highlight:

Before calling the sample_from_datasets function, we have to unbatch the batches to get individual tweets. If we didn't do that, we would be combining batches, not individual tweets.
The weights parameter indicates the probability of selecting one element from each dataset.
After combining the datasets, we have to batch again. Remember that Tensorflow is optimized to work with batches.
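
The unbatch, sample, and batch sequence can be imitated with plain Python iterators. This is only a sketch of the idea, not how Tensorflow implements it: the helper names, the toy tweets, and the choice to keep sampling from the remaining source when one runs out are assumptions of the example.

```python
import random
from itertools import islice


def unbatch(batches):
    """Flatten batches of (tweet, label) pairs into single pairs."""
    for batch_items in batches:
        yield from batch_items


def sample_from(sources, weights, rng):
    """Draw items at random from several iterators until all are exhausted."""
    sources = [iter(s) for s in sources]
    weights = list(weights)
    while sources:
        i = rng.choices(range(len(sources)), weights=weights)[0]
        try:
            yield next(sources[i])
        except StopIteration:
            # Drop the exhausted source and keep sampling from the rest
            del sources[i]
            del weights[i]


def batch(items, size):
    """Group items into lists of at most `size` elements."""
    items = iter(items)
    while True:
        chunk = list(islice(items, size))
        if not chunk:
            return
        yield chunk


trump_batches = [[('t1', 0), ('t2', 0)], [('t3', 0)]]
biden_batches = [[('b1', 1), ('b2', 1)], [('b3', 1)]]

mixed = sample_from([unbatch(trump_batches), unbatch(biden_batches)],
                    weights=[0.5, 0.5], rng=random.Random(0))
batches = list(batch(mixed, size=4))
print(batches)
```

Each new batch now mixes tweets of both classes, and every tweet keeps its own label.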

The full process is represented in the next image:

We have to call the input_pipeline function for the train, validation, and test dataset.

Code:

train_ds = input_pipeline(train_trump, train_biden, class_atributte=True)
val_ds = input_pipeline(val_trump, val_biden, class_atributte=True)
test_ds = input_pipeline(test_trump, test_biden, class_atributte=True)

BUILDING THE MODEL

The Deep Learning model that we are going to use is a Bidirectional Recurrent Neural Network, or Bidirectional RNN. An RNN is a Neural Network (NN) in which the output of each neuron feeds not only the input of the next layer but also the input of the next neuron in the same layer.

This is useful for text analysis because each neuron has information about the previous words.

A Bidirectional RNN uses the same logic, but the inputs are propagated backward and forward in each layer, which means that each neuron knows the context of each word in the sentence, as it has information about the previous and the next words.

The complete diagram of the model that we are going to use is shown below.

Note: this image has been extracted from the Text classification with an RNN tutorial of TensorFlow.

The steps that the model takes are:

Associate an index to each word of the sentence (TextVectorization).
Associate a vector to each word-index (Embedding). Vector values will change during training and words with similar meaning will end up having similar vectors associated.
Feed a Bidirectional RNN with these vectors.
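Important: here we add a regular (dense) NN layer of 16 neurons (this is not shown in the diagram).
Feed a single neuron with the output of the previous layer. This neuron returns a single score: negative for class 0 and positive for class 1 (applying a sigmoid would turn it into a value between 0 and 1).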

TEXT VECTORIZATION LAYER

This layer is responsible for:

Standardize the input sentence. By default, it will lower case the sentence and remove punctuation.
Create tokens for each word. By default it will split the sentence by blanks.
Associate an index to each token (word).

Code:

MAX_FEATURES = 5000
SEQUENCE_LENGTH = 260 

vectorize_layer = TextVectorization(
    max_tokens=MAX_FEATURES,
    output_mode='int',
    output_sequence_length=SEQUENCE_LENGTH
)

There are some aspects to highlight:

We have already standardized the input, but if we wanted to use our own standardization function inside the layer, we would have to register it with a decorator (such as tf.keras.utils.register_keras_serializable) to be able to save the model.
We are going to use the default behavior to tokenize the sentence, but we could define our own function for that.
The TextVectorization layer doesn't associate an index to every unique word. It takes the most common ones and creates a vocabulary, and the less common ones will share a default index value. The max_tokens parameter indicates the size of the vocabulary.
The output_mode parameter indicates the kind of output that the layer produces. With 'int', each word is replaced by its integer index.
The output_sequence_length parameter indicates the length of the output of the TextVectorization layer. For example, if we have a sentence with 100 words and we want 80 in the output, this layer will remove the last 20 words. Shorter sentences are padded with zeros up to that length.

Note: we are using 260 as output_sequence_length because, after studying the frequency distribution of the length of the tweets, more than 90% of them have fewer than 260 words.
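
To make the roles of max_tokens and output_sequence_length more concrete, here is a small pure-Python imitation of the idea. It is not the Keras implementation: the function names are hypothetical, and reserving index 0 for padding and index 1 for unknown words is an assumption of this sketch.

```python
from collections import Counter

PAD, OOV = 0, 1  # assumed reserved indices: padding and out-of-vocabulary


def build_vocabulary(sentences, max_tokens):
    """Map the most frequent words to indices 2, 3, ... (0 and 1 are reserved)."""
    counts = Counter(word for s in sentences for word in s.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index for index, word in enumerate(most_common, start=2)}


def vectorize(sentence, vocabulary, sequence_length):
    """Turn a sentence into a fixed-length list of word indices."""
    indices = [vocabulary.get(word, OOV) for word in sentence.split()]
    indices = indices[:sequence_length]  # truncate long inputs
    return indices + [PAD] * (sequence_length - len(indices))  # pad short ones


tweets = ['vote for biden', 'vote for trump', 'vote early']
vocab = build_vocabulary(tweets, max_tokens=4)  # room for only 2 real words
print(vocab)
print(vectorize('vote for change now please', vocab, sequence_length=4))
print(vectorize('vote', vocab, sequence_length=4))
```

Words outside the vocabulary all share the same index, long sentences lose their last words, and short ones are filled with zeros.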

We have mentioned that the TextVectorization layer creates a vocabulary with the most common words. To get that, we have to call a method called adapt and introduce all the tweets that we are going to use for training.

Code:

train_text = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)

After that, we can apply the TextVectorization layer to our tf.data.Datasets.

Code:

def vectorize_text(text, label):
    text = tf.expand_dims(text, -1)
    return vectorize_layer(text), label
    

train_ds = train_ds.map(vectorize_text)
val_ds = val_ds.map(vectorize_text)
test_ds = test_ds.map(vectorize_text)

CONFIGURE DATASETS FOR PERFORMANCE

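We can configure our datasets to keep the data in cache after the first epoch and to prefetch the next batches while the model is training, which reduces the time spent on I/O operations, using the next code.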

Code:

AUTOTUNE = tf.data.experimental.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

AVOIDING OVERFITTING

Deep Learning models tend to overfit really fast. To avoid overfitting, we will use two methods:

Add dropout layers. No neuron in a Neural Network should be more important than the others. This means that if we remove a neuron during training, the NN should still work. A dropout layer randomly removes neuron outputs during training, which makes training harder and keeps any single neuron from becoming more important than the others.
Use regularizers. No connection in a Neural Network should be more important than the others. This means that there shouldn't be extremely high weights. To avoid that, a regularizer penalizes high weight values.
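
Both ideas fit in a few lines of plain Python. This is a simplified sketch with made-up numbers, not the Keras code: dropout here rescales the surviving values by 1 / (1 - rate) so the expected sum stays the same, and the L2 penalty is the regularization factor times the sum of the squared weights.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate` and rescale the survivors."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


def l2_penalty(weights, factor):
    """Extra loss term that grows with the squared size of the weights."""
    return factor * sum(w * w for w in weights)


rng = random.Random(0)  # seeded so the sketch is reproducible
print(dropout([1.0, 1.0, 1.0, 1.0, 1.0], rate=0.2, rng=rng))
print(l2_penalty([0.5, -2.0, 3.0], factor=0.0001))
```

The penalty is added to the loss, so a single very large weight costs much more than several small ones.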

For more information about overfitting check the Tensorflow tutorial related to this topic in this link.

MODEL CONFIGURATION

To summarize, our model will have these layers:

An embedding layer.
A dropout layer with a 20% probability of dropping.
A bidirectional RNN layer with an L2 regularizer.
A regular NN layer with an L2 regularizer.
A dropout layer with a 20% probability of dropping.
A single neuron layer.

I have trained the same model with three different sets of parameters:

Big model:

MAX_FEATURES: 10000
Neurons in the bidirectional RNN layer and regular NN layer: 16.

Middle model:

MAX_FEATURES: 10000
Neurons in the bidirectional RNN layer and regular NN layer: 8.

Little model:

MAX_FEATURES: 5000
Neurons in the bidirectional RNN layer and regular NN layer: 8.

All of them have similar performance. The big model configuration is shown below.

Code:

EMBEDDING_DIM = 16

model = tf.keras.Sequential([
    
    layers.Embedding(
        input_dim=MAX_FEATURES + 1,
        output_dim=EMBEDDING_DIM,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    
    layers.Dropout(0.2),
    
    layers.Bidirectional(
        tf.keras.layers.LSTM(16, 
                             kernel_regularizer=regularizers.l2(0.0001)
                            )
    ),
    
    layers.Dense(16, 
                 kernel_regularizer=regularizers.l2(0.0001),
                 activation='relu'
                ),
    
    layers.Dropout(0.2),
    
    layers.Dense(1)
])

LOSS FUNCTION AND OPTIMIZER

As we are training a Binary Classifier, our loss function will be the Binary Cross Entropy. To optimize the model we are going to use an Adam optimizer (an extension of the stochastic gradient descent) and to measure the performance, we are going to use the Binary Accuracy.

Code:

model.compile(loss=losses.BinaryCrossentropy(from_logits=True),
              optimizer='adam',
              metrics=tf.metrics.BinaryAccuracy(threshold=0.0))
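
To see what from_logits=True and threshold=0.0 mean, here is a small pure-Python sketch of both quantities for raw scores (logits). It is an illustration under simplifying assumptions, not the Keras implementation, and the logits and labels are made up.

```python
import math


def bce_from_logit(logit, label):
    """Binary cross-entropy for one example, computed directly from a logit.

    Uses the numerically stable form max(z, 0) - z*y + log(1 + exp(-|z|)).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def binary_accuracy(logits, labels, threshold=0.0):
    """Fraction of examples whose thresholded logit matches the label."""
    hits = sum(int(z > threshold) == y for z, y in zip(logits, labels))
    return hits / len(labels)


logits = [2.0, -1.5, 0.3, -0.2]  # made-up raw model outputs
labels = [1, 0, 0, 1]
print([round(bce_from_logit(z, y), 4) for z, y in zip(logits, labels)])
print(binary_accuracy(logits, labels))
```

A logit of 0 corresponds to a probability of 0.5, which is why the accuracy threshold is 0.0 when the model outputs logits.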

TRAINING THE MODEL

Finally, we only have to train the model by calling the fit method.

Code:

epochs = 25
history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=epochs)

This process took me more than 12 hours, but I'm not using GPU acceleration, so it could take less on your computer if you use it.

The training history can be plotted using the next code:

Code:

h = history.history

loss = h.get('loss')
binary_accuracy = h.get('binary_accuracy')
val_loss = h.get('val_loss')
val_binary_accuracy = h.get('val_binary_accuracy')
epochs = range(1, len(loss) + 1)


plt.figure(figsize=(12,6))

plt.plot(epochs, binary_accuracy, 'g', label='binary_accuracy')
plt.plot(epochs, val_binary_accuracy, '--g', label='val_binary_accuracy')

plt.title('Training and validation binary_accuracy')
plt.xlabel('Epochs')
plt.ylabel('Binary accuracy')
plt.legend()

We can see that after the first epoch, our model has learned the training data really well and quickly reaches its highest accuracy, around 88%.

The validation accuracy varies a lot at the beginning of the training, but after 15 epochs it stabilizes between 80% and 83%.

Finally, after evaluating the trained model with the validation and test data, the model properly classifies 81% of the tweets.

DEPLOY THE MODEL

To deploy the model and use it in a real application, it would be convenient to include the TextVectorization layer in the model.

Currently, we apply the TextVectorization layer to the data before feeding the model, but we would like to pass strings directly and forget that there is a TextVectorization step before.

So, we can add this layer to the model simply as shown below:

Code:

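deployable_model = tf.keras.Sequential([
    vectorize_layer,
    # new_model is assumed to be the trained classifier from the previous sections
    new_model,
])

deployable_model.compile(
    # The classifier ends in Dense(1) without activation, so it outputs logits
    loss=losses.BinaryCrossentropy(from_logits=True), 
    optimizer="adam", 
    metrics=[tf.metrics.BinaryAccuracy(threshold=0.0)]
)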

To save this model, we first have to use it to evaluate something, because until then it doesn't know the shape of its input.

First, we have to recreate the input pipeline, without the TextVectorization layer applied.

Code:

train_ds = input_pipeline(train_trump, train_biden, class_atributte=True)
val_ds = input_pipeline(val_trump, val_biden, class_atributte=True)
test_ds = input_pipeline(test_trump, test_biden, class_atributte=True)

Now, we can evaluate with the validation or the test dataset.

Code:

deployable_model.evaluate(val_ds)
deployable_model.evaluate(test_ds)

Finally, we can save the model:

Code:

deployable_model.save('./saved_model/deployable_big_model')

Important: If we try to predict with strings, the model will return negative and positive values. If the prediction is negative, the associated class is 0 (Trump), and if it is positive, the associated class is 1 (Biden).
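
If a probability is easier to read than a raw value, a sigmoid can be applied to the prediction. The following is a minimal sketch in plain Python. The function names and the sample values are made up, and the class mapping follows the one used in this post (0 for Trump, 1 for Biden).

```python
import math


def sigmoid(logit):
    """Squash a raw model output (logit) into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-logit))


def interpret(logit):
    """Return (candidate, probability of class 1) for a raw prediction."""
    candidate = 'Biden' if logit > 0 else 'Trump'  # class 1 = Biden, class 0 = Trump
    return candidate, sigmoid(logit)


for raw in (-2.3, 0.4):  # made-up model outputs
    print(raw, interpret(raw))
```

The further the raw value is from 0, the closer the probability gets to 0 or 1, which can be read as how confident the model is.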

CONCLUSIONS. IS THE MODEL GOOD ENOUGH?

The results indicate that 4 out of 5 tweets are classified properly, which may or may not be enough depending on the case.

We have to take into account that some tweets are really short and most of them have multimedia content associated with a URL, which is not considered and can be decisive to properly understand the purpose of the tweet.

However, if we want better performance, we can combine this technique with others to create a more complex model. This is one of the best points of Tensorflow: it easily lets you add new layers or new models to the current one.

I hope you find this post useful and feel free to share or use this code in your own projects. This work is completely Open Source 😄.

Remember: this project has a GitHub repository and you can download the Jupyter Notebook used as well as the preprocessed datasets and the trained model.

Remember (II): I have also made a web application to use the trained model with Flask and it is available in this GitHub repository. Follow the instructions of the README file to use it.

Remember (III): all the imports are in the GitHub repository.

Keep working, keep studying, keep learning and you will master anything!

Have a nice day!


Raúl Castilla Bravo