Binary Text Classifier For USA Elections Tweets 2020 - DeepLearning with Tensorflow
Hi guys!
In this post, I want to share my project to create a Binary Text Classifier for Tweets using a DeepLearning technique with Tensorflow. The aim of this work is to classify tweets related to the USA elections 2020 to predict if a given tweet is in favor of Trump or Biden.
Important: this project has a GitHub repository and you can download the Jupyter Notebook used as well as the preprocessed datasets and the trained models.
Important (II): I have also made a web application to use the trained model with Flask and it is available in this GitHub repository. Follow the instructions of the README file to use it.
Important (III): all the imports are in the GitHub repository.
This post isn't intended to be a complete Tensorflow tutorial, but it explains in detail the purpose of each function so that you can fully understand the project and how the model has been designed.
WEB APPLICATION VIDEO
TABLE OF CONTENTS
Get data for training the model.
Preprocessing.
Preprocessing. Step 1: paths and constants.
Preprocessing. Step 2: standardization.
Preprocessing. Step 3: filter English tweets.
Preprocessing. Step 4: apply preprocess.
Create an input pipeline.
Why have I done some preprocessing before building an input pipeline?
Building the input pipeline.
Building the model.
Text vectorization layer.
Configure datasets for performance.
Avoiding overfitting.
Model configuration.
Loss function and optimizer.
Training the model.
Deploy the model.
Conclusions. Is the model good enough?
GET DATA FOR TRAINING THE MODEL
First of all, data. We are going to use a Machine Learning model based on Supervised Learning, so we need data to train our model. We can find lots of datasets related not only to this particular problem but also to many others on Kaggle.
Kaggle is the world's largest data science community. There you can find datasets, tutorials, data science competitions, and forums, and it is free to use.
The dataset is in this link. When you enter, click on the download button and extract the zip file.
It contains two CSV files: one with tweets in favor of Trump and the other in favor of Biden.
There is also information about the posted date, the number of likes, retweets, and more, but we only need the text of the tweet.
PREPROCESSING
The preprocessing consists of:
Get only the text of the tweet and ignore other attributes.
Standardize the text, which includes:
Lower case.
Remove line breaks.
Remove URLs.
Remove emojis.
Remove punctuation.
Filter English tweets (tweets in other languages are not going to be considered).
At the end of the preprocessing, we want three datasets: training, validation, and test. Each dataset is formed by two CSV files: one for Trump and one for Biden.
Important:
We are going to treat the datasets as if they were huge, that is, too large to load fully into memory, although in reality they would fit.
The reason is to explain how we could build an input pipeline in Tensorflow for our model. For the same reason, I decided to create six CSV files, to simulate that we extract the information from different sources.
PREPROCESSING. STEP 1: PATHS AND CONSTANTS
We need the path of the two CSV files downloaded and we have to create six new paths to save the output of the preprocessing.
For that purpose, we have created a function called train_test_val_paths which creates the train, test, and validation path of a given path.
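For example, for the path './input/hashtag_donaldtrump.csv', the output will be './input/train_standarized_hashtag_donaldtrump.csv', './input/test_standarized_hashtag_donaldtrump.csv', and './input/val_standarized_hashtag_donaldtrump.csv'.
Code: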
import os

def train_test_val_paths(path):
    # Get directory and filename
    dirname = os.path.dirname(path)
    basename = os.path.basename(path)
    # Create path for each dataset
    train_path = os.path.join(dirname, 'train_standarized_' + basename)
    test_path = os.path.join(dirname, 'test_standarized_' + basename)
    val_path = os.path.join(dirname, 'val_standarized_' + basename)
    return train_path, test_path, val_path
The paths and the constants for the preprocessing will be:
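The original listing with the paths and constants is not reproduced here. As a minimal sketch, the constant names BATCH_SIZE, TEST_SIZE, and VAL_SIZE come from the code later in this post, while their values, the Biden file name, and the names of the path variables are assumptions made only for illustration:

```python
import os


def train_test_val_paths(path):
    # Same helper as in this post, repeated so the sketch is self-contained
    dirname = os.path.dirname(path)
    basename = os.path.basename(path)
    return tuple(
        os.path.join(dirname, prefix + '_standarized_' + basename)
        for prefix in ('train', 'test', 'val'))


# Input files (the Biden file name is an assumption)
TRUMP_PATH = './input/hashtag_donaldtrump.csv'
BIDEN_PATH = './input/hashtag_joebiden.csv'

# Output files: three per candidate, six in total
TRUMP_TRAIN_PATH, TRUMP_TEST_PATH, TRUMP_VAL_PATH = train_test_val_paths(TRUMP_PATH)
BIDEN_TRAIN_PATH, BIDEN_TEST_PATH, BIDEN_VAL_PATH = train_test_val_paths(BIDEN_PATH)

# Hypothetical values, not necessarily the ones used in the repository
BATCH_SIZE = 32   # rows per chunk / batch
TEST_SIZE = 0.1   # probability of sending a chunk to the test file
VAL_SIZE = 0.1    # probability of sending a chunk to the validation file
```

Whatever values are chosen, TEST_SIZE and VAL_SIZE have to add up to less than 1 so that some probability is left for the training file.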
PREPROCESSING. STEP 2: STANDARDIZATION
Remember that the standardization function must be able to:
Lower case the text.
Remove line breaks.
Remove URLs.
Remove emojis.
Remove punctuation.
For this purpose, Tensorflow has some functions that let you replace text that matches a regular expression or pattern. This is especially useful for URLs, emojis, and punctuation.
The regular expression for URLs is really simple and the regular expression for punctuation is already built into Tensorflow.
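However, for the emojis, we have to install the emoji package in our environment using pip install emoji. After that, we can get the pattern that matches any emoji in the text by calling emoji.get_emoji_regexp().pattern. Note that this function only exists in emoji versions before 2.0; it was removed in later releases, so an older version has to be installed for this call to work.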
The standardization function uses only Tensorflow built-in functions, which are optimized to work with batches.
The input of the function must be tensors with text and only text.
The patterns in the emoji and punctuation steps are wrapped in '[%s]'. This is a format string: the text inserted in place of %s ends up between square brackets, which forms a regular expression character class that matches any single character listed inside. For more information, check the official documentation of re, the library used for regular expressions.
The output will be a NumPy array of strings because, after standardizing the text, we are going to apply a Python function to filter English tweets. So, we have to:
Get the NumPy array of the tensor.
Decode the content to get Python strings, as Tensorflow stores text as byte strings.
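The standardization steps can also be sketched in plain Python with the re module. This is not the Tensorflow function used in the project, only an illustration of the same five steps. The URL pattern and the emoji ranges below are rough assumptions (the project takes its emoji pattern from the emoji package), and the function name standardize is hypothetical:

```python
import re
import string

# Simple URL pattern (assumption: links start with http(s):// or www.)
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")

# Rough emoji ranges, an approximation and not the emoji package pattern
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U0001F1E6-\U0001F1FF\u2600-\u27BF\uFE0F\u200D]")

# '[%s]' builds a character class from the escaped punctuation characters
PUNCT_PATTERN = re.compile("[%s]" % re.escape(string.punctuation))


def standardize(text):
    text = text.lower()                                 # lower case
    text = text.replace("\r", " ").replace("\n", " ")   # remove line breaks
    text = URL_PATTERN.sub(" ", text)                   # remove URLs
    text = EMOJI_PATTERN.sub(" ", text)                 # remove emojis
    text = PUNCT_PATTERN.sub("", text)                  # remove punctuation
    return " ".join(text.split())                       # collapse extra blanks


example = standardize("Vote NOW!\nhttps://t.co/abc 😀 #Election2020")
```

The URLs are removed before the punctuation on purpose: once the dots and slashes are gone, a URL can no longer be recognized as such.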
PREPROCESSING. STEP 3: FILTER ENGLISH TWEETS
In the dataset, there are tweets written in other languages, which would be noisy data for our model. We are going to use the Natural Language Toolkit library (nltk) to filter them out.
Code:
from nltk import wordpunct_tokenize
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once

def is_english(text):
    languages_ratios = {}
    tokens = wordpunct_tokenize(text)
    words = [word.lower() for word in tokens]
    words_set = set(words)
    # Compute per language included in nltk number of unique stopwords appearing in analyzed text
    for language in stopwords.fileids():
        stopwords_set = set(stopwords.words(language))
        common_elements = words_set.intersection(stopwords_set)
        languages_ratios[language] = len(common_elements)  # language "score"
    most_rated_language = max(languages_ratios, key=languages_ratios.get)
    return most_rated_language == 'english'
The original function is from a user of Kaggle that applied it in one of their notebooks. The notebook is in this link.
This function receives a string and returns true if the string is a sentence written in English. It's important to do this step after the standardization, because emojis and URLs would otherwise be counted as words and could confuse the function.
PREPROCESSING. STEP 4: APPLY PREPROCESS
As I mentioned previously, we are going to consider that our datasets are huge and we can't load the full dataset in memory. We are going to use chunks, also called batches.
A chunk is a small portion of the dataset that can be loaded in memory.
We are going to load chunks, apply the preprocessing to each one, and append each chunk to a randomly chosen file: the train, validation, or test CSV file.
Code:
import numpy as np
import pandas as pd

def get_tweets_split(path):
    train_path, test_path, val_path = train_test_val_paths(path)
    # Create an empty csv file to append chunks
    header_df = pd.DataFrame(columns=['tweet'])
    header_df.to_csv(train_path, index=False)
    header_df.to_csv(test_path, index=False)
    header_df.to_csv(val_path, index=False)
    # Define the probabilities to select each path
    train_size = 1 - TEST_SIZE - VAL_SIZE
    probabilities = [train_size, TEST_SIZE, VAL_SIZE]
    paths = [train_path, test_path, val_path]
    # Create the DataFrame Reader
    df = pd.read_csv(path, lineterminator='\n', chunksize=BATCH_SIZE, usecols=['tweet'])
    # Split the dataset
    for chunk in df:
        # Standardize tweets
        chunk['tweet'] = chunk['tweet'].apply(custom_standardization)
        # Remove non-English tweets
        chunk = chunk[chunk['tweet'].map(is_english)]
        # Select an output path (a new name, so the input path is not shadowed)
        out_path = np.random.choice(paths, p=probabilities)
        # Save
        chunk.to_csv(out_path, index=False, mode='a', header=False)
This function processes one CSV file and we have two, so we have to call it twice: once with the path of the Trump file and once with the path of the Biden file.
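To see what the chunk-level split does without pandas, here is a minimal plain-Python sketch of the same idea. The function names iter_chunks and split_chunks are hypothetical, and the rows are kept in memory only so that the example is easy to run:

```python
import random
from itertools import islice


def iter_chunks(rows, chunk_size):
    # Yield consecutive chunks of at most chunk_size rows
    it = iter(rows)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def split_chunks(rows, chunk_size, test_size, val_size, seed=0):
    # Send every chunk, as a whole, to one randomly chosen split
    rng = random.Random(seed)
    names = ['train', 'test', 'val']
    weights = [1 - test_size - val_size, test_size, val_size]
    splits = {name: [] for name in names}
    for chunk in iter_chunks(rows, chunk_size):
        target = rng.choices(names, weights=weights)[0]
        splits[target].extend(chunk)
    return splits


splits = split_chunks(list(range(1000)), chunk_size=10,
                      test_size=0.1, val_size=0.1, seed=42)
```

Because the random choice is made per chunk and not per row, the final proportions are only approximately 80/10/10, and they get closer to the target as the number of chunks grows.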
WHY HAVE I DONE SOME PREPROCESSING BEFORE BUILDING AN INPUT PIPELINE?
In Machine Learning it is very common to have huge amounts of data that must be preprocessed and fed into a model. That's why Tensorflow provides lots of functions to build an input pipeline.
An input pipeline is an abstract interface that describes how the input must be processed.
For example, in our case, we can define an input pipeline that loads batches of tweets (remember that batches and chunks are portions of the dataset), applies some functions to these batches, like standardizing and filtering by language, and then feeds the model with the preprocessed batch.
For this project, we have done some preprocessing before creating the input pipeline: we have already standardized the tweets and filtered them by language.
The reason is that the function that filters by language is a Python function that is not optimized to work with batches. It processes tweets one by one, which considerably slows down training.
Besides, preprocessing that lives inside the pipeline is not done only during the first epoch; every following epoch has to repeat it, so training stays slow.
In fact, after measuring the effect of doing the preprocessing first, I found that the model trained about 100 times faster (although the one-off preprocessing took two hours on my computer).
So, is it useful to build an input pipeline? The answer is definitely yes. We have been able to do this part of the preprocessing first because the dataset is small. Imagine that, instead, we were working with 100 TB or more. It would be a bad choice to write that amount of data to disk a second time.
In real cases, we would have to implement an optimized function that works with batches using Tensorflow operations to filter by language, but this is not covered in this post.
We still consider that the datasets are huge and that we can't load the full dataset in memory. This is just an exception to avoid training for a week on my computer 😅.
BUILDING AN INPUT PIPELINE
To build the input pipeline we are going to use the tf.data.Dataset class. A tf.data pipeline is evaluated lazily, which means that we can create Datasets and apply operations to them, but the actual computation is not done until the data is needed.
For example, we can indicate that we want to load a CSV file in batches of 100 rows, then add a new column to each batch, combine each batch with another batch, shuffle them, and more.
But, these operations of loading, concatenating, combining, and shuffling are not going to be done until you actually need the data, for example, to print it on a screen or to train a Machine Learning model.
So, we can define all the operations that are needed before feeding the model with the data, and then, when we want to actually train, these operations will be executed. That's an input pipeline.
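Python generators behave in a similar "define now, compute later" way, so they are a convenient way to illustrate the idea. The sketch below is an analogy and not Tensorflow code; the function names and the log list are hypothetical helpers used to show when the work actually happens:

```python
def load_batches(rows, batch_size, log):
    """Yield successive batches and record every load in log."""
    for start in range(0, len(rows), batch_size):
        end = min(start + batch_size, len(rows)) - 1
        log.append('load rows %d-%d' % (start, end))
        yield rows[start:start + batch_size]


def lowercase(batches):
    # A preprocessing step applied batch by batch
    for batch in batches:
        yield [text.lower() for text in batch]


log = []
tweets = ['Vote!', 'FOUR more years', 'Go Joe', 'MAGA']

# Defining the pipeline does not run anything yet
pipeline = lowercase(load_batches(tweets, batch_size=2, log=log))
steps_before = list(log)   # still empty

# Asking for data triggers only the work needed for one batch
first_batch = next(pipeline)
steps_after = list(log)    # exactly one load has happened
```

Only the first two rows are loaded when the first batch is requested; the rest of the file is untouched until the consumer asks for more.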
First, we have to create a tf.data.Dataset from a CSV file. Tensorflow has a specific function for that called make_csv_dataset.
Code:
import tensorflow as tf

def create_dataset(path):
    # Create the dataset
    ds = tf.data.experimental.make_csv_dataset(
        path,
        batch_size=BATCH_SIZE,
        select_columns=['tweet'],
        num_epochs=1)
    # Get the data in the tweet column
    ds = ds.map(lambda x: x['tweet'])
    return ds
Important: the parameter num_epochs controls the number of repetitions of our dataset. For example, if we have a CSV file with 10 rows and we set the num_epochs to 2, the resultant tf.data.Dataset will have 20 rows. If we don't specify this parameter, the resultant tf.data.Dataset will have an infinite number of repetitions of the original CSV file and we would never finish one epoch during training.
Second, we have to add the class attribute that differentiates tweets between Trump and Biden. Class 0 will be Trump and class 1 Biden.
Remember: we are working with batches and we have to add one label for each tweet.
Finally, we are going to create a new tf.data.Dataset combining tweets from Trump and Biden.
For that, Tensorflow has a function called sample_from_datasets, which takes elements from the given datasets randomly and combines them into a single tf.data.Dataset.
Code:
def input_pipeline(trump_path, biden_path, class_atributte=True):
    trump_ds = create_dataset(trump_path)
    biden_ds = create_dataset(biden_path)
    # Add class atributte
    if class_atributte:
        trump_ds = add_class_column(trump_ds, 0)
        biden_ds = add_class_column(biden_ds, 1)
    datasets = [trump_ds.unbatch(), biden_ds.unbatch()]
    # Equally merge datasets
    trump_biden_ds = tf.data.experimental.sample_from_datasets(
        datasets=datasets, weights=[0.5, 0.5])
    # Batch the dataset
    trump_biden_ds = trump_biden_ds.batch(BATCH_SIZE)
    return trump_biden_ds
There are some important aspects to highlight:
Before calling the sample_from_datasets function, we have to unbatch the batches to get individual tweets. If we didn't do that, we would be combining batches, not individual tweets.
The parameter weights indicates the probability of selecting one element from each dataset.
After combining the datasets, we have to batch again. Remember that Tensorflow is optimized to work with batches.
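The label, unbatch, sample, and batch sequence can be imitated in plain Python to make the data flow visible. This is only a sketch: the name add_class_column comes from the code above, but its body here is a guess at what it does, sample_from is a hypothetical stand-in for sample_from_datasets, and the tiny BATCH_SIZE is chosen for readability:

```python
import random

BATCH_SIZE = 2  # hypothetical value for the sketch


def add_class_column(batches, label):
    # One label per tweet in every batch
    for batch in batches:
        yield batch, [label] * len(batch)


def unbatch(labelled_batches):
    # Turn batches back into individual (tweet, label) pairs
    for tweets, labels in labelled_batches:
        yield from zip(tweets, labels)


def sample_from(streams, weights, rng):
    # Take the next element from a randomly chosen stream
    streams = [iter(s) for s in streams]
    weights = list(weights)
    while streams:
        i = rng.choices(range(len(streams)), weights=weights)[0]
        try:
            yield next(streams[i])
        except StopIteration:
            # Drop an exhausted stream and keep sampling from the rest
            del streams[i], weights[i]


def batch(examples, batch_size):
    # Group individual examples into batches again
    current = []
    for example in examples:
        current.append(example)
        if len(current) == batch_size:
            yield current
            current = []
    if current:
        yield current


trump_batches = [['t1', 't2'], ['t3']]
biden_batches = [['b1', 'b2'], ['b3']]
mixed = sample_from(
    [unbatch(add_class_column(trump_batches, 0)),
     unbatch(add_class_column(biden_batches, 1))],
    weights=[0.5, 0.5], rng=random.Random(0))
mixed_batches = list(batch(mixed, BATCH_SIZE))
```

Each resulting batch can contain tweets from both candidates, which is the reason for unbatching before sampling.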
The full process is represented in the next image:
We have to call the input_pipeline function for the train, validation, and test dataset.
BUILDING THE MODEL
The DeepLearning model that we are going to use is a Bidirectional Recurrent Neural Network or Bidirectional RNN. An RNN is a Neural Network (NN) in which a neuron's output feeds not only the next layer but also the same layer at the next position of the sequence (the "consecutive neuron" when the network is drawn unrolled over the words).
This is useful for text analysis because, at each word, the network has information about the previous words.
A Bidirectional RNN uses the same logic, but the sequence is processed both forward and backward, which means that at each word the network knows the context of the whole sentence, as it has information about the previous and the next words.
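The difference between the two directions can be shown with a toy recurrence in plain Python. This is a deliberately simplified sketch: it uses a single scalar state with fixed, made-up weights instead of the LSTM cells used in the model, and the function names are hypothetical:

```python
import math


def rnn_pass(inputs, w_in=0.5, w_rec=0.8):
    # h_t = tanh(w_in * x_t + w_rec * h_(t-1)), starting from h = 0
    state, states = 0.0, []
    for x in inputs:
        state = math.tanh(w_in * x + w_rec * state)
        states.append(state)
    return states


def bidirectional_rnn(inputs):
    forward = rnn_pass(inputs)
    # Run the same recurrence from the last word to the first one
    backward = rnn_pass(inputs[::-1])[::-1]
    # One (forward, backward) pair per position in the sequence
    return list(zip(forward, backward))


a = bidirectional_rnn([1.0, 2.0, 3.0])
b = bidirectional_rnn([1.0, 2.0, -3.0])  # only the last "word" differs
```

Changing the last input leaves the forward state of the first position untouched but changes its backward state, which is how the bidirectional layer gives every position access to the words that come after it.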
The complete diagram of the model that we are going to use is shown below. The model performs these steps:
Associate an index to each word of the sentence (TextVectorization).
Associate a vector to each word-index (Embedding). Vector values will change during training and words with similar meaning will end up having similar vectors associated.
Feed a Bidirectional RNN with these vectors.
Important: we add here a regular NN layer of 16 neurons (this is not shown in the diagram).
Feed a single neuron with the output of the previous layer and this neuron will return a value between 0 and 1.
TEXT VECTORIZATION LAYER
This layer is responsible for:
Standardizing the input sentence. By default, it will lower case the sentence and remove punctuation.
Creating tokens for each word. By default, it will split the sentence by blanks.
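We have already standardized the input, but if we wanted to use our own standardization function inside this layer, we would have to add a decorator to that function (Keras provides tf.keras.utils.register_keras_serializable for this kind of custom object) to be able to save the model.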
We are going to use the default behavior to tokenize the sentence, but we could define our own function for that.
The TextVectorization layer doesn't associate an index to every unique word. It takes the most common ones and creates a vocabulary and those less common will have a default index value. The max_tokens parameter indicates the size of the vocabulary.
The output_mode parameter selects how the layer encodes each sentence; in 'int' mode, for example, the output is one integer index per word.
The output_sequence_length parameter indicates the length of the output of the TextVectorization layer. For example, if we have a sentence with 100 words and we want 80 in the output, this layer will remove the last 20 words. Sentences shorter than the chosen length are padded with zeros.
Note: we are using 260 as output_sequence_length because, after studying the frequency distribution of the length of the tweets, more than 90% of them have fewer than 260 words.
We have mentioned that the TextVectorization layer creates a vocabulary with the most common words. To get that, we have to call a method called adapt and introduce all the tweets that we are going to use for training.
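What the layer computes can be approximated in a few lines of plain Python. The sketch below is an illustration and not the Keras implementation: the function names are hypothetical, and it follows the convention of reserving index 0 for padding and index 1 for words outside the vocabulary:

```python
from collections import Counter


def adapt(texts, max_tokens):
    # Keep the most frequent words; two indices are reserved
    # (0 for padding, 1 for out-of-vocabulary words)
    counts = Counter(word for text in texts for word in text.split())
    most_common = [word for word, _ in counts.most_common(max_tokens - 2)]
    return {word: index + 2 for index, word in enumerate(most_common)}


def vectorize(text, vocab, output_sequence_length):
    # Map words to indices, truncate long sentences, pad short ones with 0
    ids = [vocab.get(word, 1) for word in text.split()]
    ids = ids[:output_sequence_length]
    return ids + [0] * (output_sequence_length - len(ids))


vocab = adapt(['biden wins', 'trump wins', 'wins'], max_tokens=3)
padded = vectorize('trump wins again', vocab, output_sequence_length=5)
truncated = vectorize('trump wins again', vocab, output_sequence_length=2)
```

With such a small vocabulary only the most frequent word gets its own index, and every other word falls back to the default index, which is the behavior described above for the less common words.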
AVOIDING OVERFITTING
DeepLearning models tend to overfit really fast. To avoid overfitting, we will use two methods:
Add dropout layers. No neuron in a Neural Network should be more important than the others. This means that if we remove a neuron during training, the NN should still work. A dropout layer removes neuron outputs randomly to increase the difficulty during training and to ensure that no neuron becomes more important than the others.
Use regularizers. No connection in a Neural Network should be more important than the others. This means that there shouldn't be extremely high weights. To avoid that, a regularizer penalizes high weight values.
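Both ideas fit in a few lines of plain Python. This is a sketch of the general techniques and not the Keras layers themselves; the function names are hypothetical, and the dropout shown uses the common "inverted" convention of scaling the surviving values:

```python
import random


def dropout(values, rate, rng):
    # Zero each value with probability rate and scale the survivors,
    # so the expected sum of the outputs stays the same
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def l2_penalty(weights, factor):
    # Extra loss term that grows with the square of each weight
    return factor * sum(w * w for w in weights)


rng = random.Random(0)
dropped = dropout([1.0] * 1000, rate=0.2, rng=rng)
penalty = l2_penalty([1.0, -2.0], factor=0.0001)
```

With a rate of 0.2, roughly one output in five is set to zero at every training step, and the L2 term adds a small cost that grows quickly for large weights, so the optimizer is pushed towards many moderate weights instead of a few dominant ones.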
For more information about overfitting check the Tensorflow tutorial related to this topic in this link.
MODEL CONFIGURATION
To summarize, our model will have these layers:
An embedding layer.
A dropout layer with a 20% probability of dropping.
A bidirectional RNN layer with an L2 regularizer.
A regular NN layer with an L2 regularizer.
A dropout layer with a 20% probability of dropping.
A single neuron layer.
I have trained the same model with three different sets of parameters:
Big model:
MAX_FEATURES: 10000
Neurons in the bidirectional RNN layer and regular NN layer: 16.
Middle model:
MAX_FEATURES: 10000
Neurons in the bidirectional RNN layer and regular NN layer: 8.
Little model:
MAX_FEATURES: 5000
Neurons in the bidirectional RNN layer and regular NN layer: 8.
All of them performed similarly. The big model configuration is shown below.
Code:
import tensorflow as tf
from tensorflow.keras import layers, regularizers

EMBEDDING_DIM = 16

model = tf.keras.Sequential([
    layers.Embedding(
        input_dim=MAX_FEATURES + 1,
        output_dim=EMBEDDING_DIM,
        # Use masking to handle the variable sequence lengths
        mask_zero=True),
    layers.Dropout(0.2),
    layers.Bidirectional(tf.keras.layers.LSTM(
        16, kernel_regularizer=regularizers.l2(0.0001))),
    layers.Dense(16,
                 kernel_regularizer=regularizers.l2(0.0001),
                 activation='relu'),
    layers.Dropout(0.2),
    # Single output neuron without activation: it returns a raw score
    layers.Dense(1)])
LOSS FUNCTION AND OPTIMIZER
As we are training a Binary Classifier, our loss function will be the Binary Cross Entropy. To optimize the model we are going to use an Adam optimizer (an extension of the stochastic gradient descent) and to measure the performance, we are going to use the Binary Accuracy.
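The loss and the metric can be written out in plain Python to see what they measure. This is a sketch under one assumption: since the last layer of the model has no activation, the loss is taken to be computed directly on the raw scores (logits). The function names are hypothetical:

```python
import math


def bce_from_logit(logit, label):
    # Numerically stable form of -[y*log(p) + (1-y)*log(1-p)],
    # where p = sigmoid(logit)
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def binary_accuracy(logits, labels):
    # A positive logit is read as class 1, anything else as class 0
    hits = sum((z > 0) == (y == 1) for z, y in zip(logits, labels))
    return hits / len(labels)


loss_undecided = bce_from_logit(0.0, 1)    # model has no opinion
loss_right = bce_from_logit(5.0, 1)        # confident and correct
loss_wrong = bce_from_logit(-5.0, 1)       # confident and wrong
accuracy = binary_accuracy([2.0, -1.0, 0.5], [1, 0, 0])
```

The loss is small for confident correct predictions and large for confident wrong ones, while the accuracy only counts on which side of zero each prediction falls.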
TRAINING THE MODEL
Training took me more than 12 hours, but I was not using GPU acceleration, so it could take less time on your computer if you use it.
The training history can be plotted with the following code:
Code:
import matplotlib.pyplot as plt

h = history.history
loss = h.get('loss')
binary_accuracy = h.get('binary_accuracy')
val_loss = h.get('val_loss')
val_binary_accuracy = h.get('val_binary_accuracy')
epochs = range(1, len(loss) + 1)
plt.figure(figsize=(12, 6))
plt.plot(epochs, binary_accuracy, 'g', label='binary_accuracy')
plt.plot(epochs, val_binary_accuracy, '--g', label='val_binary_accuracy')
plt.title('Training and validation binary_accuracy')
plt.xlabel('Epochs')
plt.ylabel('Binary accuracy')
plt.legend()
We can see that after the first epoch, our model has learned the training data really well and quickly reaches its highest accuracy, around 88%.
The validation accuracy varies a lot at the beginning of the training, but after 15 epochs it stabilizes between 80% and 83%.
Finally, after evaluating the trained model on the validation and test data, the model classifies 81% of the tweets correctly.
DEPLOY THE MODEL
To deploy the model and use it in a real application, it would be convenient to add the TextVectorization layer in the model.
Currently, we apply the TextVectorization layer to the data before feeding the model, but we would like to introduce strings directly and forget that there is a TextVectorization step before.
So, we can simply add this layer in front of the trained model:
Important: when we predict with strings, the model returns negative and positive values, because the last layer has no activation function. If the prediction is negative, the associated class is 0 (Trump), and if it is positive, the associated class is 1 (Biden).
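If a value between 0 and 1 is easier to read than a raw score, the sigmoid function converts one into the other. A minimal sketch in plain Python, where the function names and the CLASS_NAMES mapping are hypothetical helpers and the class numbers follow the labels used in this post:

```python
import math

CLASS_NAMES = {0: 'Trump', 1: 'Biden'}


def sigmoid(logit):
    # Squash a raw score into the (0, 1) interval
    return 1.0 / (1.0 + math.exp(-logit))


def interpret(logit):
    # Negative score -> class 0, positive score -> class 1
    label = 1 if logit > 0 else 0
    return CLASS_NAMES[label], sigmoid(logit)


name_a, prob_a = interpret(-1.3)
name_b, prob_b = interpret(2.1)
```

A score of exactly 0 corresponds to a value of 0.5, so checking the sign of the prediction is the same as checking whether the sigmoid output is above or below one half.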
CONCLUSIONS. IS THE MODEL GOOD ENOUGH?
The results indicate that 4 out of 5 tweets are classified properly, which can be enough or not depending on the case.
We have to take into account that some tweets are really short and most of them have multimedia content associated with a URL, which is not considered and can be decisive to properly understand the purpose of the tweet.
However, if we want better performance, we can combine this technique with others to create a more complex model. This is one of the best points of Tensorflow: it easily lets you add new layers or new models to the current one.
I hope you find this post useful and feel free to share or use this code in your own projects. This work is completely Open Source 😄.
Remember: this project has a GitHub repository and you can download the Jupyter Notebook used as well as the preprocessed datasets and the trained model.
Remember (II): I have also made a web application to use the trained model with Flask and it is available in this GitHub repository. Follow the instructions of the README file to use it.
Remember (III): all the imports are in the GitHub repository.
Keep working, keep studying, keep learning and you will master anything!