Week 8: Training & Testing a Humor Detection Machine Learning Model

May 25, 2020

Intro

Hi everyone! Welcome back to my blog!

This week has been quite eventful. I’ve been able to train a machine learning model using a Random Forest Classifier to classify humor!

In this blog post, I will talk about the following:

  1. Preprocessing
  2. Why Random Forest?
  3. How Does Random Forest Work?
  4. Challenges
  5. Training Random Forest
  6. Test Results
  7. Next Steps

 

Preprocessing

There are numerous preprocessing steps that I had to perform before I could train my machine learning humor detection model. They are listed below:

1. Noise Removal: The line of code "sentence = re.sub('[^a-zA-Z]', ' ', word)" keeps only alphabetic characters and strips out everything else, including leftover markup and escape characters (\n, \t, etc.), extra whitespace, digits, punctuation, and special characters. This helps clean up the text a lot!

2. Lowercasing: The line of code "sentence = sentence.lower()" transforms all uppercase letters to lowercase.

3. Tokenization: This transforms a sentence into a list of words.

4. Removing Stop Words: Stop words are common words that appear too often to help in text classification (low predictive power). Examples include "I," "he," "to," "a," etc. Removing stop words is essential: if they are not removed, the machine learning model may treat them as if they had high predictive power and erroneously use them to decide whether a sentence is funny or not, when in reality stop words occur just as often in funny sentences as in unfunny ones.

5. Stemming or Lemmatizing: Both attempt to transform words to their root form.

Stemming: the more aggressive of the two, stemming chops off prefixes and suffixes, which may produce non-English words and cause words to lose their meaning. Ex: studied/studies -> studi, eating/eats -> eat, caring -> car

Lemmatizing: the more computationally expensive of the two, lemmatizing transforms versions of a word (ex. studies/studied/study/studying) to one base (ex. study). Unlike stemming, the base is always an English word. Lemmatizing requires part-of-speech tagging in order to be accurate.

Although lemmatizing seems much better than stemming, its performance advantage has generally been found to be small. Going in, I expected my model to perform slightly better with lemmatizing than with stemming, since stemming is a cruder method with more room for error.

6. Text Vectorization: using TF-IDF (Term Frequency – Inverse Document Frequency) – if you need a refresher on TF-IDF, I recommend you check out my previous blog!

After preprocessing, here is what the first five sentences in my humor dataset look like:

As you can see, each word in each sentence has been lowercased, stop words have been removed, each word has been lemmatized, and last but not least, each sentence has been transformed into a 1500-dimensional vector by TF-IDF.
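To tie all six steps together, here is a minimal sketch of the kind of preprocessing pipeline described above, using NLTK and scikit-learn (the exact code in my project differs a bit):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords")
nltk.download("wordnet")
nltk.download("omw-1.4")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(sentence):
    sentence = re.sub("[^a-zA-Z]", " ", sentence)    # 1. noise removal: keep letters only
    sentence = sentence.lower()                      # 2. lowercasing
    words = sentence.split()                         # 3. tokenization
    words = [lemmatizer.lemmatize(w)                 # 5. lemmatization
             for w in words if w not in stop_words]  # 4. stop word removal
    return " ".join(words)

sentences = ["Why did the chicken cross the road? To get to the other side!"]
cleaned = [preprocess(s) for s in sentences]
print(cleaned)

# 6. Text vectorization: TF-IDF turns each cleaned sentence into a 1500-dimensional vector
vectorizer = TfidfVectorizer(max_features=1500)
vectors = vectorizer.fit_transform(cleaned)
```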

 

Why Random Forest?

There are many machine learning classification algorithms that I could have chosen from. Some include Naive Bayes Classifier, KNN (K Nearest Neighbors), SVM (Support Vector Machines), Logistic Regression, Decision Tree Classifier, and Random Forest Classifier. Each has its own strengths and weaknesses.

Through my experience working with these various classification algorithms, I've found KNN and the Random Forest Classifier to perform extremely well. For spam detection, the performance of 15 machine learning algorithms is shown below, and while each algorithm does quite well on this challenging natural language processing (NLP) classification task, the Random Forest Classifier returns the highest accuracy, precision, F1 score, and kappa [1].

To learn more about the various machine learning algorithms, check out this awesome website: https://towardsdatascience.com/nlp-classification-in-python-pycaret-approach-vs-the-traditional-approach-602d38d29f06 [1].

Not only does the Random Forest Classifier routinely perform exceedingly well in classification tasks, but I had also implemented a Flask app service with a Random Forest Classifier for fruit classification earlier in my project. Hence, there was even more reason to choose the Random Forest Classifier over the others for humor detection!

 

How Does Random Forest Work?

Before I jump into my code and detail my experience training and testing the Random Forest Classifier, I want to briefly explain how random forest works, since I know many of you are likely curious. To understand how random forest works, we must first understand how the decision tree classifier works, as random forest is built on top of decision trees.

Let us consider the task of flower classification. As shown below, you can see part of the Iris dataset. The full dataset contains 150 flowers, 50 of each type (50 Iris Setosa, 50 Iris Versicolour, and 50 Iris Virginica). The independent variables are (1) sepal length in cm, (2) sepal width in cm, (3) petal length in cm, and (4) petal width in cm. The full Iris dataset can be found here: https://archive.ics.uci.edu/ml/datasets/iris.

From these independent variables, the decision tree classifier constructs a tree, and then it uses that tree to predict whether a flower is an Iris Setosa, Iris Versicolour, or Iris Virginica.

But how does the decision tree classifier construct the tree? Without getting too deep into the details: at each parent node, the algorithm searches for the split (a feature and a threshold) that yields the highest information gain. Essentially, the decision tree classifier attempts to distinguish between the different labels (in this case, flowers) using the fewest possible divisions and child nodes.
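To make "information gain" a little more concrete, here is a minimal sketch of how entropy-based information gain can be computed for a candidate split (the numbers below are a toy example I made up, not the actual Iris counts):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(parent, left, right):
    """Entropy of the parent node minus the weighted entropy of its two children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

# Toy example: a split like "petal length <= 2.45 cm" cleanly separates
# Setosa from the other two classes, so the information gain is large.
parent = ["setosa"] * 5 + ["versicolour"] * 5 + ["virginica"] * 5
left = ["setosa"] * 5                            # petal length <= 2.45
right = ["versicolour"] * 5 + ["virginica"] * 5  # petal length > 2.45

print(round(information_gain(parent, left, right), 3))  # ~0.918 bits
```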

Below, the decision tree for flower classification is shown:

This is just one of infinitely many decision trees that could have been created, but it is one of the better ones. The decision tree algorithm is able to figure out that most Iris Setosa flowers have a petal length of less than or equal to 2.45 cm. Through the very first division in the decision tree, most of the Iris Setosa can be recognized and correctly classified, yielding a huge information gain. The second division (petal width <= 1.75 cm) is likewise able to separate, for the most part, the Iris Versicolours (petal width <= 1.75 cm) from the Iris Virginicas (petal width > 1.75 cm).

The Decision Tree Classifier has some drawbacks. Repeatedly calculating information gain (whether with entropy or the Gini index) is computationally expensive. Trees can become extremely complex and overfit the training data, yet a single tree can also oversimplify the intricacies of a classification task. However, the Decision Tree Classifier algorithm is versatile, easy to understand, able to deal with missing data points, and most importantly, able to achieve a pretty amazing accuracy!

To learn more about the decision tree classifier, check out this website: https://www.xoriant.com/blog/product-engineering/decision-trees-machine-learning-algorithm.html.

The Random Forest Classifier algorithm partially fixes the Decision Tree Classifier's potential overfitting issue. Instead of creating just one tree to make predictions with, Random Forest constructs many decision trees during training. It then uses majority voting to determine how something should be classified. For instance, if two trees classify a sentence as funny but one tree classifies the same sentence as not funny, Random Forest would predict that the sentence is funny. Pretty clever!
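To see this in action, here is a minimal scikit-learn sketch on the Iris dataset from above (just an illustration, not my humor pipeline) that trains a random forest and then peeks at the individual tree votes behind one prediction:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load the Iris dataset: 150 flowers, 4 features, 3 classes
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build a forest of 100 decision trees
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Each individual tree casts a vote; the forest's prediction is the majority vote
sample = X_test[:1]
votes = [forest.classes_[int(tree.predict(sample)[0])] for tree in forest.estimators_]
print("First 10 tree votes:", votes[:10])
print("Majority-vote prediction:", forest.predict(sample)[0])
print("Test accuracy:", forest.score(X_test, y_test))
```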

 

Challenges

Training the Random Forest model turned out much more difficult than anticipated. I ran into two major problems:

1. Runtime

Training the Random Forest model took forever. I would start running the Python program at noon one day, and when I woke up the next day, the program would still be running! However, once I thought about what was happening, the extraordinarily long runtime made sense. Each sentence was a 1500-dimensional vector, there were over 400,000 sentences in the humor dataset, and each decision tree likely had hundreds or thousands of nodes!

2. Connection Error

However, the long runtime had a larger impact on my project: it led to a connection error. While attempting to utilize the self-sufficient Flask service I had built earlier in my project to train the random forest model, the POST request I sent with the humor dataset kept timing out and returning a connection error.

I tried to fix this problem in various ways. First, I used time.sleep() and a try/except block in Python to catch the ConnectionError and resend the POST request whenever the connection failed. However, this ended up creating a "while True" (infinite) loop, which was not what I wanted at all.
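For reference, here is roughly the retry pattern I was going for, but with a capped number of attempts so it cannot loop forever (the URL, payload, and timeout below are placeholders, not the exact values from my project):

```python
import time
import requests

MAX_RETRIES = 5  # give up after a fixed number of attempts instead of retrying forever

def post_with_retries(url, payload):
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            response = requests.post(url, json=payload, timeout=600)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as exc:
            print(f"Attempt {attempt} failed: {exc}")
            time.sleep(2 ** attempt)  # back off a little longer after each failure
    raise RuntimeError("POST request failed after all retries")

# Hypothetical usage: send the humor dataset to the Flask training endpoint
# post_with_retries("http://localhost:5000/train", {"sentences": [...]})
```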

Then, I attempted to deploy a Python Flask RESTful API app with Gunicorn, converting my Flask app from a development service into a production service, which required changing much of my code, including my docker-compose file. Gunicorn is able to "communicate with multiple web servers," "react to many web requests at once and distribute the load," and "keep multiple processes of the web application running" [2]. However, even with multiple processes running in parallel, the connection error still occurred.

To learn more about Gunicorn, check out this website: https://vsupalov.com/what-is-gunicorn/ [2].

There were other ways I could have approached fixing the connection error, but my mentor and I decided that they would all take too much time and effort. So, I sadly had to abandon the Flask app I had built earlier in my project and instead implement Random Forest as a rule in my pipeline, independent of Docker and Flask. Although the runtime of the Random Forest rule was still extremely long, there was no more connection error!

 

Training Random Forest

Before training my random forest model, there is one more very important step to note: the train-test split. I could use all the data to train my random forest model, but then I would have no way to determine how well it performs. Furthermore, using 100% of the data for training would likely lead to a model that is overfit to the dataset. Hence, I've divided the dataset randomly into two parts: 80% of the humor dataset is used to train my random forest model, and the other 20% is used to test the model and evaluate its performance (accuracy, recall, precision, etc.).

And here is the code from train_randomforest.py:

Line 46 above performs the train-test split, allocating 20% of the data to testing. Line 49 creates the TfidfVectorizer, which is then used to vectorize the training data in line 51. Finally, hundreds of decision trees are built from the training data in line 55 using the RandomForestClassifier class from scikit-learn.

Then, in random_forest.py, the TF-IDF vectorizer fit on the training data is used to transform the testing data into vectors in line 54. Using “clf.predict(X_test)” in line 56, the Random Forest model trained using the training data is utilized to predict if the other 20% of sentences are funny or not!
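In case the code above is hard to read, here is a rough sketch of what those lines do (the variable names, the number of trees, and the dataset-loading step are placeholders; the scikit-learn calls are the same ones used in my scripts):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Placeholder: load the preprocessed sentences and their funny/not-funny labels
sentences, labels = load_humor_dataset()  # hypothetical helper

# train_randomforest.py: split the data, fit TF-IDF on the training portion, train the forest
X_train, X_test, y_train, y_test = train_test_split(sentences, labels, test_size=0.2)

vectorizer = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7)
X_train_vec = vectorizer.fit_transform(X_train)

clf = RandomForestClassifier(n_estimators=100)  # hundreds of decision trees
clf.fit(X_train_vec, y_train)

# random_forest.py: reuse the fitted vectorizer on the held-out 20%, then predict
X_test_vec = vectorizer.transform(X_test)
predictions = clf.predict(X_test_vec)
```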

 

Test Results

Here are the results:

As you can see, as expected, TF-IDF outperformed the Bag of Words model (CountVectorizer), and the parameters “min_df=5” and “max_df=0.7” also helped improve the model. Quite surprisingly, stemming outperformed lemmatizing, but not by much. Overall, I am quite pleased with the results of the Random Forest Classifier. By using majority voting over hundreds of decision trees based on text vectorized by TF-IDF into 1500-dimensional vectors, it was able to achieve an accuracy of 86%. Pretty fascinating, if you think about it.
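For clarity, these are the kinds of vectorizer configurations being compared above (a sketch, not the exact experiment code):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Bag of Words baseline: raw term counts
bow = CountVectorizer(max_features=1500)

# Plain TF-IDF: term counts reweighted by inverse document frequency
tfidf_plain = TfidfVectorizer(max_features=1500)

# TF-IDF that also filters out very rare and very common terms:
#   min_df=5   -> a word must appear in at least 5 sentences to be kept
#   max_df=0.7 -> a word appearing in more than 70% of sentences is dropped
tfidf_filtered = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7)
```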

 

Next Steps

The next step in my project is to explore fastText, a much more advanced and sophisticated classification approach. fastText was created by Facebook's AI Research lab, and it is a widely used open-source text classification library. Clarabridge even uses it! I look forward to downloading the fastText library, exploring its various use cases, and utilizing it to conduct humor detection!
