Week 9: Training & Testing a Fasttext Humor Detection Model

May 27, 2020

Intro

Hi everyone! Welcome back to my blog!

This week, I’ve been able to train a Fasttext model to classify humor!

In this blog post, I will talk about the following:

  1. The Power of Fasttext
  2. How Does Fasttext Work?
  3. Preprocessing
  4. Training Fasttext
  5. Test Results
  6. Hyperparameter Tuning
  7. Next Steps

 

The Power of Fasttext

Fasttext is an extremely powerful and sophisticated open-source AI library developed by Facebook’s AI Research lab. It has recently become increasingly popular. Fasttext contains pretrained models in 294 languages, many of which are used by Clarabridge, the text analytics company I am performing my senior research project with.

Fasttext has numerous supervised and unsupervised machine learning use cases. I am using it for humor detection, but it can be used for any other text classification task (supervised learning), including sentiment analysis, spam detection, topic modeling, and much more. For each of these supervised learning tasks, the training dataset is labelled (ex. funny v. not funny, happy v. sad, spam v. ham, etc.), and the Fasttext model learns from the data and labels. The trained Fasttext model can then accurately classify new data.

Fasttext also has various fascinating unsupervised machine learning applications. Unsupervised machine learning is when the training dataset contains only data and no labels. One such application is that a trained Fasttext model can recognize similar words. By training on the “first 1 billion bytes of English Wikipedia” [1], the Fasttext model was able to learn, through unsupervised learning, the complex relationships between words. Take for instance the word “throne” – pretty amazingly, Fasttext is able to understand that words like “enthrone,” “abdicate,” and “heir” are related to the word “throne.”

It is also able to differentiate between apples (fruit) and Apple (tech giant)!

Perhaps most impressive of all is Fasttext’s ability to understand analogies. As shown below, Fasttext is able to understand the relationship between Beijing and China (capital – country), as well as queen:woman::king:man. It’s pretty scary how smart it is.

If you want to try it out on your own, follow these tutorials: https://fasttext.cc/docs/en/unsupervised-tutorial.html, https://fasttext.cc/docs/en/supervised-tutorial.html [1].
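If you want to see what that looks like in code first, here is a minimal sketch using the official fasttext Python package (the corpus file name is an assumption, and the exact neighbors and analogy results will depend on the training data):

```python
import fasttext

# Train unsupervised word vectors (skip-gram) on a plain-text corpus,
# e.g. the "first 1 billion bytes of English Wikipedia" mentioned above.
model = fasttext.train_unsupervised("enwik9_cleaned.txt", model="skipgram")

# Words related to "throne" -- returns a list of (similarity, word) pairs.
print(model.get_nearest_neighbors("throne"))

# Analogies are computed as A - B + C:
# beijing - china + france should land near "paris" (capital - country),
# and queen - woman + man should land near "king".
print(model.get_analogies("beijing", "china", "france"))
print(model.get_analogies("queen", "woman", "man"))
```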

 

How Does Fasttext Work?

One of the biggest differences between Fasttext and other text classification libraries is that Fasttext represents each word as a bag of character n-grams. For example, with n = 3, the word “where” is represented by Fasttext as the character n-grams <wh, whe, her, ere, re>, plus the special sequence <where> for the whole word (the angle brackets mark word boundaries). “A vector representation is associated to each character n-gram; words being represented as the sum of these representations” [2].
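To make that concrete, here is a tiny helper of my own (a sketch, not Fasttext’s internal code) that produces the character n-grams for a word:

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, padded with the
    boundary symbols < and > as in the Fasttext paper."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    # Fasttext also keeps the whole word itself as a special sequence.
    return grams + [padded]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```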

This at first seems silly, but in reality this representation is quite powerful. Breaking each word down into n-grams and using the n-grams to train the Fasttext model allows the model to recognize common roots of words. This is especially advantageous for rare words and for words not present in the dataset (out-of-vocabulary words).

For instance, let’s consider the scenario where the word “smartest” never appeared in the dataset, but the words “smart,” “smarter,” “smarts,” and “outsmart” did. Since they all share very similar character n-grams, Fasttext would be able to recognize that the words have similar meanings. And since the words “smart,” “smarter,” “smarts,” and “outsmart” are synonyms of words like “intelligent” and antonyms of words like “dumb,” the word “smartest,” which the model has never seen before, will also be treated as a synonym of “intelligent” and an antonym of “dumb.” Hence, with Fasttext, traditionally essential (and computationally expensive) preprocessing stages like stemming and lemmatizing, which cut off prefixes/suffixes and transform words into their root form respectively, are no longer needed.

But there are even more benefits. Since there are fewer combinations of character n-grams than of word n-grams, the vectorized text is less sparse. Moreover, typos are no longer an issue! Mistakenly spelling Massachusetts as “Masachoosets,” making up words like “fabulouslyfantastic,” or emphasizing how delicious lunch was with “yummm” would confuse GloVe, Word2Vec, the Bag of Words model (and thus TF-IDF), and other models. However, by breaking each word down into character n-grams, the Fasttext model is able to make a pretty good educated guess as to what the word means. Pretty amazing!
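As a rough illustration (reusing the unsupervised model from the earlier sketch), an out-of-vocabulary misspelling still gets a usable vector because it is built from its character n-grams:

```python
import numpy as np

# "masachoosets" never appears in the corpus, but Fasttext still
# composes a vector for it from its character n-grams...
typo_vec = model.get_word_vector("masachoosets")
real_vec = model.get_word_vector("massachusetts")

# ...and that vector should land close to the correctly spelled word.
cosine = np.dot(typo_vec, real_vec) / (
    np.linalg.norm(typo_vec) * np.linalg.norm(real_vec)
)
print(cosine)  # expect a fairly high similarity

# Nearest neighbors of the misspelling should also look reasonable.
print(model.get_nearest_neighbors("masachoosets"))
```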

The fact that Fasttext represents each word as a bag of character n-grams makes it unique and advantageous, but it doesn’t fully explain how Fasttext turns words into vectors. Once each word is transformed into a list of character n-grams, Fasttext uses Word2Vec – the Continuous Bag-of-Words model (CBOW) or the Continuous Skip-gram model – to perform text vectorization. Each of these methods is significantly more complex than the Bag of Words model, TF-IDF, and other methods of text vectorization, as they both use neural networks with multiple layers and activation functions, and are quite difficult to understand.

Word2Vec, one of the most successful NLP models, is built off of a pretty intuitive and simple idea: context is important. As J.R. Firth once said, “you shall know a word by the company it keeps.” CBOW uses the context to predict the word in the middle, while Skip-gram uses the middle word to predict the context. The idea is that since similar words appear in similar contexts, the Word2Vec vectors of similar words will be similar, while the vectors of unrelated words will be completely different. Here’s a simplified version of how they work:

Below is a slightly more sophisticated representation of how CBOW and Skip-gram train the neural network. For CBOW, based on a window of context words, each of which is a vector, the vector representation of the center/target word is predicted. As for Skip-gram, based on the vectorized center/target word, the vector representations of the window of context words are predicted.

For instance, let’s consider the sentence: “When the snow falls and the white winds blow, the lone wolf dies, but the pack survives” #GameOfThrones. Using CBOW, to transform the word “white” into a vector, the model would use the words around “white,” such as “snow falls” and “winds blow,” to train the neural network and predict the vector representation of “white.” And using Skip-gram, the word “white” would be used to predict the vector representation of neighboring words like “snow” and “wind.”
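Here’s a small toy sketch of my own (not Word2Vec itself) showing the (target, context) training pairs a Skip-gram style model would extract from that sentence with a window size of 2:

```python
sentence = ("when the snow falls and the white winds blow "
            "the lone wolf dies but the pack survives").split()
window = 2

# Skip-gram: each target word predicts its neighbors.
# (CBOW uses the same pairs in the other direction: neighbors predict the target.)
pairs = []
for i, target in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((target, sentence[j]))

# Pairs centered on "white":
# ('white', 'and'), ('white', 'the'), ('white', 'winds'), ('white', 'blow')
print([p for p in pairs if p[0] == "white"])
```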

Both CBOW and Skip-gram are incredibly powerful and return impressive results. While CBOW trains faster and represents common words better, Skip-gram generally represents rare words better. This is because CBOW uses the context to predict the (often common) word in the middle, giving better accuracy on frequent words, while Skip-gram trains on each word, including rarer ones, together with its context, making it more computationally expensive but better at representing rare words. Since there are more rare words than common words, and since runtime is not a large issue for Fasttext, I will be using Skip-gram for my humor detection project.

After the text is vectorized using Word2Vec (either CBOW or Skip-gram), each word is its own vector in n-dimensional space. If simplified to 2-D space, you can see that similar words are closer together, while unrelated words are farther apart.

Furthermore, words that have a similar relationship to one another are often a similar distance apart. This is how Fasttext is able to understand analogies.

Using the vectorized text, Fasttext then uses multinomial logistic regression (softmax regression) or negative sampling to perform text classification, in my case humor detection. These two methods are also extremely complex and beyond the scope of this blog. For my project, I will use hierarchical softmax, a simplified form of softmax regression. Hierarchical softmax approximates the full softmax, which makes computation much faster while maintaining extraordinarily high performance. As explained on the Fasttext website, “The idea [of hierarchical softmax] is to build a binary tree whose leaves correspond to the labels. Each intermediate node has a binary decision activation (e.g. sigmoid) that is trained, and predicts if we should go to the left or to the right. The probability of the output unit is then given by the product of the probabilities of intermediate nodes along the path from the root to the output unit leave.” [1].
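To give a feel for that description, here is a toy sketch (my own simplification, not Fasttext’s actual implementation) of how a label’s probability would be computed as a product of binary decisions along the path from the root to that label’s leaf:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def leaf_probability(node_scores, directions):
    """node_scores: raw score at each intermediate node on the path.
    directions: +1 to go right, -1 to go left at that node.
    The label's probability is the product of the binary decisions."""
    prob = 1.0
    for score, direction in zip(node_scores, directions):
        prob *= sigmoid(direction * score)
    return prob

# A hypothetical path of three intermediate nodes leading to the "funny" leaf.
print(leaf_probability([1.2, -0.4, 2.0], [+1, -1, +1]))
```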

Published Paper on Fasttext: https://arxiv.org/pdf/1607.04606.pdf [2].

 

Preprocessing

Since Fasttext represents each word as a bag of character n-grams, no stemming or lemmatizing was required. However, preprocessing steps such as noise removal and lowercasing were still essential. Curiously, removing stop words, stemming, and lemmatizing all hurt the performance of Fasttext – I’ll discuss why in the “Test Results” section of this blog.

One more preprocessing step had to be done: the prefix “__label__” (followed by each sentence’s class) had to be added in front of each sentence, which I accomplished with a short Python script.

And now the humor dataset is ready to be fed into the Fasttext model!

Funnily enough, while testing my trained Fasttext model, it kept returning a precision of around 6% – abysmal! After many hours of debugging, I realized that there had to be a space between “__label__” and each sentence, not a comma. Whoops!
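For reference, here is a minimal sketch of what the (fixed) labeling script might look like – the file names, column layout, and label names are assumptions, but the key detail is the space after the label:

```python
import csv

# Rewrite the humor dataset in Fasttext's expected format:
# "__label__<class> <sentence>" -- with a space (not a comma!) after the label.
with open("humor_dataset.csv", newline="", encoding="utf-8") as fin, \
     open("humor_train.txt", "w", encoding="utf-8") as fout:
    for text, is_funny in csv.reader(fin):  # assumed columns: text, is_funny
        label = "__label__funny" if is_funny == "True" else "__label__notfunny"
        fout.write(f"{label} {text.lower()}\n")
```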

 

Training Fasttext

Training the Fasttext model was easier than expected. After downloading the Fasttext library, all I had to do was specify a couple of hyperparameters, and voila – a Fasttext model was trained! Amazingly, training was exceptionally quick. While training the Random Forest Classifier took hours on end, Fasttext with hierarchical softmax finished 20 epochs of training in less than 2 minutes.

Here is how I trained a Fasttext model and then used it to predict whether a sentence was funny or not.

The training code lives in train_fasttext.py: it reads the values of the hyperparameters specified by the user, feeds them into the fasttext model to train, and then saves the trained model.
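In rough outline (file paths, argument names, and defaults here are illustrative, not my exact script), that training step looks something like this:

```python
import argparse
import fasttext

# Read the values of the hyperparameters specified by the user.
parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=0.1)
parser.add_argument("--epoch", type=int, default=5)
parser.add_argument("--wordNgrams", type=int, default=1)
parser.add_argument("--loss", default="hs")
args = parser.parse_args()

# Feed them into the fasttext model to train on the __label__-prefixed file...
model = fasttext.train_supervised(
    input="humor_train.txt",
    lr=args.lr,
    epoch=args.epoch,
    wordNgrams=args.wordNgrams,
    loss=args.loss,
)

# ...then save the trained model so it can be loaded later for prediction.
model.save_model("fasttext_humor.bin")
```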

The prediction code lives in fast_text.py. First, the corresponding trained model is loaded. Then I loop through the testing data, using the trained model to predict whether each sentence is funny or not. The list of predictions is then returned and sent to the postprocessing stages to determine the model’s accuracy, precision, recall, and other performance metrics.
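A sketch of that prediction loop (again, the file name and the shape of the test data are assumptions):

```python
import fasttext

def predict_funny(test_sentences, model_path="fasttext_humor.bin"):
    """Load the trained model and predict a label for each test sentence."""
    model = fasttext.load_model(model_path)
    predictions = []
    for sentence in test_sentences:
        # predict() returns (labels, probabilities); keep the top label.
        labels, _probs = model.predict(sentence)
        predictions.append(labels[0] == "__label__funny")
    return predictions
```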

 

Test Results

And here are the results of Fasttext:

Absolutely amazing! Without removing stopwords, lemmatizing, or stemming, Fasttext was able to achieve an accuracy of 95.6%. Below are the confusion matrices (normalized and without normalization) for the Fasttext model without stopwords removal, lemmatizing, or stemming:

38,706 true positives, 38,814 true negatives, and only 1,683 false positives and 1,877 false negatives. Truly incredible.
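Those counts line up with the reported accuracy – here is a quick sanity check of the metrics they imply:

```python
tp, tn, fp, fn = 38_706, 38_814, 1_683, 1_877

accuracy = (tp + tn) / (tp + tn + fp + fn)   # ~0.956 -> the reported 95.6%
precision = tp / (tp + fp)                   # ~0.958
recall = tp / (tp + fn)                      # ~0.954
print(accuracy, precision, recall)
```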

One insight (and anomaly) I noticed is that removing stopwords, lemmatizing, and stemming all hurt Fasttext’s performance, as shown in the table above. Why is this the case?

Removing stopwords is generally considered a pretty essential preprocessing stage. Stop words are common words that are presumed to appear too often to help in text classification (low predictive power). Examples include “I,” “he,” “to,” “a,” etc. If the text is vectorized with models like the Bag of Words model and TF-IDF, which take into account how often each word appears, and stopwords are not removed, machine learning models like the Random Forest Classifier will erroneously think these words have high predictive power and use them to determine whether a sentence is funny or not, when in reality they commonly occur in both funny and not funny sentences.

However, removing stopwords can be dangerous. The danger of removing stopwords is most apparent in sentiment analysis. Take the following two sentences for instance:

1. I love the actress Emilia Clarke.

2. I do not love the actress Emilia Clarke.

The two sentences above convey polar opposite sentiments. However, after removing stopwords, both sentences are reduced to the same fragment: “love actress Emilia Clarke.” The words “I,” “do,” “not,” and “the” are all considered stopwords, and removing them, especially the word “not,” completely changes the meaning of the second sentence. Nevertheless, as I’ve noted in earlier blogs, negator words like “not” do not have as large an impact on humor as they do on sentiment. In fact, I’ve found their impact to be negligible.

But removing stopwords has other drawbacks, one of the larger ones being that it often strips away important context. For example, many jokes follow the format “what do you call X?” However, when you remove the stopwords, all that is left is the word “call” along with the phrase “X”. The word “call” isn’t really a “funny” word – it can often appear in non-humorous sentences, such as “give me a call” and “don’t call me a bastard.” However, when the word “call” appears in the phrase “what do you call,” it is most likely a joke and thus funny. And since Fasttext relies so heavily on context, stripping that valuable context away hurts the performance of the model. For humor detection, removing stopwords decreased the accuracy of the Fasttext model by 2%!
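As a quick illustration of how much of that setup disappears, here is what NLTK’s English stopword list (an assumption about which list is used) does to a joke in that format:

```python
from nltk.corpus import stopwords  # requires nltk.download("stopwords")

stop_words = set(stopwords.words("english"))
joke = "what do you call a fish with no eyes"
kept = [w for w in joke.split() if w not in stop_words]
print(kept)  # roughly ['call', 'fish', 'eyes'] -- the "what do you call" setup is gone
```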

Lemmatizing and stemming also both hurt Fasttext’s performance, though not as noticeably as removing stopwords did. Stemming is the cruder of the two – chopping off prefixes and suffixes can lead to over-stemming and cause words to lose their meaning (ex. lied –> li). Lemmatizing, however, transforms words into their root form and always returns a real word. So why does it have a detrimental effect? The answer is that although lemmatizing is a little better than stemming, it can also make mistakes. I came across a surprising lemmatizing mistake while processing the headline “Five-year-old stopped on U.S. highway wanted to buy Lamborghini, police say” (Reuters).

The word “U.S.,” once lowercased and stripped of punctuation, became the word “us.” The WordNet lemmatizer treated “us” as a plural noun, and since “us” ends in “s,” “U.S.” became “u.” Although edge cases like this are not common, mistakes like this one still occur and explain the drop in Fasttext’s performance when stemming or lemmatizing was employed. The fact that stemming and lemmatizing both hurt Fasttext’s performance also highlights the power of Fasttext and character n-grams!
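You can reproduce that quirk directly with NLTK’s WordNet lemmatizer:

```python
from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()
# With the default part-of-speech (noun), "us" gets its trailing "s"
# stripped as if it were a plural noun.
print(lemmatizer.lemmatize("us"))  # expected output: 'u'
```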

As a sidenote, while the rule Contain_Funny and the Random Forest Classifier classified the headline “Five-year-old stopped on U.S. highway wanted to buy Lamborghini, police say” as not funny, Fasttext was somehow able to recognize this sentence as funny. Insane!

 

Hyperparameter Tuning

After playing around with the hyperparameters, I ended up setting the learning rate to 0.75, the number of epochs to 20, wordNgrams to 2, and the loss to hierarchical softmax (as discussed in the “How Does Fasttext Work?” section). The learning rate can be set to any number between 0 and 1. Setting it to 0 means that the model will not learn at all, while setting it to 1 might cause it to learn too quickly and overshoot the optimal point where loss is minimized and performance metrics like accuracy are maximized. Thus, a higher learning rate is riskier but can cause the model to train faster and reach the optimal point sooner. The number of epochs specifies how many passes the Fasttext model makes over the training data. Setting the number of epochs to 1 might lead to an underfit model that hasn’t learned much, but setting it to 100 might cause the model to become overfit to the training dataset. Below is a good example of underfitting and overfitting with linear regression:

I found setting the learning rate to 0.75 and the number of epochs to 20 to be happy mediums, resulting in neither underfitting nor overfitting. I set wordNgrams to 2, which meant that Fasttext considered words in groups of 2. Increasing wordNgrams from 1 to 2 helped the performance of Fasttext, but increasing wordNgrams to 3 or 4 hurt it. At first this confused me, but it made sense – I was overfitting the Fasttext model to the dataset.
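Put together, the final training call looked roughly like this (the input path is an assumption):

```python
import fasttext

model = fasttext.train_supervised(
    input="humor_train.txt",  # assumed path to the __label__-prefixed training file
    lr=0.75,                  # happy-medium learning rate
    epoch=20,                 # enough passes to learn without overfitting
    wordNgrams=2,             # consider words in groups of 2
    loss="hs",                # hierarchical softmax
)
```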

Learning rate, number of epochs, wordNgrams, and loss are some of the more important hyperparameters to tune, for changing them can have a large impact on the results of your Fasttext model. I should mention, though, that there are many more hyperparameters I could have played around with. The full list (from the Fasttext website) is shown below:

 

Next Steps

Fasttext’s results are truly incredible. When starting this project, I never could have dreamed of achieving an accuracy of 95.6% for humor detection. The next step in my project is to try to use my rules together (Contain_funny, Random Forest, and Fasttext) to see if I can achieve an even better model and even higher accuracy. Stay tuned for next week’s blog post!
