Week 7: Improving Rule 1 (Contain_Funny) and Exploring Text Vectorization

May 24, 2020

Intro

Hi everyone! Welcome back to my blog!

In this blog post, I will talk about the following:

  1. Improving Humor Phrases
  2. Rules Results
  3. Text Vectorization
    1. One-hot Encoding
    2. Bag of Words (Count Vectorizer)
    3. Term Frequency – Inverse Document Frequency (TF-IDF)
  4. Next Steps

 

Improving Humor Phrases

While examining the results of rule 1 (contain_funny), which classified a sentence as funny if it contained a “funny” word or phrase, I realized that my humor phrases’ performance was quite poor.

Out of the 14,628 false positives (sentences predicted as funny but were actually not funny) for contain_funny, 2,089 were because of “funny” words, but 12,539 were because of “funny” phrases! This was particularly concerning for multiple reasons:

(1) Intuitively, humorphrases should perform as well as or better than humorwords. If a sentence contains a “funny phrase,” it should almost always actually be funny. There should not be that many false positives, and certainly not six times more false positives than humorwords produced.

(2) The main goal of rule 1, contain_funny, is to achieve an extremely high precision, which means minimizing false positives. High precision is essential: if a machine learning model mistakenly classifies a sentence as not funny, contain_funny can recognize the funny word/phrase and correctly override that prediction as funny, but only if the rule itself is rarely wrong. A low precision defeats the purpose of the rule.
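For reference, here is a minimal sketch of what a containment rule like contain_funny boils down to. This is an illustration, not my exact implementation, and it assumes the word and phrase lists are stored one lowercase entry per line in humorwords.txt and humorphrases.txt.

def load_list(path):
    # Assumed file format: one lowercase word or phrase per line
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

humor_words = set(load_list("humorwords.txt"))
humor_phrases = load_list("humorphrases.txt")

def contain_funny(sentence):
    tokens = sentence.lower().split()
    text = " ".join(tokens)
    # Predict "funny" if the sentence contains any funny word or funny phrase
    return any(token in humor_words for token in tokens) or any(
        phrase in text for phrase in humor_phrases
    )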

Hence, I set out to improve the performance of my “funny” phrases. Instead of coming up with funny phrases off the top of my head and hoping they would have a high precision, I decided to approach the problem of finding “funny” phrases empirically. Similar to how I eventually found the funniest words using the metric “percent funny,” I used the “ngrams” helper from the “nltk” library to find the phrases of length 2-5 that occurred the highest percentage of the time in funny sentences as opposed to not funny ones. The results are below:

 

Even though all of the phrases above only occurred in funny sentences (percent funny = 1), the longer phrases were more likely to have a higher precision. While some funny phrases of length 2, such as “knock knock” and “yo mama,” almost always appear in jokes, others like “my cat” and “call someone” that also have a percent funny of 1 could easily appear in non-humorous sentences. On the other hand, there is a much smaller chance that the phrases of length 4 and 5 shown above will appear in sentences that are not funny. Because of this, I included more long “funny” phrases and fewer “funny” phrases of length 2 and 3 in my humorphrases.txt.

In summary, I improved my list of funny phrases from around 70 phrases that I thought of to around 3,000 funny phrases (~20 funny phrases of length 2, ~800 funny phrases of length 3, ~1,000 funny phrases of length 4, and ~1,000 funny phrases of length 5) found empirically.
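Here is a rough sketch of how that empirical search can be done with nltk’s ngrams helper. The variable names and the (text, is_funny) structure are placeholders for illustration, not my exact code.

from collections import Counter
from nltk.util import ngrams

def top_percent_funny_phrases(sentences, n, min_count=5):
    # sentences: iterable of (text, is_funny) pairs from the humor dataset
    funny_counts, total_counts = Counter(), Counter()
    for text, is_funny in sentences:
        for gram in set(ngrams(text.lower().split(), n)):
            total_counts[gram] += 1
            if is_funny:
                funny_counts[gram] += 1
    # "percent funny" = fraction of a phrase's occurrences that are in funny sentences
    scores = {
        gram: funny_counts[gram] / total_counts[gram]
        for gram, count in total_counts.items()
        if count >= min_count  # skip phrases that are too rare to trust
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# e.g. the top length-4 phrases: top_percent_funny_phrases(sentences, n=4)[:20]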

 

Rules Results

As a reminder, these were the results before I had improved the humor phrases:

And these are the improved results:

Accuracy, precision, recall, F1: my improved “funny” phrases caused contain_funny (rule 1) to do better in all of these performance metrics!

Here are the confusion matrices for contain_funny (rule 1) with lemmatizing, before and after side-by-side (before on the left, after on the right).

As you can see, the number of false positives has drastically decreased, and the number of true positives has increased as well, yielding a much higher precision – exactly what I was aiming for. In addition, the number of true negatives increased, and the number of false negatives decreased. Yay!
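For anyone following along, here is a quick sketch (with made-up labels, not my actual evaluation code) of how these metrics fall out of a confusion matrix using scikit-learn:

from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, precision_score, recall_score

# Toy labels: 1 = funny, 0 = not funny
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
print("precision:", precision_score(y_true, y_pred))  # tp / (tp + fp), what rule 1 aims to maximize
print("recall:", recall_score(y_true, y_pred))        # tp / (tp + fn)
print("accuracy:", accuracy_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))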

 

Text Vectorization

An accuracy of 77% without any machine learning is pretty amazing. Yet, I was excited to see how well a machine learning model could do!

However, before I could implement a machine learning model, there was one essential pre-processing task that had to occur: text vectorization. Machine learning models like KNN (K Nearest Neighbors), Naive Bayes Classifier, and Random Forest Classifier are unable to understand words – they understand numbers. I had to somehow transform all the words in each sentence into vectors, numbers computers could comprehend and utilize to train machine learning models.

There are numerous methods of text vectorization. I will discuss three of the simpler ones:

  1. One-hot Encoding
  2. Bag of Words (Count Vectorizer)
  3. Term Frequency – Inverse Document Frequency (TF-IDF)

I will discuss more advanced methods of text vectorization, including Word2Vec (the skip-gram and continuous bag-of-words models), in later blog posts!

 

One-Hot Encoding

One-hot encoding is arguably the simplest method of text vectorization.

Consider the sentences: “the dog jumped on the table” and “football is the best sport”

The entire set of words based on these two sentences is “the,” “dog,” “jumped,” “on,” “table,” “football,” “is,” “best,” and “sport.” Using one-hot encoding, if a word is present, it is encoded as 1. If a word is not present, it is encoded as 0.

For the first sentence, “the,” “dog,” “jumped,” “on,” and “table” would be represented as 1, while “football,” “is,” “best,” and “sport” would be represented as 0. And for the second sentence, “football,” “is,” “the,” “best,” and “sport”  would be represented as 1, while “dog,” “jumped,” “on,” and “table” would be represented as 0.
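Here is a minimal sketch of this presence/absence encoding using scikit-learn’s CountVectorizer with binary=True (one of several ways to produce it; the two sentences are the ones above):

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the dog jumped on the table", "football is the best sport"]

# binary=True records only whether each word is present (1) or absent (0)
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())  # the combined vocabulary of both sentences
print(X.toarray())                         # one row of 0s and 1s per sentence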

Simple, but not that powerful or useful.

 

Bag of Words Model

The Bag of Words model is a slight improvement over one-hot encoding. Instead of representing each word as 0 or 1 based on whether it is present in a sentence, it represents each word as the number of times it appears in the sentence.

Below is an example of the bag of words model at work:

And below is the bag of words model at work for the first five sentences in the stopwords-less & lemmatized humor dataset of over 400,000 sentences:

What is the purpose of the bag of words model? Its purpose is to capture the words that have the highest predictive power. For instance, the computer will be able to recognize that the word “knock” is a relatively “funny” word: through the bag-of-words model, “knock” appears twice in many of the funny sentences and is thus encoded as 2, while it is encoded as 0 for most of the non-humorous sentences. Based on how many times each word occurs, machine learning models can be trained to recognize which words are funny and which are not.
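Here is a minimal sketch of that counting behavior (the knock-knock sentence is a made-up example):

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "knock knock who is there",
    "the dog jumped on the table",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(X.toarray())  # "knock" is counted as 2 in the first row and 0 in the second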

However, as you can imagine, there are numerous downsides to this method of text vectorization. I feel these four drawbacks are the largest:

  1. Stopwords
  2. Other Words with Low Predictive Power
  3. Sparse Vectors
  4. Context is Ignored

 

1. Stopwords

First and foremost, stopwords are a problem. As a reminder, stopwords are common words that appear too often to help in text classification (low predictive power). Examples include “I,” “he,” “to,” “a,” etc. The bag of words model will mistakenly think that stopwords are important, since they are present so often.

However, this problem can easily be resolved. A bag of words model can be created with CountVectorizer, a class from scikit-learn. It is typically created with the line:

vectorizer = CountVectorizer()

However, you can easily remove stopwords by passing the “stop_words” parameter:

vectorizer = CountVectorizer(min_df=5, max_df=0.8, max_features=1500, stop_words='english')

You can also specify other parameters, as shown above, including “max_features,” which tells the bag of words model to only consider the 1500 most common words. You can also tell the vectorizer to only consider words that appear in at least 5 sentences with “min_df,” and to ignore words that appear in more than 80 percent of sentences with “max_df.” These parameters can be extremely powerful and can drastically improve the performance of machine learning models.
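Putting these parameters together, a rough sketch of fitting the vectorizer looks like the following. The sentences here are placeholders; on the real 400,000-sentence dataset I would keep min_df=5, but that would empty out a toy corpus this small, so it is lowered for the example.

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "knock knock who is there",
    "the dog jumped on the table",
    "football is the best sport",
]

vectorizer = CountVectorizer(min_df=1, max_df=0.8, max_features=1500, stop_words='english')
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # stopwords like "the" and "is" are gone
print(X.shape)                             # (number of sentences, vocabulary size capped at 1500)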

 

2. Other Words with Low Predictive Power

Another large problem with the bag of words model is that there are countless other words that are not included in the stopwords library that have low predictive power. For instance, the most common word in both funny sentences and not funny sentences is the word “say.”

“Say” is not in the stopwords library, but it also has low predictive power. The “max_df” parameter wouldn’t eliminate it from consideration either: the humor dataset contains over 400,000 sentences, and the word “say” only occurs roughly 60,000 times, which is around 15 percent of sentences, far below the usual “max_df” cutoff of 0.8 or higher.

Hence, words like “say” and “make” end up having a large influence on whether a sentence is classified as funny or not, even though they occur often in both funny and not funny sentences and thus have low predictive power.

 

3. Sparse Vectors

Here is the bag of words model shown again for the first five sentences in the stopwords-less & lemmatized humor dataset of over 400,000 sentences:

Does something stand out? Each vector for each sentence contains hundreds of 0’s. Since CountVectorizer() takes the 1500 most common words into account, and each joke or non-humorous sentence only contains around ten to thirty words, the rest of each vector is all zeros. This results in far more memory being used than necessary, making the training of machine learning models extremely computationally expensive.
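As a rough illustration (toy sentences again), you can see the sparsity by counting the nonzero entries in the matrix CountVectorizer returns, which scikit-learn already stores as a SciPy sparse matrix:

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "knock knock who is there",
    "the dog jumped on the table",
    "football is the best sport",
]

X = CountVectorizer().fit_transform(texts)

total_cells = X.shape[0] * X.shape[1]
print(X.nnz, "nonzero entries out of", total_cells)
# With a 1500-word vocabulary and roughly 10 to 30 words per sentence,
# the vast majority of each row would be zeros.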

 

4. Context is Ignored

This might be the largest issue with the bag of words model out of the four I have identified. By just taking into account how often each word occurs, the context for each sentence is pretty much completely ignored. This is quite unfortunate, as countless jokes are dependent on context.

This problem, however, can be partially fixed with a CountVectorizer parameter called “ngram_range.” I used this same concept to find the phrases that were the funniest! Below are the results of CountVectorizer when I set “ngram_range” to (3, 3).

This does sort of fix the problem of context. However, it will only capture the connection between words if the phrasing is exactly the same. Furthermore, increasing the n-gram length only makes each sentence’s vector even more sparse. Thus, this is nowhere near a perfect solution: context is still extremely difficult for the bag of words model to fully capture and understand.
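Here is a small sketch of what ngram_range=(3, 3) does to the features (toy sentences, not my dataset):

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "knock knock who is there",
    "the dog jumped on the table",
]

# ngram_range=(3, 3) builds features from three-word sequences instead of single words
vectorizer = CountVectorizer(ngram_range=(3, 3))
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())  # includes 'knock knock who', 'who is there', ...
print(X.toarray())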

 

TF-IDF (Term Frequency – Inverse Document Frequency) Model

TF-IDF builds on the bag of words model. It does not fix the problems the bag of words model has with sparse vectors and ignored context. Nevertheless, it does address the second problem above: words that are not stopwords but still have low predictive power.

TF-IDF takes into account both term frequency, which is what the bag of words model is solely based on, and inverse document frequency. Here is the classic equation for TF-IDF (with w_i,j = TF-IDF_i,j):

w_i,j = tf_i,j × log(N / df_i)

where tf_i,j is the number of times word i appears in sentence j, df_i is the number of sentences containing word i, and N is the total number of sentences.

Looks complicated, right? In fact, the equation scikit-learn uses is even a bit more complex:

TF-IDF_i,j = tf_i,j × (ln((N + 1) / (df_i + 1)) + 1)

But the specifics aren’t really that important; it’s better to look at the big picture. Essentially, TF-IDF attempts to highlight words that occur often, just as the bag of words model does, but only those words that don’t also appear across most of the documents.

Let’s take the word “say,” for example. “Say” has a high term frequency (tf), but it appears in a lot of sentences, both funny and not funny ones, so it also has a high document frequency (df). Thus, log(N/df_i) is small, and TF-IDF returns a smaller number. Hence, words that appear in a lot of sentences in the dataset are downscaled in importance, solving the problem that words with low predictive power posed.
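As a rough back-of-the-envelope check, treating the ~60,000 occurrences of “say” as roughly 60,000 sentences (an approximation), here is what scikit-learn’s smoothed idf term would give “say” compared to a much rarer word:

import math

N = 400_000        # approximate number of sentences in the dataset
df_say = 60_000    # rough document frequency of "say" (an assumption)
df_rare = 100      # a hypothetical rare word

def idf(df):
    # scikit-learn's smoothed inverse document frequency
    return math.log((N + 1) / (df + 1)) + 1

print(round(idf(df_say), 2))   # about 2.9: "say" is heavily downweighted
print(round(idf(df_rare), 2))  # about 9.3: a rare word keeps a large weight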

Here is TF-IDF at work for the three example sentences:

As you can see, words that appear often in a sentence, but don’t appear across all the sentences, such as the word “knock,” have the highest TF-IDF, as they are generally the words with the highest predictive power. On the other hand, the word “art,” which also appears twice in the knock-knock joke, has a lower TF-IDF because it appears in other sentences as well (ex. the last sentence). Furthermore, words like “the,” “is,” and “like,” which appear often across all the sentences, have a lower TF-IDF, as they are not as important in determining whether a sentence should be classified as funny or not.

Below is the TF-IDF model shown again for the first five sentences in the stopwords-less & lemmatized humor dataset of over 400,000 sentences:

Since TF-IDF is better than one-hot encoding or the bag of words model, I will be using it to vectorize the sentences in my humor dataset.
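Here is a minimal sketch of how that vectorization might look with scikit-learn’s TfidfVectorizer (placeholder sentences; the exact parameters I end up using may differ):

from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "knock knock who is there",
    "the dog jumped on the table",
    "football is the best sport",
]

# On the full dataset I could also pass min_df, max_df, and max_features as before
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(texts)

print(vectorizer.get_feature_names_out())
print(X.toarray().round(2))  # each row is an L2-normalized vector of TF-IDF weights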

 

Next Steps

The next step for my project is to use the TF-IDF vectorized humor dataset to train a machine learning model. Since I had already implemented a Random Forest Classifier to perform fruit classification earlier in my project, I will attempt to use that same Flask application to perform humor detection. Wish me luck!
