Hi everyone! Welcome back to my blog!
This week has been quite eventful. I’ve been able to find a large humor dataset and test my five rules!
In this blog post, I will talk about the following:
- Finding a Dataset
- What Makes a Word “Funny” or “Not Funny?”
- Test Results
- Test Metrics: Accuracy, Precision, Recall, f1, Confusion Matrix
- Why Negator Does Poorly
- Next Steps
Finding a Dataset
Finding a humor dataset containing thousands of funny sentences along with not funny sentences proved more difficult than I had imagined. After much research and digging, I struck gold. I found a GitHub repository containing over 400 thousand sentences, including about 200,000 short jokes from a Kaggle dataset, which contains jokes scraped from Twitter (funnytweeter.com, funtweets.com), Reddit, https://onelinefun.com/, http://thejokecafe.com/, and other websites. The non-humorous sentences consist of sentences from 2007 news articles, gathered using a “WMT162 news crawl that had the same distribution of words and characters as the jokes in the Short Jokes dataset on Kaggle.” As shown below, there are about the same number of funny sentences as not funny sentences.
Below, the head and tail of the dataset are shown. The full dataset can be found in this GitHub repository: https://github.com/orionw/RedditHumorDetection/tree/master/full_datasets/short_jokes/data.
This GitHub repository also contains datasets for puns, as well as a dataset of just Reddit jokes. However, puns only make up a small portion of jokes, and the Reddit jokes dataset is of worse quality. The short jokes dataset has potential downsides: many of the jokes are on the shorter side (many are one-liners), and its jokes often follow templates, for instance “what is the difference between _____ and _____?” In addition, it is important to point out that not everything that is funny is a joke, so training my model only on a dataset of short jokes might leave it unable to capture some other instances of when we laugh. Nevertheless, this dataset is much cleaner than the others, it contains more jokes, and the content of the jokes covers a wide variety of topics. Hence, I’ve decided to use the short jokes dataset for my project.
This dataset has been used in various publications on humor detection. I obtained the dataset, web scraper, and GitHub repository from the paper “Humor Detection: A Transformer Gets the Last Laugh” by Orion Weller and Kevin Seppi (https://www.aclweb.org/anthology/D19-1372.pdf).
What Makes a Word “Funny” or “Not Funny?”
How did I decide what words and phrases should be considered funny? In the beginning, I tried to come up with words that I would generally associate with humor off the top of my head. However, I soon realized that many of the words I came up with often appeared in sentences not associated with humor.
Then, I created a Python program to find the words (excluding stopwords) that occur most often in the humorous sentences. The 15 most common words in funny sentences are shown below on the left.
However, as you can see, almost all of the most commonly occurring words in funny sentences shouldn’t be considered “funny” words. “Joke” should definitely count, but “say,” “like,” and “im” shouldn’t. In fact, “say” is the most common word in not funny sentences as well, as shown in the graphic above on the right.
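For anyone curious how such a count works, here is a minimal sketch of the frequency-counting step. The stopword list and sample sentences here are placeholders for illustration; my actual program uses a much larger stopword list and the full dataset.

```python
from collections import Counter

# Tiny stopword list for illustration only; the real program uses a full list.
STOPWORDS = {"i", "a", "the", "to", "and", "of", "you", "is", "in", "it"}

def most_common_words(sentences, n=15):
    """Count how often each non-stopword appears across the sentences."""
    counts = Counter()
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in STOPWORDS:
                counts[word] += 1
    return counts.most_common(n)

funny = ["what do you call a fake noodle an impasta",
         "i told my wife a joke about unemployment it needs work"]
print(most_common_words(funny, n=3))
```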
To ensure that my “funny” words were actually funny, I created a Python program that found the words that appeared the most in the ‘funny’ sentences and the least in the ‘not funny’ sentences, using a metric called “percent funny”:
Percent funny = (# of occurrences in funny sentences)/(# of occurrences in funny sentences + # of occurrences in not funny sentences)
The top 40 results are shown below. As you can see, even the majority of these words should not be considered “funny” words. However, this metric does bring to my attention words like “fart,” “punchline,” “pun,” “haha,” “lol,” and others that are generally associated with humor.
So, using this metric along with my own intuition, I created humorwords.txt, which contains over 130 “funny words” and is what rule 1 (contain_funnyword) uses to detect humor. This metric also helped identify words that are common in phrases associated with humor. For instance, the words “lightbulb,” “yo,” and “mama” are high on the list, but not because they are “funny” words – it’s because of “yo mama” and “how many _____ does it take to change a lightbulb?” jokes. Thus, instead of adding the words “lightbulb,” “yo,” and “mama” to humorwords.txt, I added the phrases “change a lightbulb” and “yo mama” to humorphrases.txt.
I used this same process to create nothumorwords.txt. Here are the words least associated with humor that my Python program found:
The most “not funny” words found by my Python program were pretty surprising. Instead of “not funny” words that I thought would have topped the list, such as adjectives like “agony,” “torture,” and “sorrow,” my program returned mostly nouns, including some proper nouns like “sarkozy.” Although these nouns have an extremely low “percent funny,” most could still easily appear in jokes. Hence, it was much harder to find true “not funny” words.
However, before I could start testing my rules, there are a couple key pre-processing stages for NLP (natural language processing) that were essential to implement.
1. Noise Removal: The line of code `sentence = re.sub('[^a-zA-Z ]', '', sentence)` keeps only letters and spaces, removing noise from the text such as punctuation, special characters, escape characters (\n, \t, etc.), and leftover HTML tags. This helps clean up the text a lot!
2. Lowercasing: The line of code `sentence = sentence.lower()` transforms all uppercase letters to lowercase.
3. Tokenization: This transforms a sentence into a list of words.
4. Removing Stop Words: Stop words are common words that appear too often to help in text classification (low predictive power). Examples include “I,” “he,” “to,” “a,” etc. However, since many of the negator words in negators.txt and “funny” phrases in humorphrases.txt contain stop words, removing them would hurt my model’s performance. Although I did not implement this step for my rules-based approach, it is a key step for machine learning models that take the number of appearances of a word into account, including my random forest machine learning model, which I’ll discuss in my next blog post.
5. Normalizing: This pre-processing step converts typos and abbreviations of words to their root form. For instance, “befor” and “b4” would become “before.” I have not yet implemented this step.
6. Stemming or Lemmatizing: Both attempt to transform words to their root form.
Stemming: the more aggressive of the two, stemming chops off prefixes and suffixes, which may result in non-English words and cause words to lose their meaning. Ex: studied/studies –> studi, eating/eats –> eat, caring –> car
Lemmatizing: the more computationally expensive of the two, lemmatizing transforms versions of a word (ex. studies/studied/study/studying) to one base (ex. study). Unlike stemming, the base is always an English word. Lemmatizing requires part-of-speech tagging in order to be accurate.
Although it seems much better than stemming, its advantage in performance has been found to be small. I’ve found my model’s performance to be slightly better with lemmatizing than stemming, as stemming is a more crude method and has more room for error.
7. Text Enrichment/Augmentation: This includes countless advanced NLP techniques, such as named entity recognition, syntax tree manipulation, phrase extraction, word replacement, query expansion, etc.
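Steps 1 through 4 of the pipeline above can be sketched in a few lines using only the standard library (my actual code uses NLTK for tokenization and lemmatizing, which I omit here to keep the sketch self-contained; the stopword list is a placeholder).

```python
import re

STOPWORDS = {"i", "a", "the", "to", "is", "of", "did"}  # illustration only

def preprocess(sentence, remove_stopwords=False):
    """Steps 1-4 above: noise removal, lowercasing, tokenization,
    and (optionally) stop word removal."""
    sentence = re.sub('[^a-zA-Z ]', '', sentence)  # 1. keep only letters and spaces
    sentence = sentence.lower()                    # 2. lowercasing
    tokens = sentence.split()                      # 3. tokenization
    if remove_stopwords:                           # 4. off for my rules-based approach
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

print(preprocess("Why did the chicken cross the road?!"))
```

Note that `remove_stopwords` defaults to off, since (as explained in step 4) removing stop words would break my negator and phrase matching.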
As a reminder, these are my five rules:
1. Contain_funnyword: classifies a sentence as funny if it contains a “funny” word or phrase, i.e., a word or phrase generally associated with humor.
2. Contain_notfunnyword: classifies a sentence as not funny if it contains a “not funny” word, i.e., a word generally not associated with humor.
3. Contain_funnyword_negator: if there is a “negator” word in front of a “funny” word, outputs the opposite result of rule 1.
4. Contain_notfunnyword_negator: if there is a “negator” word in front of a “not funny” word, outputs the opposite result of rule 2.
5. Weighted average: takes a weighted average of the outputs of the first four rules
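To make rule 1 concrete, here is a minimal sketch of its logic. The word and phrase sets below are just a few sample entries standing in for the full humorwords.txt and humorphrases.txt files, and the input is assumed to be already preprocessed into tokens.

```python
def contain_funnyword(tokens, funny_words, funny_phrases):
    """Rule 1: classify as funny if any funny word or phrase appears."""
    if any(word in funny_words for word in tokens):
        return True
    text = " ".join(tokens)  # rejoin so multi-word phrases can match
    return any(phrase in text for phrase in funny_phrases)

funny_words = {"pirate", "fart", "pun"}            # sample entries from humorwords.txt
funny_phrases = {"yo mama", "change a lightbulb"}  # sample entries from humorphrases.txt
print(contain_funnyword("yo mama is so old".split(), funny_words, funny_phrases))
```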
And here are the results!
As I expected, rule 1 (contain_funny) performed the best on the humor dataset of over 400 thousand sentences. With lemmatizing, rule 1 classified 64% of sentences correctly. Its precision was 83% (not bad!), but its recall was 36%. But what do these metrics mean?
To best understand, let’s look at the confusion matrix for rule 1 (contain_funny) with lemmatizing.
The confusion matrix concisely summarizes all possible results.
The diagonal of a confusion matrix contains the correct classifications:
-72,144 funny sentences were correctly classified as funny (TP).
-188,080 not funny sentences were correctly classified by rule 1 as not funny (TN).
The top right box and bottom left box are incorrect classifications:
-In the top right box are FN (False Negatives): 130,548 funny sentences were erroneously classified as not funny.
-In the bottom left box are FP (False Positives): 14,628 not funny sentences were erroneously classified as funny.
Accuracy = (TP+TN) / (TP+TN+FP+FN). It’s the most commonly used performance metric, but it doesn’t tell the whole story, such as whether there are many FNs or FPs. Precision, recall, and f1 are better metrics.
Precision = TP/(TP+FP). Precision tells you what proportion of positive identifications was correct.
Recall = TP/(TP+FN). Recall tells you what proportion of actual positives was identified correctly.
f1 = (2*precision*recall)/(precision+recall). The f1 score is the harmonic mean of precision and recall.
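Plugging rule 1’s confusion matrix numbers into these four formulas reproduces the scores quoted earlier:

```python
# Rule 1 (contain_funny) with lemmatizing, from the confusion matrix above:
TP, TN, FP, FN = 72_144, 188_080, 14_628, 130_548

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.0%} precision={precision:.0%} recall={recall:.0%} f1={f1:.0%}")
# prints accuracy=64% precision=83% recall=36% f1=50%
```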
A high precision and recall are both crucial. However, for my humor detection project, I am more concerned with precision than recall. Even though rule 2 has an insanely high recall (99%), that’s only because it classifies almost everything as funny, which makes the rule essentially useless. In fact, its accuracy is the same as a coin flip.
Since most sentences are not funny, I feel it’s more important for my model to have a high proportion of positive identifications correct (high precision) and miss a few funny sentences (lower recall) than classify most sentences as funny and have a low proportion of positive identifications correct (high recall, low precision).
I also want to mention that the high precision for rule 1 (contain_funny) is likely an overestimate, as my humor words and phrases were overfit to the humor dataset. Still, it’s not bad!
Why Negator Does Poorly
Surprisingly, rules 3 & 4 (contain_funny_negator and contain_notfunny_negator, respectively) did quite poorly. To understand why, let’s look at the sentences below:
-Why can not the pirate get to any subreddits he keeps typing arrr funny
-It was not just a tragedy it was one of the worst tragedies ever not funny
The first sentence contains a “funny” word (“pirate”) and a negator in front, so contain_funny_negator would classify the sentence as not funny. Similarly, in the second sentence, there is a “not funny” word (“tragedy”) and a negator in front, so contain_notfunny_negator would incorrectly classify the sentence as funny.
In sentiment analysis, negators have an enormous impact. Adding a “not” in front of a word like “mad” completely changes the sentiment of the sentence. However, as you can tell, negators in non-humorous sentences don’t have much impact on the “funniness” of the sentence. In funny sentences, negators are often present – sometimes they barely have any impact (as in the example above), and other times they help create contrasts that make us laugh (per the incongruity theory)! Hence, the rules with negators do not perform well.
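The failure mode above is easy to see in a sketch of rule 3’s logic. The look-back window size and the tiny word lists here are illustrative assumptions, not my exact implementation.

```python
def contain_funnyword_negator(tokens, funny_words, negators, window=3):
    """Rule 3: flip rule 1's verdict when a negator appears shortly
    before a funny word (the window size here is illustrative)."""
    for i, token in enumerate(tokens):
        if token in funny_words:
            preceding = tokens[max(0, i - window):i]
            if any(t in negators for t in preceding):
                return False  # negated funny word: opposite of rule 1
            return True       # funny word, no negator: same as rule 1
    return False              # no funny word found

negators = {"not", "no", "never"}
funny_words = {"pirate"}
tokens = "why can not the pirate get to any subreddits".split()
print(contain_funnyword_negator(tokens, funny_words, negators))  # prints False
```

On the pirate sentence, the “not” in front of “pirate” flips rule 1’s (correct) verdict, so the joke is misclassified as not funny, exactly the failure described above.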
Analyzing my results, I can conclude that the negator rules (rules 3 & 4) do quite poorly, as does rule 2 (contain_notfunny), for there are few words that appear only in non-humorous sentences. Rule 1 (contain_funny) does the best. Hence, I plan on putting a hold on the other rules and working on improving rule 1.
Now that I’ve implemented a rules-based approach to humor detection, next week I’ll begin to implement an AI approach to humor classification. Hopefully I will be able to train a machine learning (random forest) model to classify humor! I also look forward to comparing my results to the results from the rules-based approach to see which does better (hopefully the machine learning model!).
Until next time! Stay safe everyone!