Week 10: Using my Rules Together – a Success!

May 29, 2020

Intro

Hi everyone! Welcome back to my blog!

After three months of hard work, my senior research project has come to an end. In this blog, I will share the pipeline architecture I built for this project, as well as my final results.

  1. Pipeline Architecture
  2. Using My Rules Together – Results!
  3. What I’ve Learned
  4. Future Work
  5. Next Steps

 

Pipeline Architecture

One of the main goals for this project was to design a flexible architecture that could implement a train/test/predict pipeline for humor detection. I feel my pipeline has fulfilled this goal. Below is a flowchart showing my pipeline’s architecture:

As you can see, there are three major steps in my pipeline: (1) preprocessing, (2) rules, and (3) postprocessing. I have also implemented two modes: an interactive mode, where a user can type in any sentence they like and the program will return its prediction of whether that sentence is funny or not, and a second mode, where a user can input a humor dataset for my model to train and test on. The second mode is more useful, but the interactive mode allows me to demo my rules live with any sentence and highlight the strengths and weaknesses of each rule.

1. Preprocessing

Whether the text is user-inputted or read from a text file containing thousands of sentences, the data first passes through the preprocessing handler. First, everything but text is removed (noise removal) and each word is lowercased. These two steps are considered “essential preprocessing” stages and always occur. The user can then specify whether they want to enable other preprocessing stages, including stopwords removal, stemming, and lemmatizing. Enabling and disabling preprocessing stages is simple – all the user has to do is change “enabled” to “True” if they want the text to go through a specific preprocessing stage, or to “False” if not. The user does this in the config dictionary (shown below), where the user specifies properties and metadata. The inner workings of the model and the complexity of the training in the backend are completely hidden away for the user’s convenience.
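
To make this concrete, here is a minimal sketch of what the preprocessing section of such a config dictionary could look like. The exact key and stage names are my own illustration for this post, not necessarily the ones in my code:

    # Illustrative config sketch: key and stage names are assumptions for this example.
    config = {
        "preprocessing": {
            # Essential stages, always enabled
            "noise_removal": {"enabled": True},
            "lowercase": {"enabled": True},
            # Optional stages the user can toggle
            "stopwords_removal": {"enabled": False},
            "stemming": {"enabled": False},
            "lemmatizing": {"enabled": True},
        }
    }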

There are two additional optional preprocessing stages: (1) train_randomforest and (2) train_fasttext. These only work when interactive mode is set to “False,” and they train new humor detection Random Forest Classifiers and fasttext models, respectively, with the hyperparameters specified by the user as metadata. Anyone can easily tune these hyperparameters by changing their values in the config.
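
For example, the metadata for these two training stages might look something like the snippet below. The hyperparameter names shown (n_estimators and max_depth for the Random Forest; epoch, lr, and wordNgrams for fasttext) are standard options in those libraries, but the exact keys in my config are an assumption here:

    # Hypothetical training-stage entries with example hyperparameters.
    training_stages = {
        "train_randomforest": {
            "enabled": True,
            "n_estimators": 100,  # number of trees in the Random Forest
            "max_depth": None,    # grow trees until the leaves are pure
        },
        "train_fasttext": {
            "enabled": True,
            "epoch": 25,          # passes over the training data
            "lr": 0.5,            # learning rate
            "wordNgrams": 2,      # use word bigrams as features
        },
    }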

2. Rules

After the text is preprocessed and new models are trained, the text goes through the rules handler. Again, as shown above, the user can easily change which rules they want to test by switching “enabled” to “True” and “False.” They can also specify how heavily they want each rule to be weighted. Since Fasttext returns the highest accuracy, it currently has the highest weight: the weight of Contain_Funny above is 0.25, the weight of the Random Forest Classifier is 0.25, and the weight of Fasttext is 0.5. So, if Random Forest thinks a sentence is funny but Contain_Funny and Fasttext classify the sentence as not funny, the model will predict the sentence is not funny. However, if Fasttext predicts a sentence to be funny but Contain_Funny and Random Forest don’t, the sentence will be classified as humorous.
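
Here is a minimal sketch of that weighted vote, assuming each rule returns 1 for “funny” and 0 for “not funny” and that a combined score of at least 0.5 counts as funny. The function itself is my own illustration of the logic, not the exact code:

    # Combine binary rule predictions into one funny/not-funny decision.
    def weighted_vote(predictions, weights):
        score = sum(w * p for w, p in zip(weights, predictions))
        return 1 if score >= 0.5 else 0  # 1 = funny, 0 = not funny

    # Contain_Funny, Random Forest, Fasttext weighted 0.25 / 0.25 / 0.5:
    weighted_vote([0, 1, 0], [0.25, 0.25, 0.5])  # only Random Forest says funny -> 0
    weighted_vote([0, 0, 1], [0.25, 0.25, 0.5])  # only Fasttext says funny -> 1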

For Contain_Funny, the user can specify additional metadata, including whether they want to use only “funny” words, only “funny” phrases, or both to detect humor. If they set “strict” to True for “humorphrases,” the model will use fewer of the “funny” phrases (only ones with lengths of 4 and 5). I will explain why I created this option in the next section.

For the Random Forest and Fasttext rules, the user can also specify the modelID metadata, which tells my program which Random Forest or Fasttext model to test with. For instance, if a Fasttext model has already been trained with a modelID of 1, there is no need to train the same model again: the user can skip the train_fasttext preprocessing stage and use the pre-trained Fasttext model directly by setting modelID to 1. If a modelID is not specified, the model that was just trained will be used to predict whether each sentence in the testing dataset is funny or not.
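
Putting these rule options together, a hypothetical rules section of the config could look like this (again, the key names are illustrative assumptions, not the exact ones in my code):

    rules_config = {
        "contain_funny": {
            "enabled": True,
            "weight": 0.25,
            "humorwords": {"enabled": True},
            "humorphrases": {"enabled": True, "strict": False},
        },
        "random_forest": {"enabled": True, "weight": 0.25, "modelID": 1},
        # No modelID for Fasttext: use the model that was just trained.
        "fasttext": {"enabled": True, "weight": 0.5},
    }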

3. Postprocessing

For each rule, the predictions are returned and fed into the postprocessing handler, where the user can decide which performance metrics they want to use, whether they want to save the results, and whether they want to create pretty confusion matrices for each rule’s results. If my program is set to interactive mode, the user will then be able to type in another sentence, and the pipeline will start over again.
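
As a rough sketch, the postprocessing step boils down to computing these metrics for each rule’s predictions, for example with scikit-learn (the function and variable names below are placeholders of mine):

    from sklearn.metrics import (accuracy_score, precision_score,
                                 recall_score, f1_score, confusion_matrix)

    def evaluate_rule(y_true, y_pred):
        """Compute the performance metrics for one rule's predictions."""
        return {
            "accuracy": accuracy_score(y_true, y_pred),
            "precision": precision_score(y_true, y_pred),
            "recall": recall_score(y_true, y_pred),
            "f1": f1_score(y_true, y_pred),
            "confusion_matrix": confusion_matrix(y_true, y_pred),
        }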

Some key features of my pipeline:

1. Every aspect of the pipeline is configurable (able to easily enable/disable stages/rules for experimentation), extensible (able to easily add new stages/rules), and composable (able to specify dependencies and execute stages in the correct order). There are abstract base classes for preprocessing, rules, and postprocessing, and all stages inherit the methods and attributes from these base classes, ensuring that they follow the same format and making new stages even easier to add (see the sketch after this list).

2. The pipeline design I have built for my humor detection project is reusable. This same pipeline can be easily ported and utilized for other projects, including other text-classification services and exploratory machine learning projects like sentiment analysis, topic modeling, and emotion recognition.

3. My pipeline is extremely modular. This makes it easy to parallelize the execution for faster run-times (though this has not been implemented yet). Each preprocessing stage, rule, and postprocessing stage is its own module, making the code comprehensible and extremely extensible.
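
To illustrate the abstract base class idea from point 1, here is a simplified sketch. The class and method names are my own for this example; the real ones in my project may differ:

    from abc import ABC, abstractmethod

    class Rule(ABC):
        """Every rule exposes the same interface, so new rules drop right in."""

        @abstractmethod
        def predict(self, sentence: str) -> int:
            """Return 1 if the sentence is predicted to be funny, else 0."""

    class ContainFunny(Rule):
        def __init__(self, funny_words):
            self.funny_words = funny_words

        def predict(self, sentence: str) -> int:
            return int(any(word in sentence.lower() for word in self.funny_words))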

 

Using My Rules Together – Results!

Fasttext by itself was able to achieve an insane accuracy of 95.6%. But I was curious: If I used the rules Contain_Funny, the Random Forest Classifier, and Fasttext together, could I achieve a higher accuracy?

Below are the results of using the three classifiers together when they were weighted equally (33-33-33). As you can see, Fasttext by itself outperformed the rules working together (average) in every metric – accuracy, precision, recall, and f1.

Trying different weights, such as increasing the weight for Fasttext and decreasing the weights for the other two (25-25-50), also failed to produce a model that outperformed Fasttext by itself.

As shown above, Fasttext outperforms Random Forest in every metric, so Random Forest was strictly inferior to Fasttext. But I noticed that Contain_Funny had a higher precision than Fasttext, as most of the sentences Contain_Funny classified as funny were actually funny. I tried to take advantage of Contain_Funny’s high precision by weighting Contain_Funny and Fasttext 50-50. But Fasttext still outperformed the average of Contain_Funny and Fasttext.

Although the above combinations did not work, I was not deterred. I thought, “How could I take advantage of Contain_Funny’s high precision?” And the answer came to me – increase its precision even more. Hence, I created a stricter Contain_Funny, which only takes into account the “funniest” humor phrases, those with a length of 4 or 5. This decreased the number of false positives from 793 to 31, increasing precision from 96.5% to 99.7%!
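
Building the strict phrase list amounts to a one-line filter, something like the sketch below (measuring phrase length in words, and the placeholder phrase list, are my assumptions here):

    humor_phrases = ["walks into a bar", "why did the chicken cross the road"]  # placeholder list
    # Keep only the "funniest" phrases: those that are 4 or 5 words long.
    strict_phrases = [p for p in humor_phrases if len(p.split()) in (4, 5)]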

Now, using Contain_Funny and Fasttext together, weighted 50-50, the resulting model (average) outperformed Fasttext. Accuracy increased by only 0.00016, but still!

Why did this work? There are thirty sentences that Fasttext mistakenly classifies as not funny, but these thirty jokes contain “funny” phrases, so the 50-50 average classifies them correctly as “funny.” Hence, the number of true positives increased, and so did accuracy! However, how little effect Contain_Funny had on Fasttext, even when Contain_Funny’s precision was near 100%, shows how powerful and accurate Fasttext is by itself.
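
As a toy calculation for one of those thirty sentences: Fasttext outputs 0 (not funny) but strict Contain_Funny outputs 1, so the 50-50 average reaches the 0.5 threshold and the sentence is labeled funny.

    score = 0.5 * 1 + 0.5 * 0              # Contain_Funny says funny, Fasttext says not funny
    prediction = 1 if score >= 0.5 else 0  # -> 1 (funny)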

 

What I’ve Learned

I’ve learned so much during these past few months interning at Clarabridge. In fact, I think it is safe to say that I’ve learned more about computer science, computer engineering, and machine learning in these past three months than in the past four years! Below are some of the applications/libraries/software I’ve had the opportunity to use and explore:

 

Future Work

Although the results of my humor detection project have exceeded my wildest dreams, there are always avenues I could explore to improve my model. As many of my Clarabridge colleagues pointed out to me during my final presentation, the model is only as good as the data it is trained on. Since my funny sentences are mostly short jokes from Twitter, Reddit, and other sources, my model might not be able to pick up more nuanced instances of humor. As one colleague pointed out, Clarabridge doesn’t come across many “yo mama” or “knock knock” jokes – most instances of humor are not in forms that are common in my short jokes dataset. Furthermore, there could be mislabeled sentences among the 200,000 non-humorous sentences compiled from a news crawl, for some of those sentences could actually be funny. Thus, a larger, more diverse, and more accurately labeled dataset (potentially one from a company using Clarabridge’s services) could improve the performance of my model. Exploring other advanced machine learning text classification libraries and models like GloVe, BERT, and PyCaret, as well as additional hyperparameter tuning, are also possible extensions and areas of future work.

 

Next Steps

The last step in my project is to finish up my presentation and record my formal presentation, detailing my senior research project experience as a whole and what I’ve learned. This presentation, along with a link to my code, will be in my next blog post. Stay tuned!

2 Replies to “Week 10: Using my Rules Together – a Success!”

  1. Charles T. says:

    This is awesome! Could you discuss a little about the differences between accuracy and precision when it comes to this program?

    1. Ethan H. says:

      Thanks, Charles!

      Accuracy, precision, recall, and f1 are all different performance metrics that capture different things.

      Accuracy = (TP+TN) / (TP+TN+FP+FN). It’s the most commonly used performance metric, but it doesn’t tell the whole story, such as whether there are a lot of FN (false negatives) or FP (false positives). Precision, recall, and f1 are better metrics for seeing that.

      Precision = TP/(TP+FP). Precision tells you what proportion of positive identifications was correct.

      Recall = TP/(TP+FN). Recall tells you what proportion of actual positives was identified correctly.

      f1 = (2*precision*recall)/(precision+recall). The f1 score is the harmonic mean of precision and recall.

      For Contain_Funny, while accuracy is around 76%, the number of false positives is much less than the number of false negatives (FP<<FN), so precision is much higher than recall (precision is about 96%, while recall is about 54%).
