Week 2: Creating the Architecture for My Project

Mar 29, 2020

Intro

Hi everyone. Welcome back to my blog!

This week, Clarabridge instituted a work from home policy, so I’ve been working in the comfort of my bedroom. Although working from home has been nice, I really miss buzzing into Clarabridge’s spacious headquarters, munching on their free snacks, and drinking as much soda as I want. But, most of all, I miss chatting with and learning from all the incredibly nice, helpful, and knowledgeable people at Clarabridge.

Nevertheless, this week has been a whirlwind! I’ve had the chance to explore so many new computer science topics under my mentor’s guidance, and I’ve also had the opportunity to improve my project’s overall infrastructure and make it more flexible, adaptable, agnostic, and robust.

In this blog post, I will discuss in detail (1) the current architecture for my project, which includes four API endpoints: /train, /models, /predict, and /test, (2) what I used to create these endpoints, and (3) next steps.

 

My Four API Endpoints

After a week of hard work, I have implemented four API endpoints:

1. /train – When a POST request is sent with the correct payload, this endpoint trains the machine learning model.

PAYLOAD: The correct payload for this endpoint consists of a name, a boolean flag "test" (which determines whether the endpoint should output cross-validation accuracy scores), and the data required to train the model: the feature_names (names of the independent variables), y (name of the dependent variable), and rows (the data itself). Right now, I am using the task of fruit classification to test my project's architecture.

– Hence, the following is an acceptable sample payload: {"name": "fruitdata", "test": "True", "data": {"feature_names": ["mass", "width", "height", "color_score"], "y": "fruit_label", "rows": ["1 apple granny_smith 192 8.4 7.3 0.55", "1 apple granny_smith 180 8.0 6.8 0.59", "2 mandarin mandarin 86 6.2 4.7 0.80", "3 orange spanish_jumbo 342 9.0 9.4 0.75", "3 orange selected_seconds 160 7.0 7.4 0.81", "4 lemon unknown 118 6.1 8.1 0.70"]}}

OUTPUT: This endpoint returns the model's cross-validation accuracy scores as a list, the average cross-validation accuracy score (calculated using sklearn's Random Forest Classifier), and the current model ID, all as a JSON response.

– Sample output (if "test" is "True" and it's 5-fold cross-validation): {"Average cross_val_score": "0.9333333333333332", "Cross_val_score": "[0.91666667 0.91666667 0.91666667 0.91666667 1]", "modelID": 3}
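For the curious, here's a minimal sketch of that cross-validation step, assuming the payload's rows have already been parsed into a feature matrix X and a label vector y (the function name train_and_score is my own, purely for illustration):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def train_and_score(X, y, folds=5):
    # Fit a Random Forest on the full training data
    model = RandomForestClassifier()
    model.fit(X, y)

    # 5-fold cross-validation accuracy, matching the sample output above
    scores = cross_val_score(RandomForestClassifier(), X, y, cv=folds)
    return model, scores, np.mean(scores)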

SAVING DATA: In addition, this endpoint saves important data to various files so that, in the unfortunate event that the program crashes, the trained models are not lost (see the sketch after this list).

– To accomplish this, I save the trained AI model into a .pkl (pickle) file named "{name}_{strid}.pkl".

         – The pickle module in Python is used to save and load objects.
         – {strid} refers to a UUID: a randomly generated 128-bit number that is, for all practical purposes, guaranteed to be unique.
         – {name} refers to the name provided by the user in the payload.

– The variable "num_models," which equals the total number of trained models, is saved in the text file "num_of_models.txt".

– Lastly, the Python dictionary "master_dict" is dumped to the JSON file "modelID_dict.json".

          – The keys in the dictionary are the model IDs (e.g. 1, 2, 3).
          – The values in the dictionary are the names of the pickle files (e.g. "{name}_{strid}.pkl").
          – Thus, a master_dict dictionary with 3 models could look something like this: {"1": "fruitdata_ce1029da-6d46-11ea-9fc2-0242ac110002.pkl", "2": "fruitdata2_d17eff24-6d46-11ea-9eba-0242ac110002.pkl", "3": "fruitdata3_e623520e-6d46-11ea-a9…"}
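Here's a minimal sketch of that saving logic, assuming the num_models and master_dict globals described above (save_model is my own illustrative name, not necessarily what's in my app.py):

import json
import pickle
import uuid

def save_model(model, name):
    global num_models, master_dict

    # Build a unique filename from the user-supplied name and a UUID
    filename = f"{name}_{uuid.uuid1()}.pkl"
    with open(filename, "wb") as f:
        pickle.dump(model, f)

    # Register the new model under the next model ID
    num_models += 1
    master_dict[str(num_models)] = filename

    # Persist the counter and the ID-to-filename mapping
    with open("num_of_models.txt", "w") as f:
        f.write(str(num_models))
    with open("modelID_dict.json", "w") as f:
        json.dump(master_dict, f)

    return num_models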

ERROR: If a GET request is sent (the wrong type of HTTP request, since it is impossible to train an ML model without feeding in the training data), the error message {"error": "Use POST request"} is returned as JSON.
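Putting the pieces together, the /train route might look roughly like the following, reusing the train_and_score and save_model sketches from above (parse_training_data is a hypothetical helper that turns the payload's rows into X and y; this is a sketch, not my exact app.py):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/train", methods=["GET", "POST"])
def train():
    # Reject GET requests: training requires a payload
    if request.method != "POST":
        return jsonify({"error": "Use POST request"})

    payload = request.get_json()
    X, y = parse_training_data(payload["data"])  # hypothetical helper

    model, scores, avg = train_and_score(X, y)
    model_id = save_model(model, payload["name"])

    response = {"modelID": model_id}
    if payload.get("test") == "True":
        response["Cross_val_score"] = str(scores)
        response["Average cross_val_score"] = str(avg)
    return jsonify(response)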

 

2. /models – When a GET request is sent, this endpoint returns a list of the trained models.

PAYLOAD: None (the /models endpoint only accepts GET requests, which do not require a payload)

OUTPUT: This endpoint returns a dictionary, containing the model IDs as keys and string representations of the trained models as values, as a JSON response.

– Sample output (in this example, two models have been trained): {"1": "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n criterion='gini', max_depth=None, max_features='auto',\n max_leaf_nodes=None, max_samples=None,\n min_impurity_decrease=0.0, min_impurity_split=None,\n min_samples_leaf=1, min_samples_split=2,\n min_weight_fraction_leaf=0.0, n_estimators=100,\n n_jobs=None, oob_score=False, random_state=None,\n verbose=0, warm_start=False)",
  "2": "RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,\n criterion='gini', max_depth=None, max_features='auto',\n max_leaf_nodes=None, max_samples=None,\n min_impurity_decrease=0.0, min_impurity_split=None,\n min_samples_leaf=1, min_samples_split=2,\n min_weight_fraction_leaf=0.0, n_estimators=100,\n n_jobs=None, oob_score=False, random_state=None,\n verbose=0, warm_start=False)"}

CODE LOGIC: To accomplish this, I load data from "modelID_dict.json" to get the names of the pickle files. Then I open each pickle file and use pickle.load() to retrieve the trained machine learning model from "{name}_{strid}.pkl" (sketched below).

ERROR: If a POST request is sent (the wrong type of HTTP request, as no new information is required for this endpoint), the error message {"error": "Use GET request"} is returned as JSON.
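A minimal sketch of that route, reusing app, request, jsonify, pickle, and the master_dict global from the sketches above:

@app.route("/models", methods=["GET", "POST"])
def models():
    # Reject POST requests: this endpoint needs no payload
    if request.method != "GET":
        return jsonify({"error": "Use GET request"})

    loaded = {}
    for model_id, filename in master_dict.items():
        # Load each trained model back from its pickle file
        with open(filename, "rb") as f:
            loaded[model_id] = str(pickle.load(f))  # string form, as in the sample output
    return jsonify(loaded)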

 

3. /predict – When a POST request is sent with the correct payload, this endpoint uses the specified model to predict how the new data should be classified.

PAYLOAD: The correct payload for this endpoint consists of a modelID (telling the program which model the user wishes to use to make predictions) and the data that the user wants to make predictions on. This includes the feature_names (names of independent variables) and rows (the data).

– Sample payload for fruit classification: {"modelID": "3",
"data": {"feature_names": ["mass", "width", "height", "color_score"], "rows": ["192 8.4 7.3 0.55", "180 8.0 6.8 0.59", "176 7.4 7.2 0.60", "86 6.2 4.7 0.80", "84 6.0 4.6 0.79", "80 5.8 4.3 0.77"]}}

CODE LOGIC: This endpoint uses the files "modelID_dict.json" and "{name}_{strid}.pkl" to retrieve the correct model. It then feeds the data provided by the user into that specific model, returning its predictions as a list (see the sketch after the error cases below).

OUTPUT: Sample output for the payload above (the model predicts how each data point should be classified, returning a list of predictions): {"Predictions": "[1 1 2 4 3 4]"}

ERROR:

– If the modelID is not valid (modelID < 1 or modelID > num_models), this endpoint returns the error message {"Error": "modelID index out of range"} as JSON.
– If a GET request is sent (the wrong type of HTTP request, since predictions cannot be made without the data to predict on), the error message {"error": "Use POST request"} is returned as JSON.
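Here's a rough sketch of the /predict route under the same assumptions as before (parse_feature_rows is a hypothetical helper that converts the whitespace-separated row strings into a numeric feature matrix):

@app.route("/predict", methods=["GET", "POST"])
def predict():
    if request.method != "POST":
        return jsonify({"error": "Use POST request"})

    payload = request.get_json()
    model_id = int(payload["modelID"])

    # Validate the model ID against the number of trained models
    if model_id < 1 or model_id > num_models:
        return jsonify({"Error": "modelID index out of range"})

    # Look up the pickle file and load the requested model
    with open(master_dict[str(model_id)], "rb") as f:
        model = pickle.load(f)

    X = parse_feature_rows(payload["data"]["rows"])  # hypothetical helper
    return jsonify({"Predictions": str(model.predict(X))})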

 

4. /test – When a POST request is sent with the correct payload, this endpoint returns the accuracy of the specified model on new, labeled data.

PAYLOAD: The payload for this endpoint is extremely similar to the payload for the /predict endpoint. However, in "rows," the user must also specify the correct classification of each row.

CODE LOGIC: This endpoint uses the files "modelID_dict.json" and "{name}_{strid}.pkl" to retrieve the correct model. It then feeds the user-provided data into that model, calculates its predictions, and compares them to the true classifications (defined in "rows"); see the sketch at the end of this section.

OUTPUT: This endpoint outputs the cross-validation accuracy scores as a list, the average cross-validation accuracy score (calculated using sklearn's Random Forest Classifier), and the current model ID, as a JSON response.

– Sample output (if modelID is 1 and it's 3-fold cross-validation): {"Average cross_val_score": "0.9333333333333332", "Cross_val_score": "[0.9 0.9 1]", "modelID": 1}

ERROR: If a GET request is sent (the wrong type of HTTP request, since accuracy cannot be computed without labeled data to predict on), the error message {"error": "Use POST request"} is returned as JSON.
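The sample output above reports cross-validation scores, but the predict-and-compare step described under CODE LOGIC could look like this sketch, with sklearn's accuracy_score standing in as my own choice of comparison metric (parse_labeled_rows is a hypothetical helper that splits each labeled row into its features and its true class):

from sklearn.metrics import accuracy_score

@app.route("/test", methods=["GET", "POST"])
def test():
    if request.method != "POST":
        return jsonify({"error": "Use POST request"})

    payload = request.get_json()
    model_id = int(payload["modelID"])
    if model_id < 1 or model_id > num_models:
        return jsonify({"Error": "modelID index out of range"})

    # Load the requested model, exactly as in /predict
    with open(master_dict[str(model_id)], "rb") as f:
        model = pickle.load(f)

    # Split labeled rows into features and true labels (hypothetical helper)
    X, y_true = parse_labeled_rows(payload["data"]["rows"])

    # Compare the model's predictions against the true classifications
    predictions = model.predict(X)
    return jsonify({"Accuracy": float(accuracy_score(y_true, predictions)),
                    "modelID": model_id})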

 

How I Created the 4 API Endpoints

I built these endpoints using Python, Flask, and Docker. Along the way, I researched Flask routing, UUIDs, HTTP status codes (200s vs. 400s vs. 500s), Docker volumes, reading and writing text files in Python, reading and writing JSON files in Python, Python dictionaries, and the Python pickle module (dumping and loading objects from .pkl files). I also learned about uploading files with Docker, researched HTML input types, learned the difference between cold and warm starts, and built a flexible data structure for POST request payloads.

To make the program more efficient, I read num_models from "num_of_models.txt" and master_dict from "modelID_dict.json" at the beginning of the Python program "app.py". As a result, these two files don't have to be reopened in the /models, /predict, and /test endpoints. On the server's very first startup these files don't exist yet, so I catch the exception and initialize num_models to 0 and master_dict to an empty dictionary.
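That startup logic might look something like this minimal sketch (I'm assuming the caught exception is FileNotFoundError):

import json

# Load persisted state at startup; fall back to a clean slate on first run
try:
    with open("num_of_models.txt") as f:
        num_models = int(f.read())
    with open("modelID_dict.json") as f:
        master_dict = json.load(f)
except FileNotFoundError:
    num_models = 0
    master_dict = {}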

 

Next Steps

Now that I have this infrastructure built, it will be easier to implement my humor detection model, as I can use these same endpoints to train and test it. To prepare for this next step, I will read up on what makes something funny, along with how to train a machine learning model to detect humor. I plan on researching word embeddings and various NLP (natural language processing) algorithms, including Bag of Words (BoW), Term Frequency–Inverse Document Frequency (TF-IDF), co-occurrence matrices, Word2Vec (the Continuous Bag-of-Words and Skip-gram models), GloVe, and fastText (the one I will be using for my project). I will also research Docker Compose, Google BERT, and other state-of-the-art technologies in this rapidly advancing field.

Thanks for reading! Stay safe. Until next time!

3 Replies to “Week 2: Creating the Architecture for My Project”

  1. Shang Z. says:

    Hi Ethan, the model that you are testing right now is so interesting, and you explained it so well! Excited to see how you will transition from fruit classification to your humor detection model. I have really enjoyed learning from your posts.

    1. Ethan H. says:

      Thanks Tina! Hopefully I'll be able to adapt the framework I built for fruit classification into a humor detection model, but it will no doubt be a long and arduous process. Hope you stay tuned for updates! I'll include some code in my next post so you can play around with my fruit classification model if you want to 🙂
