Week 4 & 5: Productivity Tools, Wrangling, Linear Regression and Machine Learning

Apr 09, 2021

Hello and welcome back to my blog! I hope everyone had a relaxing spring break last week! It’s been two weeks since you’ve last heard from me so let’s discuss the progress I’ve made recently.

Last week, I finished course 5 on Productivity Tools and course 6 on Wrangling. Productivity Tools introduced Git and GitHub, showed how to connect Git to RStudio, and covered the basics of working with Unix. Git is a version control system that tracks changes and coordinates the editing of code, while GitHub is a hosting service for code, often used as a platform to share code and collaborate with other programmers. Git is most effectively used from the Unix command line, but it can also interface with RStudio. On the Unix side, I mainly learned the basics, such as how to refer to files, directories, and executables, and how to use commands and arguments to modify code or move files. For Wrangling, I was introduced to data import, tidy data, string processing, and text mining. All of these topics help reshape or extract data into more uniform, organized tables using functions such as gather(), spread(), unite(), inner_join(), left_join(), bind_rows(), intersect(), setequal(), html_nodes(), str_detect(), parse_number(), and regular expressions (regex).
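To make a few of those wrangling functions concrete, here's a minimal sketch using a made-up wide table (the data are invented for illustration, not from the course):

```r
library(tidyverse)

# Made-up wide table: one column per year (hypothetical data)
wide <- tibble(country = c("A", "B"),
               `2019` = c(10, 20),
               `2020` = c(12, 25))

# gather() reshapes the year columns into tidy key-value rows
tidy <- wide %>% gather(year, cases, `2019`:`2020`)

# parse_number() pulls the numeric part out of a messy string
parse_number("12,345 cases")

# str_detect() tests strings against a regex
str_detect(c("12 cases", "none"), "\\d+")
```

The tidy version has one row per country-year pair, which is the shape most ggplot2 and dplyr operations expect.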

Above: what the Git environment looks like, along with some Unix commands


This week, I finished course 7 on Linear Regression and am currently working through course 8 on Machine Learning. Linear Regression covers correlation and confounding variables, stratification and variance, and graphing linear models. After understanding the correlation coefficient as an informative summary of how two variables move together, which lets us predict one variable from the other, I learned about the residual sum of squares (RSS), which measures the distance between the true values and the predicted values given by a regression line, and the least-squares estimates (LSE), the coefficient values that minimize the RSS. With functions like lm(), predict(), as_tibble(), do(), and glance(), I was able to practice developing various graphs, as seen below.
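As a rough sketch of how these pieces fit together, lm() finds the least-squares estimates and predict() gives the fitted values whose squared residuals sum to the RSS. The simulated father/son heights below are stand-ins for the course's real data:

```r
library(tidyverse)

set.seed(1)
# Simulated father/son heights (hypothetical, for illustration only)
dat <- tibble(father = rnorm(50, 69, 2)) %>%
  mutate(son = 0.5 * father + 35 + rnorm(50, 0, 1.5))

# lm() computes the least-squares estimates that minimize the RSS
fit <- lm(son ~ father, data = dat)
coef(fit)   # intercept and slope

# predict() returns the fitted values; their residuals give the RSS
rss <- sum((dat$son - predict(fit))^2)
```

By definition of the LSE, no other intercept/slope pair would produce a smaller rss on this data.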

Currently, I’ve just started the topics on machine learning. Machine learning prediction tasks can be divided by outcome type: tasks with categorical outcomes are called classification, and tasks with continuous outcomes are called prediction. Categorical data falls into distinct groups, such as gender, while continuous data exists on a spectrum, such as height. To mimic the evaluation process, we randomly split our data into two groups: a training set and a test set. We use the known outcomes in the training set to build an algorithm, then apply that algorithm to the test set to predict outcomes we treat as unknown (for testing purposes, we pretend we don’t know the outcomes so we can compare the predictions against the answers and measure the algorithm’s accuracy). Improvements can be made to the overall accuracy of the algorithm by analyzing sensitivity, the proportion of actual positive outcomes correctly identified as such, and specificity, the proportion of actual negative outcomes correctly identified as such. For optimization purposes, it is sometimes useful to have a one-number summary, such as a harmonic average (the F1 score), instead of studying sensitivity and specificity separately. Receiver operating characteristic (ROC) curves, which plot the true positive rate (TPR) against the false positive rate (FPR), and precision-recall plots are effective ways to evaluate an algorithm’s accuracy across cutoffs.
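Here's a small sketch of these ideas in base R, using invented outcomes and invented confusion-matrix counts rather than real data:

```r
set.seed(2)
# Hypothetical outcomes: 1 = positive, 0 = negative
y <- rbinom(200, 1, 0.5)

# Randomly split the indexes into a training set and a test set
test_index <- sample(seq_along(y), 100)
train_y <- y[-test_index]
test_y  <- y[test_index]

# Invented confusion-matrix counts for some algorithm's test-set predictions
tp <- 40; fn <- 10   # actual positives
tn <- 80; fp <- 20   # actual negatives

sensitivity <- tp / (tp + fn)   # true positive rate
specificity <- tn / (tn + fp)   # true negative rate
precision   <- tp / (tp + fp)

# One-number summary: harmonic mean of precision and sensitivity (F1)
f1 <- 2 / (1 / precision + 1 / sensitivity)
```

With these counts, sensitivity and specificity both come out to 0.8, while the F1 score lands a bit lower because precision drags it down.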

Besides my progress in studying R, I was able to use some of the techniques I learned to create linear models for Mr. Swain’s bacterial species database. Although my graphs are definitely rudimentary at best, it was very interesting to manipulate and organize real biology lab data. Additionally, I met with Mr. Swain and an undergraduate student, Sanjay, yesterday to discuss my progress and other specifics of the database. My graphing skills still need some practice, but I’m sure that with their guidance, I will be able to improve the quality and accuracy of my work. I will be meeting with Mr. Swain and Sanjay again in two weeks to present the results of my exploratory analysis.

For next week, I will continue making linear models on my skin cell database, as well as studying more R in the data science course. I hope to be nearly finished with the last course on machine learning, but because of the immense amount of information in this important unit, it might extend into the week after. Nevertheless, I will try my best to improve my coding skills and collaborate more with my professor. Thanks for tuning in and see you next week!

2 Replies to “Week 4 & 5: Productivity Tools, Wrangling, Linear Regression and Machine Learning”

  1. Jiaming Z. says:

    Wow that was a lot happening! These R functions seems very interesting and huge respect to your effort in machine learning and data science. Hope you good luck on finishing up your R course and your project!

  2. Peter L. says:

    That’s a lot of colorful graphs! Congrats on learning the Git commands too! At first they can be a handful but as time goes I’m sure they’ll be EZPZ 🙂 Though I don’t know what exactly do these graphs portray and what they mean, I look forward to what you will do with them! Keep up the impressive work!
