Hi everyone and welcome back to another blog! For this week, I continued making linear regression models for my internship while also learning about smoothing, matrices, and cross-validation in my data science course.
For the past two weeks, I have been doing exploratory analysis of cell metabolism using a skin database of different bacterial species and their various measurements. Things have been going pretty well, and I continue to communicate with the professor and ask questions. This afternoon, I will be meeting with Mr. Swain again to discuss my findings and get my next tasks.
For the machine learning course, I am nearly done with all the sections and have learned many interesting concepts. For instance, smoothing is a very powerful technique used all across data analysis. It is designed to detect trends in noisy data when the shape of the trend is unknown. The general idea of bin smoothing is to group data points into strata (small windows) in which the value of f(x) can be assumed to be constant, since f(x) is expected to change slowly and so barely moves within a small window. A limitation of this method is that the windows have to be small for the constant assumption to hold, so each window contains only a few points, which may lead to imprecise estimates of f(x). Another smoothing method, local weighted regression (loess), assumes f(x) is locally linear rather than constant, which lets it use larger windows and results in a smoother fit.
Besides smoothing, matrix notation and linear algebra are key elements when describing machine learning techniques. R functions such as as.matrix(), rowSums(), colSds(), sweep(), crossprod(), and qr(), along with logical operations and indexing, allow the three main types of objects (scalars, vectors, and matrices) to be converted from one to the other, binarized, averaged, multiplied, or reorganized.
Another essential machine learning concept is cross-validation, a way to measure the predictive performance of a statistical model. Variations of cross-validation include leave-k-out, in which k observations are held out as validation data and the remaining observations are used to train the model, and k-fold, in which the original sample is partitioned into k subparts or folds, and in each iteration one fold is used as validation data while the remaining folds are used as training data. The diagram below demonstrates k-fold cross-validation:
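To make the k-fold idea concrete in code as well, here is a minimal sketch in plain Python (the course itself works in R, so this is only an illustration and not the course's code). The function names `k_fold_indices` and `k_fold_splits` and the `seed` parameter are my own hypothetical choices, and I am assuming the data can simply be referred to by row index.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # seed is an assumption, used for reproducibility
    # Fold i takes every k-th shuffled index, starting at position i.
    return [idx[i::k] for i in range(k)]


def k_fold_splits(n, k, seed=0):
    """Yield (train, validation) index lists, one pair per fold."""
    folds = k_fold_indices(n, k, seed)
    for i, validation in enumerate(folds):
        # Training data is every fold except the one held out for validation.
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, validation


# Usage: 10 observations split into 5 folds.
for train, validation in k_fold_splits(10, 5):
    print("train:", sorted(train), "validation:", sorted(validation))
```

Each observation ends up in the validation set exactly once, and in practice a model would be fit on `train` and scored on `validation` in every iteration, with the scores averaged at the end.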
For next week, the plan is to finish the machine learning course and hopefully get my certificate of completion. Additionally, I will be starting on a new task for my internship so I will update everyone on my work. That’s it for this week so I’ll see you all later! Thanks for tuning in!