A Review of Variable Selection Methods in Partial Least Squares Regression

This lab uses a case study in chemical science to highlight key concepts in regression modeling. There are several questions embedded throughout the lab that assume knowledge of the caret package, principal components analysis (PCA), and multiple linear regression. The lab will build knowledge and intuition in data exploration and pre-processing, variable selection, principal components regression, partial least squares regression, and LASSO regression.

Case Study - Drug Solubility

Chemical compounds, such as drugs, can be represented through chemical formulas describing their atomic components. For example, the popular drug aspirin contains nine carbon, eight hydrogen, and four oxygen atoms held together by various chemical bonds. Using the chemical configuration, certain additional measures, such as molecular weight, surface area, or polarity, can be derived. However, there are many useful attributes that cannot be derived analytically, and their measurement typically requires careful laboratory experiments.

One such attribute (that cannot be determined analytically) is a compound's solubility, or how it behaves in a liquid solution. Solubility is important because it determines whether a drug can be given orally or needs to be administered via injection. The case study in this lab will explore modeling the solubility of chemical compounds using analytically derived variables.

Our case study is based upon research done by Tetko et al. (2001) and Huuskonen (2000), and borrows from the text Applied Predictive Modeling by Max Kuhn and Kjell Johnson. The goal is to model the solubility of a chemical compound using descriptors of its chemical structure. The idea is to use the model to screen compounds, some of which might not yet exist, for desirable properties. In general, this type of study investigates the quantitative structure-activity relationship (QSAR), and QSAR modeling is becoming increasingly prevalent in many areas of chemistry.

Data Exploration and Pre-processing

The first step in most model building applications is to reserve a portion of the data for the test set. Recall that we use the test set only at the very end of the model building process; it allows us to get an unbiased assessment of how successful the model will be on new data. This is necessary because everything we do (ranging from data visualization to variable selection to model choice) using the training data introduces the possibility of overfitting.

    library(caret)
    #install.packages("AppliedPredictiveModeling")
    library(AppliedPredictiveModeling)

    data(solubility)
    X <- rbind(solTestX, solTrainX)
    y <- c(solTestY, solTrainY)
    data <- data.frame(Solubility = y, X)

    set.seed(1234)
    test.id <- sample(1:nrow(X), size = .2*nrow(X))  ## Randomly choose 20% of rows for test set

    test.data <- data[test.id,]    ## Subset to include rows designated to test set
    train.data <- data[-test.id,]  ## Exclude rows designated to test set

After partitioning the data we can use the training data to begin to understand the predictor variables and how they relate to the outcome variable "Solubility".

    head(train.data)

These data contain:
- 208 binary variables that indicate the presence/absence of a specific chemical substructure
- 16 count variables that indicate the number of bonds or atoms of a specific type
- 4 continuous variables indicating analytically derived characteristics like molecular weight or surface area
- The outcome, solubility, which is measured on the log-scale
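
As a quick check on this structure, the following sketch counts the variable types (this assumes the binary substructure columns share the "FP" name prefix used by the AppliedPredictiveModeling package; it is not part of the original lab):

    ## Sketch: count variable types, assuming the binary substructure columns are named FP001-FP208
    fp.cols <- grepl("^FP", names(train.data))   ## TRUE for the binary fingerprint columns
    sum(fp.cols)                                 ## should equal 208
    names(train.data)[!fp.cols]                  ## Solubility plus the 20 analytic variables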

Because the overall number of predictors is moderate-to-high, we should be interested in variable selection and/or dimension reduction. The first thing we should do is check for highly correlated predictors. Correlation is questionably defined for binary variables, so we will focus on the 20 count/continuous variables:

    ## Only show the 20 analytic vars
    #install.packages("corrplot")
    library(corrplot)  ## A library for visualizing correlations
    cor.matrix <- cor(train.data[,210:ncol(train.data)])  ## Find pairwise correlation matrix
    corrplot(cor.matrix, type = "upper", method = "square")  ## Plot the corr matrix

Question 1: Based upon the correlation matrix plot, provide an argument for reducing the dimension of these 20 variables. Then use PCA and the fviz_screeplot function to find a lower dimensional set of variables that retains at least 90% of the variation in these 20 variables. Be sure to use the argument scale = TRUE. How many components will you use?

Question 2: Try to interpret the first two components using fviz_contrib (with the argument choice = "var") to help identify the most important contributing variables.

Continuing our analysis, we'll use eight variables derived from PCA as a lower dimensional representation of the original 20 variables. The scores for each component for each observation are stored in x within the object created by prcomp:

    P <- prcomp(train.data[,210:ncol(train.data)], scale = TRUE)
    #head(P$x)

For clarity, we'll set up a new training data object using only these 8 derived variables (and the outcome variable Solubility):

    train.data2 <- data.frame(Solubility = train.data$Solubility, P$x[,1:8])

Relationships with the outcome

PCA (by design) ignores the outcome variable; consequently, there is no guarantee that each (or any) component is predictive of solubility. We should explore this graphically:

    library(tidyr)
    ## Tidy these data for plotting with ggplot
    plot.df <- gather(train.data2, key = "Component", value = "X", 2:9)

    ggplot(plot.df, aes(x = X, y = Solubility)) + geom_point() +
      geom_smooth(method = "loess", se = FALSE, span = 2) +  ## span alters the smoothness
      facet_wrap( ~ Component, scales = "free")

Interpreting this plot, we see that some of the components appear to be associated with solubility, while others don't seem to be very predictive. We also see that the fourth component might have a non-linear relationship with solubility. Typically, real non-linear relationships occur for an identifiable reason, so we should try to interpret this component:

    library(factoextra)  ## provides fviz_contrib (and fviz_screeplot)
    fviz_contrib(P, choice = "var", axes = 4)

            P$rotation[,4]          
    ##         MolWeight          NumAtoms      NumNonHAtoms          NumBonds 
    ##      0.2584510545      0.0315405601      0.0623818844      0.0146473635 
    ##      NumNonHBonds      NumMultBonds       NumRotBonds       NumDblBonds 
    ##      0.0266645116     -0.3458979451      0.1256610469      0.1420161837 
    ##  NumAromaticBonds       NumHydrogen         NumCarbon       NumNitrogen 
    ##     -0.3703608707     -0.0002358333     -0.0729175467     -0.2157567694 
    ##         NumOxygen         NumSulfer       NumChlorine        NumHalogen 
    ##      0.1394703466      0.0019996931      0.5127105621      0.5229855536 
    ##          NumRings HydrophilicFactor      SurfaceArea1      SurfaceArea2 
    ##     -0.1569204616      0.0404384449      0.0135100957      0.0216224609

A compound will have a high score in this component if it has a lot of halogen/chlorine atoms, but not many "Mult" or "Aromatic" bonds. Similarly, it will have a low score in this component if it has few halogen/chlorine atoms and a lot of "Mult" or "Aromatic" bonds. Both high and low scores seem to correspond with low solubility.

If you were actually working on this model you might want to consult with a chemist regarding this interpretation, since it implies a quadratic effect should be used when including this component in the model. All of the other components' relationships with solubility appear to be roughly linear (we shouldn't worry too much about the points near the edges of each plot that are clearly curving the loess trendline).
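
To see the suspected quadratic pattern more directly, a quick illustrative sketch (not part of the original lab) overlays a quadratic trend on the plot of PC4 against solubility:

    ## Illustrative sketch: quadratic fit for the PC4 vs. Solubility relationship
    library(ggplot2)
    ggplot(train.data2, aes(x = PC4, y = Solubility)) +
      geom_point() +
      geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)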

Model Building

An advantage of using caret for model building is the common formula syntax that is compatible with many types of models. In this syntax, the outcome variable is specified first and is separated from predictor variables by the ~ character. Next, predictor variables (or functions of them!) are specified, separated by + characters.
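
For example, a minimal sketch of this syntax (the predictor choices here are arbitrary and for illustration only):

    ## Illustration only: outcome on the left of ~, predictors (or functions of them) on the right
    example.formula <- formula(Solubility ~ MolWeight + poly(NumCarbon, degree = 2))
    example.formula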

    ## We will evaluate models using repeated cross validation
    fit.control = trainControl(method = "repeatedcv", number = 5, repeats = 10)

    ## Find original vars most correlated with solubility
    cors <- cor(train.data[,c(1,210:ncol(train.data))])
    cors[,1]
    ##        Solubility         MolWeight          NumAtoms      NumNonHAtoms 
    ##       1.000000000      -0.628474442      -0.385787758      -0.542849815 
    ##          NumBonds      NumNonHBonds      NumMultBonds       NumRotBonds 
    ##      -0.410300151      -0.554188802      -0.530108708      -0.118688770 
    ##       NumDblBonds  NumAromaticBonds       NumHydrogen         NumCarbon 
    ##       0.009796097      -0.518914174      -0.187763520      -0.579139809 
    ##       NumNitrogen         NumOxygen         NumSulfer       NumChlorine 
    ##       0.143343548       0.116052266      -0.084080893      -0.503666138 
    ##        NumHalogen          NumRings HydrophilicFactor      SurfaceArea1 
    ##      -0.492114144      -0.513914241       0.301403694       0.196919572 
    ##      SurfaceArea2 
    ##       0.148005281
    ## Specify a model using the two most correlated vars
    model1 <- formula(Solubility ~ MolWeight + NumCarbon)

    ## Fit model1
    set.seed(15)
    fit.lm1 <- train(model1, data = train.data, method = "lm", trControl = fit.control)

    ## Model using the first two principal components
    model2 <- formula(Solubility ~ PC1 + PC2)

    set.seed(15)
    fit.lm2 <- train(model2, data = train.data2, method = "lm", trControl = fit.control)

    ## Compare these two models
    library(tidyr)
    resamps <- resamples(list(lm1 = fit.lm1, lm2 = fit.lm2))
    resamps2 <- gather(resamps$values, key = "Var", value = "Val", 2:ncol(resamps$values))
    resamps2 <- separate(resamps2, col = "Var", into = c("Model", "Metric"), sep = "~")

    ggplot(resamps2, aes(x = Model, y = Val)) + geom_boxplot() + facet_wrap(~ Metric, scales = "free")

Question 3: Based upon the results above, which model is superior: the linear regression model that uses the two variables most correlated with Solubility, or the linear regression model that uses the first two principal components?

Neither of these models is likely to be optimal. The general workflow we should use is to add a predictor/group of predictors (or change the form of an existing predictor), then use caret to determine if the change improves out-of-sample prediction error. The idea is to slowly increase the complexity of the model until we strike an ideal balance between performance and interpretability for our goals.

    ## Some additional models:
    model3 <- formula(Solubility ~ PC1 + PC2 + PC3)
    model4 <- formula(Solubility ~ PC1 + PC2 + PC3 + PC4)
    ## The poly function can be used to add non-linear effects
    model4q <- formula(Solubility ~ PC1 + PC2 + PC3 + poly(PC4, degree = 2))

    set.seed(15)
    fit.lm3 <- train(model3, data = train.data2, method = "lm", trControl = fit.control)
    set.seed(15)
    fit.lm4 <- train(model4, data = train.data2, method = "lm", trControl = fit.control)
    set.seed(15)
    fit.lm4q <- train(model4q, data = train.data2, method = "lm", trControl = fit.control)

    ## Compare these new models
    resamps <- resamples(list(lm2 = fit.lm2, lm3 = fit.lm3, lm4 = fit.lm4, lm4q = fit.lm4q))
    resamps2 <- gather(resamps$values, key = "Var", value = "Val", 2:ncol(resamps$values))
    resamps2 <- separate(resamps2, col = "Var", into = c("Model", "Metric"), sep = "~")

    ggplot(resamps2, aes(x = Model, y = Val)) + geom_boxplot() + facet_wrap(~ Metric, scales = "free")

Question 4: Based upon the modeling results displayed above, which model offers the best performance? Also, is there an advantage in using a quadratic effect for the fourth principal component?

We shouldn't stop at four principal components; we'll check a few more:

    ## Let's try a few more models
    model5 <- formula(Solubility ~ PC1 + PC2 + PC3 + PC4 + PC5)
    model6 <- formula(Solubility ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6)
    model7 <- formula(Solubility ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7)
    model8 <- formula(Solubility ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8)

    set.seed(15)
    fit.lm5 <- train(model5, data = train.data2, method = "lm", trControl = fit.control)
    set.seed(15)
    fit.lm6 <- train(model6, data = train.data2, method = "lm", trControl = fit.control)
    set.seed(15)
    fit.lm7 <- train(model7, data = train.data2, method = "lm", trControl = fit.control)
    set.seed(15)
    fit.lm8 <- train(model8, data = train.data2, method = "lm", trControl = fit.control)

    ## Compare these new models
    resamps <- resamples(list(lm2 = fit.lm2, lm3 = fit.lm3, lm4 = fit.lm4, lm5 = fit.lm5,
                              lm6 = fit.lm6, lm7 = fit.lm7, lm8 = fit.lm8))
    resamps2 <- gather(resamps$values, key = "Var", value = "Val", 2:ncol(resamps$values))
    resamps2 <- separate(resamps2, col = "Var", into = c("Model", "Metric"), sep = "~")

    ggplot(resamps2, aes(x = Model, y = Val)) + geom_boxplot() + facet_wrap(~ Metric, scales = "free")

This suggests the model using the first 5 principal components is the most sensible choice. Let's compare this with the models suggested by the best subsets and stepwise selection algorithms. We'll explore applying these algorithms to both the original data and the new variables derived using principal components.

    library(CAST)
    ## Note: best subsets takes a lot of time to run
    best_subs <- bss(train.data2[,-1], train.data2$Solubility, method = "lm",
                     trControl = fit.control, verbose = FALSE, seed = 15)

    best_subs$perf_all[order(best_subs$perf_all$RMSE)[1:6],]
    ##     var1 var2 var3 var4 var5 var6 var7 var8     RMSE          SE nvar
    ## 160  PC1  PC2  PC3  PC4  PC5  PC7 <NA> <NA> 1.047948 0.006644526    6
    ## 33   PC1  PC2  PC3  PC4  PC5  PC7  PC8 <NA> 1.048886 0.006639629    7
    ## 128  PC1  PC2  PC3  PC4  PC5  PC6  PC7 <NA> 1.049174 0.006650362    7
    ## 1    PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 1.050149 0.006642596    8
    ## 222  PC1  PC2  PC3  PC4  PC5 <NA> <NA> <NA> 1.051467 0.007040616    5
    ## 97   PC1  PC2  PC3  PC4  PC5  PC8 <NA> <NA> 1.052361 0.007034444    6
    ## How to evaluate recursive feature elimination
    ctrl <- rfeControl(functions = lmFuncs, method = "repeatedcv", number = 5, repeats = 10, verbose = FALSE)

    ## Do the backward elimination
    backward_subs <- rfe(train.data2[,-1], train.data2$Solubility, rfeControl = ctrl, seed = 15, sizes = 2:8)
    backward_subs
    ## 
    ## Recursive feature selection
    ## 
    ## Outer resampling method: Cross-Validated (5 fold, repeated 10 times) 
    ## 
    ## Resampling performance over subset size:
    ## 
    ##  Variables  RMSE Rsquared    MAE  RMSESD RsquaredSD   MAESD Selected
    ##          2 1.355   0.5708 1.0741 0.05595    0.05061 0.04460         
    ##          3 1.299   0.6047 1.0357 0.04497    0.04313 0.04251         
    ##          4 1.059   0.7375 0.8198 0.05285    0.03097 0.04147         
    ##          5 1.056   0.7391 0.8230 0.05293    0.03002 0.04176         
    ##          6 1.047   0.7435 0.8182 0.05243    0.02970 0.04036        *
    ##          7 1.049   0.7427 0.8195 0.05202    0.02957 0.04027         
    ##          8 1.050   0.7423 0.8202 0.05166    0.02940 0.03997         
    ## 
    ## The top 5 variables (out of 6):
    ##    PC2, PC5, PC4, PC1, PC7
            backward_subs$fit          
    ## 
    ## Call:
    ## lm(formula = y ~ ., data = tmp)
    ## 
    ## Coefficients:
    ## (Intercept)          PC2          PC5          PC4          PC1  
    ##     -2.7607      -0.7910       0.3798      -0.2891      -0.2532  
    ##         PC7          PC3  
    ##     -0.1298      -0.0939

Best subsets and recursive feature elimination both favor the same model, one with PC1, PC2, PC3, PC4, PC5, and PC7. These results are pretty similar to what we saw selecting variables manually.
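
To put that selected model on the same footing as the earlier ones, a sketch along the lines below (using the same caret workflow as above; the names model.bss and fit.bss are new and introduced only for illustration) fits it so that it could be added to a resamples() comparison:

    ## Sketch: fit the model favored by best subsets / RFE for comparison via resamples()
    model.bss <- formula(Solubility ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC7)
    set.seed(15)
    fit.bss <- train(model.bss, data = train.data2, method = "lm", trControl = fit.control)
    fit.bss$results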

We might also consider applying these variable selection algorithms to the original variables (rather than the derived principal components); however, this causes some problems:

    ## This produces many warning messages!
    backward_subs_original <- rfe(train.data[,210:ncol(train.data)], train.data$Solubility, rfeControl = ctrl, seed = 15)

Several of the original variables are linear combinations of the others. Principal components naturally accommodates this, but for regression it is problematic and results in untrustworthy models. We can use the findLinearCombos function in the caret package to help us out:

    lin.coms <- findLinearCombos(train.data[,210:ncol(train.data)])
    lin.coms$linearCombos
    ## [[1]]
    ## [1] 5 2 3 4
    ## 
    ## [[2]]
    ## [1] 10  2  3

Very clearly, variables 2 through 5 are not linearly independent, which makes sense given that the total number of atoms is a function of the number of nitrogen, carbon, etc. We should remove variables so that all columns of the data matrix are linearly independent. By further inspecting the variables, it seems like "NumBonds", "NumNonHBonds", "NumAtoms", and "NumNonHAtoms" are likely to be functions of other variables and should be removed:

    to.remove <- which(names(train.data[,210:ncol(train.data)]) %in% c("NumAtoms", "NumBonds", "NumNonHAtoms", "NumNonHBonds"))
    new.train.X <- train.data[,210:ncol(train.data)]
    new.train.X <- new.train.X[,-to.remove]
    findLinearCombos(new.train.X)  ## Check if any linear combos remain
    ## $linearCombos
    ## list()
    ## 
    ## $remove
    ## NULL

Now we should be able to use recursive feature elimination without issues:

    backward_subs_original <- rfe(new.train.X, train.data$Solubility, rfeControl = ctrl, seed = 15, sizes = 2:16)
    backward_subs_original
    ## 
    ## Recursive feature selection
    ## 
    ## Outer resampling method: Cross-Validated (5 fold, repeated 10 times) 
    ## 
    ## Resampling performance over subset size:
    ## 
    ##  Variables   RMSE Rsquared    MAE  RMSESD RsquaredSD   MAESD Selected
    ##          2 1.7525   0.2829 1.3876 0.07317    0.04920 0.05764         
    ##          3 1.7494   0.2850 1.3830 0.07487    0.05192 0.05998         
    ##          4 1.6692   0.3490 1.3211 0.06877    0.05184 0.05948         
    ##          5 1.5957   0.4074 1.2548 0.08815    0.05675 0.06192         
    ##          6 1.5751   0.4219 1.2323 0.09565    0.06387 0.07759         
    ##          7 1.4887   0.4829 1.1458 0.08471    0.05277 0.06329         
    ##          8 1.3093   0.5989 1.0090 0.08368    0.04638 0.06064         
    ##          9 1.0183   0.7568 0.8009 0.05508    0.03053 0.04179         
    ##         10 1.0128   0.7593 0.7956 0.05703    0.03148 0.04338         
    ##         11 0.9939   0.7684 0.7858 0.05147    0.02825 0.03911         
    ##         12 0.9852   0.7724 0.7774 0.05483    0.02900 0.04122         
    ##         13 0.9771   0.7760 0.7722 0.05425    0.02924 0.04112         
    ##         14 0.9679   0.7803 0.7670 0.05215    0.02860 0.03850         
    ##         15 0.9578   0.7847 0.7549 0.05232    0.02864 0.03864         
    ##         16 0.9390   0.7930 0.7329 0.05550    0.03011 0.04034        *
    ## 
    ## The top 5 variables (out of 16):
    ##    NumSulfer, NumMultBonds, NumAromaticBonds, NumNitrogen, NumOxygen

Question 5: Of all the models considered so far, which do you prefer? Create a plot that shows the top 3 models side-by-side.

Partial Least Squares (PLS)

Notice how the cross-validated RMSE is substantially lower for many of these models than it was when we did principal components regression. This is not necessarily unusual: principal components analysis is an unsupervised learning approach, so it is effective for dimension reduction, but there is no guarantee that any of the new dimensions will be associated with the response.

PCA blindly chases variability with no consideration of the outcome variable, so you might expect that a more targeted approach which considers the outcome variability will be more effective for supervised learning problems. One such method is Partial Least Squares (PLS).

PLS is similar to PCA in the sense that it constructs new derived dimensions in \(X\), but it operates such that the covariance between \(X\) and \(Y\) is maximized for each newly derived dimension. Thus, the first component of PLS is a linear combination of the original variables that is most strongly related with \(Y\).
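
Stated more formally (this is the standard PLS criterion, included here for reference rather than taken from the original lab), the first PLS direction \(w_1\) solves

\[ w_1 = \underset{\|w\| = 1}{\arg\max}\; \operatorname{Cov}(Xw, \, y), \]

and each subsequent direction maximizes the same covariance criterion after the variation explained by the earlier components has been removed from \(X\), so the derived components remain uncorrelated with one another.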

    set.seed(15)
    pls.fit <- train(Solubility ~ ., data = train.data[,c(1,210:ncol(train.data))], method = "pls",
                     trControl = fit.control, tuneGrid = data.frame(ncomp = 1:20))
    pls.fit
    ## Partial Least Squares 
    ## 
    ## 1014 samples
    ##   20 predictor
    ## 
    ## No pre-processing
    ## Resampling: Cross-Validated (5 fold, repeated 10 times) 
    ## Summary of sample sizes: 812, 811, 812, 810, 811, 813, ... 
    ## Resampling results across tuning parameters:
    ## 
    ##   ncomp  RMSE          Rsquared   MAE         
    ##    1     1.550330e+00  0.4426925  1.170633e+00
    ##    2     1.191208e+00  0.6691320  9.352446e-01
    ##    3     1.101970e+00  0.7164346  8.532216e-01
    ##    4     1.088074e+00  0.7231585  8.409514e-01
    ##    5     1.066375e+00  0.7339962  8.245391e-01
    ##    6     1.033949e+00  0.7500299  7.974753e-01
    ##    7     1.008583e+00  0.7621118  7.795658e-01
    ##    8     9.712156e-01  0.7794245  7.582987e-01
    ##    9     9.644149e-01  0.7825186  7.506145e-01
    ##   10     9.558243e-01  0.7865601  7.450759e-01
    ##   11     9.526745e-01  0.7878583  7.452947e-01
    ##   12     9.503084e-01  0.7889196  7.458654e-01
    ##   13     9.478888e-01  0.7901273  7.441893e-01
    ##   14     9.409136e-01  0.7931063  7.375683e-01
    ##   15     9.394829e-01  0.7937249  7.348799e-01
    ##   16     9.367423e-01  0.7949228  7.316322e-01
    ##   17     9.364424e-01  0.7950754  7.318705e-01
    ##   18     2.548877e+13  0.4291598  2.074847e+13
    ##   19     6.145183e+13  0.2049947  4.805330e+13
    ##   20     1.246062e+14  0.2095853  9.702521e+13
    ## 
    ## RMSE was used to select the optimal model using the smallest value.
    ## The final value used for the model was ncomp = 17.
            ggplot(pls.fit)          

Question 6: Thinking about the 20 original variables we used as predictors in PLS, why does the RMSE skyrocket when more than 17 components are used? (Hint: think about the challenges we had applying variable selection algorithms to these 20 variables in the last section.)

We can use the varImp function to gauge which variables are most influential in the "best" PLS model (best is determined by the lowest RMSE):

            varImp(pls.fit)          
    ## pls variable importance
    ## 
    ##                   Overall
    ## NumMultBonds      100.000
    ## MolWeight          97.570
    ## NumChlorine        86.748
    ## NumOxygen          81.495
    ## NumAromaticBonds   69.860
    ## NumCarbon          68.246
    ## SurfaceArea1       65.018
    ## SurfaceArea2       61.521
    ## NumNonHAtoms       50.229
    ## NumHalogen         47.978
    ## HydrophilicFactor  37.829
    ## NumRotBonds        36.242
    ## NumNitrogen        32.875
    ## NumSulfer          32.675
    ## NumDblBonds        32.321
    ## NumNonHBonds       28.199
    ## NumHydrogen        20.422
    ## NumRings            5.814
    ## NumBonds            1.680
    ## NumAtoms            0.000
    #pls.fit$finalModel$loadings  # See the loadings for each component

Question 7: Compare the PLS models (contained in pls.fit) with the previous best models (which you plotted in Question 5). Update your plot (created in Question 5) to include one of these PLS models.

Least Absolute Shrinkage and Selection Operator (LASSO)

So far we've neglected the 208 binary structure variables, which could potentially improve our predictions. There are too many of these variables to expect good results from forward/backward variable selection, and PCA/PLS aren't designed for binary variables (though there is some debate on whether they could be applied). An efficient and powerful alternative method, the LASSO, is well-suited for this situation.
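
For reference (this is the standard LASSO criterion, not something shown in the original lab), the LASSO chooses coefficients by minimizing the residual sum of squares plus an \(\ell_1\) penalty,

\[ \hat{\beta}^{\text{lasso}} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}, \]

where the tuning parameter \(\lambda\) controls the amount of shrinkage; large enough values force some coefficients exactly to zero, so the LASSO performs variable selection and estimation simultaneously.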

While the LASSO is implemented in caret, the ncvreg package offers a better implementation of cross-validation:

    library(ncvreg)
    set.seed(15)
    cv.fit <- cv.ncvreg(X = train.data[,-1], y = train.data$Solubility, penalty = "lasso")
    summary(cv.fit)
    ## lasso-penalized linear regression with n=1014, p=228
    ## At minimum cross-validation error (lambda=0.0069):
    ## -------------------------------------------------
    ##   Nonzero coefficients: 133
    ##   Cross-validation error (deviance): 0.47
    ##   R-squared: 0.89
    ##   Signal-to-noise ratio: 8.03
    ##   Scale estimate (sigma): 0.687
            plot(cv.fit)          

We can see that the cross-validated \(R^2\) goes up by almost 10% using this model (which incorporates the 208 binary structure features). The cross-validation error reported by ncvreg is not square-rooted, so for this LASSO model \(RMSE = \sqrt{0.47} = 0.686\).
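
A small sketch of that calculation (assuming the cross-validation errors for each lambda are stored in cv.fit$cve, as documented for cv.ncvreg):

    ## Sketch: put the minimum cross-validation error on the RMSE scale
    ## (assumes cv.ncvreg stores the CV error for each lambda in $cve)
    sqrt(min(cv.fit$cve))   ## roughly sqrt(0.47) = 0.686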

In any LASSO model only a subset of the available predictors are active (make a non-zero contribution to the model). We can see the set of selected features (and their regression coefficients) using the commands below:

    ## Within cv.fit is the fitted lasso model (fit)
    ## Within the fitted model is beta, a matrix of regression coefficients for each lambda
    ## We want only the column of beta corresponding to the lambda that minimizes CV RMSE
    all_coefs <- cv.fit$fit$beta[,cv.fit$fit$lambda == cv.fit$lambda.min]
    all_coefs[all_coefs != 0]

Question 8: (THIS QUESTION WILL NOT BE GRADED, IT TAKES TOO LONG FOR ffs TO RUN) Use forward feature selection (implemented in the ffs function, see the slides for an example) on the data (including the 208 binary structure features) using method = "lm". How does this model compare to the LASSO model? (Compare their predictive performance and selected variables.)

Conclusions

In this application we saw that using a reduced set of derived dimensions offered an advantage over using subsets of the original variables, but only if the construction of those dimensions took the outcome variable into consideration (partial least squares performed better than principal components regression). We then saw that the 208 binary features could aid prediction even further when the appropriate method was used to perform feature selection.

The final step in this analysis would be to apply our selected model to the test data. This would be done using the predict function, and it would be a key determinant of how successful your chosen model is.
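
For instance, a hedged sketch of this final step using the PLS fit from above (any of the fitted models could be substituted; postResample is the caret helper that reports RMSE, R-squared, and MAE):

    ## Sketch: predict solubility for the held-out test set and summarize accuracy
    test.preds <- predict(pls.fit, newdata = test.data)
    postResample(pred = test.preds, obs = test.data$Solubility)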


Source: https://remiller1450.github.io/s230s19/var_sel_lab.html

0 Response to "A Review of Variable Selection Methods in Partial Least Squares Regression"

Post a Comment

Iklan Atas Artikel

Iklan Tengah Artikel 1

Iklan Tengah Artikel 2

Iklan Bawah Artikel