Working in the machine learning field is not only about building different classification or clustering models. It is more about feeding the right set of features into those models.
Choosing the right set of features mainly takes place after the data collection process.
Once we have enough data, we can't feed all of it into the model and expect great results. We need to preprocess the data.
In fact, data preprocessing is both the most challenging and the most important part of the machine learning process.
Below are the key things we intend to do in the data preprocessing stage.
- Feature transformation
- Feature selection
Feature transformation converts existing features into other forms, for example using the logarithmic function to turn raw features into log-scaled features.
Feature selection picks the best features out of the existing ones. In this article, we are going to learn the basic techniques for picking the best features for modeling.
Before we dive further, let’s have a look at the table of contents.
Table of contents:
- Why modeling is not the final step
- The role of correlation
- Calculating feature importance with regression methods
- Using caret package to calculate feature importance
- Random forest for calculating feature importance
- Conclusion
Why Modeling is Not The Final Step
Like a coin, every project has two sides.
- Business side
- Technical side
The technical side deals with data collection, processing and then implementing it to get results. The business side is what envelops the technical side.
The business side starts by defining the requirements, hands them over to the technical team to generate results, and then takes over again to convert those results into actionable insights. This is why both teams need to know what was implemented behind the scenes in the project.
The team handling the technical part may consider models and processes to be their core deliverable, but for the business team, running a model and getting high accuracy is never the end goal of the project.
It is the understanding of the project which makes it actionable. If you build a model but don’t know what is happening inside it, it is a black box, which may be perfect for lab results but is not something that can be put into production.
While one may not be concerned with each and every detail of what is happening, one is definitely interested in what actionable insights can be derived from the model. Variable importance can help achieve this objective.
Most models have a method to generate variable importance which indicates what features are used in the model and how important they are. Variable importance also has a use in the feature selection process.
As the Occam’s razor principle states:
The simplest models are the best.
Ranking features by decreasing variable importance helps one identify and select the features which produce roughly 80% of the results, and discard the remaining variables which account for the other 20% of the accuracy.
Looking at variables (features) one by one can also help in understanding which features are important and how they contribute towards solving a business problem.
Variable importance is not difficult to derive, and the way it is derived depends on the methodology being followed. This is why it can be calculated in more than one way. It’s not rocket science.
This article describes some such ways.
Role of Correlation
If you are working with a model which assumes a linear relationship between the features and the dependent variable, correlation can help you come up with an initial list of importance. It also works as a rough list for nonlinear models.
The idea is that those features which have a high correlation with the dependent variable are strong predictors when used in a model.
Let us generate a random dataset for this article.
```r
# Use the clusterGeneration library to make a positive definite matrix of 15 features
library(clusterGeneration)
S = genPositiveDefMat("unifcorrmat", dim = 15)

# Create 15 features using a multivariate normal distribution for 5000 data points
library(mnormt)
n = 5000
X = rmnorm(n, varcov = S$Sigma)
```
Let us now create a dependent feature Y and compute a correlation table for these features.
```r
# Create a two-class dependent variable using the binomial distribution
Y = rbinom(n, size = 1, prob = 0.3)
data = data.frame(Y, X)

# Create a correlation table for Y versus all features
cor(data, data$Y)
```

```
            [,1]
Y    1.000000000
X1  -0.013270223
X2  -0.002782848
X3  -0.005647999
X4  -0.018287654
X5  -0.017303147
X6   0.006512963
X7  -0.013494603
X8  -0.008466241
X9  -0.001837453
X10  0.015101810
X11  0.018945108
X12 -0.005708211
X13 -0.009837814
X14 -0.008292952
X15 -0.009675556
```
As expected, since we are using a randomly generated dataset, Y has very little correlation with any of the features. These numbers will differ from run to run.
In this case, the correlation for X11 is the highest in absolute value. If we had to use this data for modeling, X11 would be expected to have the greatest impact on predicting Y. In this way, the list of correlations with the dependent variable gives an idea of which features affect the outcome.
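One way to turn such a correlation table into a rough importance list is to sort the features by the absolute value of their correlation with Y. The following is a minimal sketch on simulated data, not the dataset above: the feature names, the seed, and the rule that Y depends only on X3 are assumptions made so that the ranking has something to find.

```r
set.seed(42)  # fixed seed so the sketch is reproducible

# Simulated data (an assumption for illustration): 5 numeric features,
# with Y constructed to depend on X3 only
n = 1000
X = matrix(rnorm(n * 5), ncol = 5, dimnames = list(NULL, paste0("X", 1:5)))
Y = as.numeric(X[, "X3"] + rnorm(n) > 0)

# Correlation of every feature with Y, as a named vector
cors = cor(X, Y)[, 1]

# Rank features by the absolute value of their correlation with Y
ranked = sort(abs(cors), decreasing = TRUE)
ranked
```

Sorting on the absolute value matters because a strongly negative correlation is as informative for a linear model as a strongly positive one.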
When computing correlations, we assume that both the features and the dependent variable are numeric. If we treat Y as a class instead, we can also look at the distribution of each feature within every class of Y.
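When Y is treated as a class, a simple per-class summary can stand in for the correlation table. The sketch below assumes a small simulated data frame (`toy`, with made-up features X1 and X2) and compares class means using base R's aggregate().

```r
set.seed(1)  # fixed seed so the sketch is reproducible

# Simulated two-class data (an assumption for illustration):
# X1 shifts with the class, X2 does not
n = 500
Y = rbinom(n, size = 1, prob = 0.3)
X1 = rnorm(n, mean = 2 * Y)
X2 = rnorm(n)
toy = data.frame(Y = factor(Y), X1, X2)

# Mean of each feature within each class of Y
class_means = aggregate(cbind(X1, X2) ~ Y, data = toy, FUN = mean)
class_means
```

A feature whose mean (or, more generally, whose distribution) differs clearly between the classes, as X1 does here by construction, is a candidate predictor. One that looks the same in both classes, like X2, is not.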
Using Regression to Calculate Variable Importance
The summary function for a regression model also describes how each feature affects the dependent variable, using statistical significance. It compares each coefficient estimate with its standard error and marks the features whose effect is statistically significant.
Such features usually have a p-value below 0.05, which means they are significant at the 5% level (the conventional 95% confidence threshold).
Let us look at an example:
```r
# Using the mlbench library to load diabetes data
library(mlbench)
data(PimaIndiansDiabetes)
data_lm = as.data.frame(PimaIndiansDiabetes)

# Fit a logistic regression model
fit_glm = glm(diabetes ~ ., data = data_lm, family = "binomial")

# Generate summary
summary(fit_glm)
```

```
Call:
glm(formula = diabetes ~ ., family = "binomial", data = data_lm)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5566  -0.7274  -0.4159   0.7267   2.9297

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.4046964  0.7166359 -11.728  < 2e-16 ***
pregnant     0.1231823  0.0320776   3.840 0.000123 ***
glucose      0.0351637  0.0037087   9.481  < 2e-16 ***
pressure    -0.0132955  0.0052336  -2.540 0.011072 *
triceps      0.0006190  0.0068994   0.090 0.928515
insulin     -0.0011917  0.0009012  -1.322 0.186065
mass         0.0897010  0.0150876   5.945 2.76e-09 ***
pedigree     0.9451797  0.2991475   3.160 0.001580 **
age          0.0148690  0.0093348   1.593 0.111192
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 993.48  on 767  degrees of freedom
Residual deviance: 723.45  on 759  degrees of freedom
AIC: 741.45

Number of Fisher Scoring iterations: 5
```
The output of the logistic model gives us the estimates and p-values for each of the features. It also marks the important features with stars based on their p-values.
Features whose class is a factor are broken into separate coefficients, one per factor level relative to the reference level. We see that the most important variables for diabetes prediction include glucose, mass and pregnant. In this manner, regression models provide us with a list of important features.
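The starred rows can also be pulled out programmatically from the coefficient table. This is a sketch on simulated data (the names `toy`, `x1`, `x2`, `x3` and the 0.05 cutoff are assumptions for illustration), using only base R.

```r
set.seed(7)  # fixed seed so the sketch is reproducible

# Simulated data (an assumption for illustration): the outcome depends
# strongly on x1 and not at all on x2 or x3
n = 500
x1 = rnorm(n)
x2 = rnorm(n)
x3 = rnorm(n)
y = rbinom(n, size = 1, prob = plogis(1.5 * x1))
toy = data.frame(y, x1, x2, x3)

fit = glm(y ~ ., data = toy, family = "binomial")

# Coefficient table as a matrix: Estimate, Std. Error, z value, Pr(>|z|)
coefs = coef(summary(fit))

# Drop the intercept and keep the features with a p-value below 0.05
p_values = coefs[-1, "Pr(>|z|)"]
significant = names(p_values)[p_values < 0.05]
significant
```

Because each test has a 5% false-positive rate, a pure-noise feature such as x2 or x3 can occasionally clear the cutoff, so the list is best read as a starting point rather than a verdict.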
Using the Caret Package to Calculate Variable Importance
R has a caret package which includes the varImp() function to calculate feature importance for most of the model types it supports.
Let’s compare our previous model summary with the output of the varImp() function.
```r
# Using the varImp() function
library(caret)
varImp(fit_glm)
```

```
           Overall
pregnant 3.8401403
glucose  9.4813935
pressure 2.5404160
triceps  0.0897131
insulin  1.3223094
mass     5.9453340
pedigree 3.1595780
age      1.5928584
```
The varImp() output ranks glucose as the most important feature, followed by mass and pregnant. This matches the ordering by p-value in the logistic regression summary.
However, the varImp() function also works with other models such as random forests, and the importance score it generates gives an idea of the relative importance of each feature.
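Returning to the logistic regression scores: the values reported by varImp() above coincide with the absolute z values in the model summary (9.4813935 for glucose against a z value of 9.481, for example). Assuming that relationship, the same kind of score can be reproduced without caret. The sketch below uses simulated data, and the names `f1`, `f2`, `f3` are made up for illustration.

```r
set.seed(11)  # fixed seed so the sketch is reproducible

# Simulated data (an assumption for illustration): strong effect for f1,
# weak effect for f2, no effect for f3
n = 400
f1 = rnorm(n)
f2 = rnorm(n)
f3 = rnorm(n)
y = rbinom(n, size = 1, prob = plogis(2.0 * f1 + 0.4 * f2))

fit = glm(y ~ f1 + f2 + f3, family = "binomial")

# z statistics of the coefficients, without the intercept
z = coef(summary(fit))[-1, "z value"]

# Importance score: absolute value of the z statistic
importance_scores = data.frame(Overall = abs(z))

# Sort from most to least important
ranked = importance_scores[order(-importance_scores$Overall), , drop = FALSE]
ranked
```

Taking the absolute value puts features with negative coefficients (such as pressure in the diabetes model) on the same scale as those with positive ones.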
Variable Importance Through Random Forest
Random forests are based on decision trees and use bagging to build a model over the data. Random forests also have a feature importance methodology which uses the Gini index to assign a score to each feature and rank them.
Let us see an example and compare it with the varImp() function.
```r
# Import the randomForest library and fit a model
library(randomForest)
fit_rf = randomForest(diabetes ~ ., data = data_lm)

# Get importance scores based on mean decrease in Gini
importance(fit_rf)
```

```
         MeanDecreaseGini
pregnant         29.11588
glucose          91.17223
pressure         30.88188
triceps          23.91996
insulin          24.79802
mass             56.83389
pedigree         42.83993
age              47.12770
```
```r
# Compare the feature importance with the varImp() function
varImp(fit_rf)
```

```
          Overall
pregnant 29.11588
glucose  91.17223
pressure 30.88188
triceps  23.91996
insulin  24.79802
mass     56.83389
pedigree 42.83993
age      47.12770
```
We see that the importance scores from the varImp() function and from the importance() function of randomForest are exactly the same. If the model being used is a random forest, we also have a function known as varImpPlot() to plot this data.
```r
# Create a plot of importance scores by random forest
varImpPlot(fit_rf)
```
These scores, labelled ‘Mean Decrease Gini’ by the importance measure, represent how much each feature contributes to the homogeneity of the nodes in the forest. The way it works is as follows:
Each time a feature is used to split data at a node, the Gini index is calculated at that parent node and at both of its child nodes. The Gini index measures impurity: it is 0 for completely homogeneous data and grows as the classes become more mixed (for two classes the maximum is 0.5, and it approaches 1 only when there are many equally likely classes).
The difference between the Gini index of the parent node and the size-weighted Gini index of its child nodes is then calculated and credited to the feature.
A split that produces child nodes dominated by a single class is said to increase the ‘purity’ of the data, which means that the data is more easily classified. The greater the gain in purity, the greater the decrease in the Gini index.
Hence, the mean decrease in Gini index is highest for the most important feature.
Such features are useful in classifying the data and are likely to split the data into pure single-class nodes when used at a node. Hence they tend to be used early during splitting.
The overall mean decrease in Gini for each feature is thus the total decrease in node impurity from all splits on that feature, averaged over all trees in the forest.
This method is very useful to get importance scores and go a step further towards model interpretation.
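The node-level calculation behind the Gini-based score can be sketched in a few lines of base R. The helper names `gini()` and `gini_decrease()` and the toy node are hypothetical and written for this illustration. They are not functions from the randomForest package, and the sketch covers a single split rather than a whole forest.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini = function(labels) {
  p = table(labels) / length(labels)
  1 - sum(p^2)
}

# Decrease in impurity when a parent node is split into two children,
# with each child weighted by its share of the parent's samples
gini_decrease = function(labels, goes_left) {
  left = labels[goes_left]
  right = labels[!goes_left]
  n = length(labels)
  gini(labels) -
    (length(left) / n) * gini(left) -
    (length(right) / n) * gini(right)
}

# Toy node with 6 "neg" and 4 "pos" samples and one numeric feature
labels = c(rep("neg", 6), rep("pos", 4))
feature = c(1, 2, 2, 3, 3, 4, 6, 7, 8, 9)

parent_impurity = gini(labels)                       # 1 - (0.6^2 + 0.4^2) = 0.48
perfect_split = gini_decrease(labels, feature < 5)   # both children pure: 0.48
partial_split = gini_decrease(labels, feature < 3)   # mixed right child: smaller
c(parent_impurity, perfect_split, partial_split)
```

Summing such decreases over every split that uses a feature, and averaging over the trees, gives a score of the kind reported as MeanDecreaseGini.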
Conclusion
Variable importance is usually followed by variable selection. Whether feature importance is generated before fitting the model (by methods such as correlation scores) or after fitting the model (by methods such as varImp() or Gini importance), it gives insight not only into the features that carry high weight and are used frequently by the model, but also into the features that are slowing the model down.
This is why feature selection is used: it can improve the performance of the model by removing predictors whose influence is negative or due to chance, and it provides faster and more cost-effective implementations by reducing the number of features going into the model.
To decide on the number of features to choose, one should come up with a number such that neither too few nor too many features are being used in the model.
For a methodology such as correlation, features whose correlation is not significant and may be due to chance (say, within the range of +/- 0.1 for a particular problem) can be removed.
For other methods, such as scores from the varImp() function or the importance() function of random forests, one should keep features up to the point where the importance scores decline sharply.
In the case of a large number of features (say hundreds or thousands), a simpler approach is a cutoff, such as keeping only the top 20 or top 25 features, or keeping the top features whose combined importance score reaches a threshold of 80% or 90% of the total importance score.
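The cumulative-threshold rule can be sketched with the random forest scores reported earlier in this article. The 80% threshold and the variable names below are choices made for illustration.

```r
# Mean decrease in Gini scores reported earlier for the diabetes data
scores = c(pregnant = 29.11588, glucose = 91.17223, pressure = 30.88188,
           triceps = 23.91996, insulin = 24.79802, mass = 56.83389,
           pedigree = 42.83993, age = 47.12770)

# Sort from most to least important and compute the cumulative share
sorted = sort(scores, decreasing = TRUE)
cum_share = cumsum(sorted) / sum(sorted)

# Keep the smallest set of top features whose combined share reaches 80%
threshold = 0.80
n_keep = which(cum_share >= threshold)[1]
selected = names(sorted)[seq_len(n_keep)]
selected

# Alternative: a fixed top-k cutoff
top_3 = head(names(sorted), 3)
top_3
```

With only eight features the 80% rule keeps six of them (dropping insulin and triceps), so a cutoff like this is more useful when there are hundreds or thousands of candidates.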
In the end, variable selection is a trade-off between the loss in model complexity (and potentially accuracy) and the gain in execution speed that the project owners are comfortable with.
The methods mentioned in this article are meant to provide an overview of the ways in which variable importance can be calculated for a dataset. There are other, similar variable importance methods, each with its own uses and implementations depending on the situation.
Complete Code
```r
# Use the clusterGeneration library to make a positive definite matrix of 15 features
library(clusterGeneration)
S = genPositiveDefMat("unifcorrmat", dim = 15)

# Create 15 features using a multivariate normal distribution for 5000 data points
library(mnormt)
n = 5000
X = rmnorm(n, varcov = S$Sigma)

# Create a two-class dependent variable using the binomial distribution
Y = rbinom(n, size = 1, prob = 0.3)
data = data.frame(Y, X)

# Create a correlation table for Y versus all features
cor(data, data$Y)

# Using the mlbench library to load diabetes data
library(mlbench)
data(PimaIndiansDiabetes)
data_lm = as.data.frame(PimaIndiansDiabetes)

# Fit a logistic regression model
fit_glm = glm(diabetes ~ ., data = data_lm, family = "binomial")

# Generate summary
summary(fit_glm)

# Using the varImp() function
library(caret)
varImp(fit_glm)

# Import the randomForest library and fit a model
library(randomForest)
fit_rf = randomForest(diabetes ~ ., data = data_lm)

# Get importance scores based on mean decrease in Gini
importance(fit_rf)

# Compare the feature importance with the varImp() function
varImp(fit_rf)

# Create a plot of importance scores by random forest
varImpPlot(fit_rf)
```