Working in the machine learning field is not only about building different classification or clustering models. It’s more about feeding the right set of features into those models.

This process of feeding the right set of features into the model mainly takes place after the data collection process.

Once we have enough data, we can’t feed the entire dataset into the model and expect great results. We need to preprocess the data.

In fact, data preprocessing is the challenging and key part of the machine learning process.

Below are the key things we intend to do in the data preprocessing stage.

  • Feature transformation
  • Feature selection

Feature transformation means transforming the existing features into other forms, for example using the logarithmic function to convert raw features into logarithmic features.
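As a small illustration of a feature transformation, the sketch below applies a base-10 logarithm to a skewed numeric feature in R. The vector income and its values are made up for this example and are not taken from the article's data.

```r
# Hypothetical skewed feature (values invented for illustration)
income <- c(1000, 10000, 100000, 1000000)

# The logarithmic transformation compresses the large range of values
log_income <- log10(income)

print(log_income)
```

The transformed feature can then be used in place of, or alongside, the original one.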

Feature selection is to select the best features out of the existing features. In this article, we are going to learn the basic techniques to pick the best features for modeling.

Before we dive further, let’s have a look at the table of contents.

Table of contents:

  • Why modeling is not the final step
  • The role of correlation
  • Calculating feature importance with regression methods
  • Using caret package to calculate feature importance
  • Random forest for calculating feature importance
  • Conclusion
  • Related courses
    • Exploratory data analysis in R
    • Machine learning A-Z in R

Why Modeling is Not The Final Step

Like a coin, every project has two sides.

  • Business side
  • Technical side

The technical side deals with data collection, processing and then implementing it to get results. The business side is what envelops the technical side.

Figure: Business side and technical side

The business side starts by defining the requirements, hands them over to the technical team to generate results, and then takes over again to convert those results into actionable insights. This is why both teams need to know what was implemented behind the scenes in the project.

The team handling the technical part may consider the models and processes to be its core deliverable, but for the business team, running a model and getting high accuracy is never the end goal of the project.

It is the understanding of the project that makes it actionable. If you build a model but do not know what is happening inside it, it is a black box: it may be perfect for lab results, but it is not something that can be put into production.

While one may not be concerned with every detail of what is happening, one is definitely interested in the actionable insights that can be derived from the model. Variable importance can help achieve this objective.

Most models have a method to generate variable importance which indicates what features are used in the model and how important they are. Variable importance also has a use in the feature selection process.

As the Occam’s Razor principle suggests:

The simplest models are the best.

Ranking features by decreasing variable importance helps one identify and keep the features that account for most of the model’s performance (say, 80% of it) and discard the remaining variables, which together contribute only the rest.

Looking at the variables (features) one by one also helps in understanding which features are important and how they contribute towards solving a business problem.

Variable importance is not difficult to derive, and how it is derived depends on the methodology being followed. This is why it can be calculated in more than one way. It is not rocket science.

This article describes some such ways.

Role of Correlation

Figure: Correlation coefficient

If you are working with a model that assumes a linear relationship between the features and the dependent variable, correlation can help you come up with an initial ranking of importance. It also works as a rough guide for nonlinear models.

The idea is that features which have a high correlation (positive or negative) with the dependent variable tend to be strong predictors when used in a model.

Let us generate a random dataset for this article.

Let us now create a dependent feature Y and plot a correlation table for these features.
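The original listing for this step is not shown here. A minimal sketch of the idea in base R might look like the following, assuming 15 standard-normal features named X1 to X15, 100 rows and a fixed seed; all of these choices are assumptions made for illustration, so the feature that ends up with the highest correlation will not necessarily be the one discussed in the text.

```r
# Assumed setup: the seed, row count and number of features are illustrative
set.seed(42)
n <- 100

# Random, mutually independent features X1..X15
features <- as.data.frame(matrix(rnorm(n * 15), nrow = n))
names(features) <- paste0("X", 1:15)

# Dependent feature Y, also random and unrelated to the X columns
features$Y <- rnorm(n)

# Correlation of every feature with Y (the "Y" column of the full table)
cor_table <- cor(features)
cor_with_y <- cor_table[, "Y"]
cor_with_y <- cor_with_y[names(cor_with_y) != "Y"]

# Rank features by absolute correlation with Y
print(sort(abs(cor_with_y), decreasing = TRUE))
```

Sorting by absolute value treats strong negative correlations as being just as informative as strong positive ones.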

As expected, since we are using a randomly generated dataset, there is little correlation of Y with all other features. These numbers may be different for different runs.

In this case, the correlation for X11 seems to be the highest. If we had to use this data for modeling, X11 would be expected to have the maximum impact on predicting Y. In this way, the list of correlations with the dependent variable is useful for getting an idea of the features that impact the outcome.

When computing correlations, we assume that the features and the dependent variable are numeric. If Y is instead a class, we can look at the distribution of each feature within every class of Y.
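For a class-valued Y, one hedged way to look at per-class distributions is sketched below. It uses the iris dataset that ships with R, with Species standing in for Y; this dataset is an assumption chosen for convenience and is not the data used in the article.

```r
# iris ships with base R; Species plays the role of a class-valued Y
class_means <- aggregate(. ~ Species, data = iris, FUN = mean)
print(class_means)

# Box plots show the full distribution of one feature within each class
boxplot(Petal.Length ~ Species, data = iris,
        main = "Petal.Length by class")
```

Features whose distributions barely overlap across classes are good candidates for strong predictors.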

Using Regression to Calculate Variable Importance

The summary() function for a regression model also describes the features and how they affect the dependent feature, through their statistical significance. For each feature it reports a coefficient estimate, its standard error and a p-value, and it marks the features that are statistically significant.

Such features usually have a p-value of less than 0.05, which means they are significant at the 5% level (a 95% confidence level).

Let us look at an example:
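The original listing is not reproduced here. A minimal sketch, assuming the Pima Indians diabetes data from the mlbench package (an assumption, chosen because its columns include glucose, mass and pregnant), fits a logistic regression with glm() and pulls the p-values out of its summary:

```r
library(mlbench)  # assumed source of the diabetes data

data(PimaIndiansDiabetes)

# Logistic regression of the diabetes outcome on all other columns
fit_glm <- glm(diabetes ~ ., data = PimaIndiansDiabetes, family = binomial)

# Full summary: estimates, standard errors, p-values and significance stars
print(summary(fit_glm))

# Extract the p-values, dropping the intercept row
coefs <- summary(fit_glm)$coefficients
p_values <- coefs[-1, "Pr(>|z|)"]

# Features significant at the 5% level, most significant first
significant <- sort(p_values[p_values < 0.05])
print(significant)
```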

The last line of the model summary output reads: Number of Fisher Scoring iterations: 5

The output of the logistic regression model gives us the coefficient estimates and p-values for each of the features. It also marks the important features with stars based on their p-values.

A feature whose class is a factor is broken up by factor level, with a separate row for each level other than the reference level. We see that the most important variables include glucose, mass and pregnant features for diabetes prediction. In this manner, regression models provide us with a list of important features.

Using the Caret Package to Calculate Variable Importance

R has a caret package which includes the varImp() function to calculate feature importance for many types of models.

Let’s compare our previous model summary with the output of the varImp() function.
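A hedged sketch of this comparison is shown below. It refits the logistic regression so that the block runs on its own, again assuming the mlbench Pima Indians diabetes data. For a glm model, caret's varImp() scores each feature by the absolute value of its test statistic, which is why the ranking lines up with the p-values.

```r
library(mlbench)  # assumed source of the diabetes data
library(caret)

data(PimaIndiansDiabetes)

# Same logistic regression as in the regression section
fit_glm <- glm(diabetes ~ ., data = PimaIndiansDiabetes, family = binomial)

# varImp() returns a data frame with one "Overall" score per feature
imp <- varImp(fit_glm)

# Sort features from most to least important
imp_sorted <- imp[order(-imp$Overall), , drop = FALSE]
print(imp_sorted)
```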

The varImp() output ranks glucose as the most important feature, followed by mass and pregnant. This is the same ordering that the p-values of the logistic regression model give.

However, the varImp() function also works with other models such as random forests, and the importance score it generates gives an idea of the relative importance of each feature.

Variable Importance Through Random Forest

Random forests are based on decision trees and use bagging to build a model over the data. Random forests also have a feature importance methodology which uses the Gini index to assign a score to each feature and rank the features.

Let us see an example and compare it with varImp() function.
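The original listing is not shown here. A minimal sketch, assuming the same mlbench Pima Indians diabetes data and an arbitrary seed (both assumptions), fits a random forest and prints the scores from importance() next to those from varImp():

```r
library(mlbench)       # assumed source of the diabetes data
library(randomForest)
library(caret)

data(PimaIndiansDiabetes)

# Random forests are randomized, so fix a seed (value chosen arbitrarily)
set.seed(100)
fit_rf <- randomForest(diabetes ~ ., data = PimaIndiansDiabetes)

# Importance as reported by randomForest (column "MeanDecreaseGini")
rf_imp <- importance(fit_rf)

# Importance as reported by caret (column "Overall")
caret_imp <- varImp(fit_rf)

# Side-by-side view of the two sets of scores
print(cbind(rf_imp, caret_imp))

# Features ranked by mean decrease in Gini
rf_ranked <- rf_imp[order(-rf_imp[, "MeanDecreaseGini"]), , drop = FALSE]
print(rf_ranked)

# Dot chart of the same scores
varImpPlot(fit_rf)
```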

We see that the importance scores by varImp() function and the importance() function of random forest are exactly the same. If the model being used is random forest, we also have a function known as varImpPlot() to plot this data.

These scores, which the importance measure labels ‘Mean Decrease Gini’, represent how much each feature contributes to the homogeneity of the data. The way it works is as follows:

Each time a feature is used to split data at a node, the Gini index is calculated at the node being split (the parent) and at both of its child nodes. The Gini index measures impurity: it is 0 for completely homogeneous data and grows as the classes become more mixed, reaching at most 0.5 for two classes and approaching 1 only when there are many evenly mixed classes.

The difference between the Gini index of the parent node and the weighted Gini index of its child nodes is calculated for the feature. This difference is the decrease in Gini that the split achieves.

Here, the nodes are also said to result in ‘purity’ of the data which means that the data is more easily classified. If the purity is high, the mean decrease in Gini index is also high.

Hence, the mean decrease in Gini index is highest for the most important feature.

Such features are useful in classifying the data and are likely to split it into pure, single-class nodes when used at a node. Hence they tend to be chosen early during splitting.

The overall mean decrease in Gini for each feature is thus calculated by adding up the decreases in Gini from every split that uses the feature and averaging this total over all the trees in the forest.
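To make the Gini calculation concrete, here is a small base R sketch for a single split. The helper functions gini() and gini_decrease() and the toy labels are hypothetical names and data introduced for illustration. The child nodes are weighted by their share of the parent's samples, which is a common convention and may differ in scaling from what the randomForest package reports.

```r
# Gini impurity of a set of class labels: 1 - sum of squared class shares
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Decrease in Gini for one split: parent impurity minus the
# sample-weighted impurity of the two child nodes
gini_decrease <- function(parent, left, right) {
  n <- length(parent)
  gini(parent) -
    (length(left) / n) * gini(left) -
    (length(right) / n) * gini(right)
}

# Toy node with two evenly mixed classes, split perfectly by some feature
parent <- c("pos", "pos", "pos", "neg", "neg", "neg")
left   <- c("pos", "pos", "pos")
right  <- c("neg", "neg", "neg")

print(gini(parent))                        # impurity before the split
print(gini(left))                          # impurity of a pure child node
print(gini_decrease(parent, left, right))  # decrease achieved by the split
```

Summing such decreases over every split that uses a feature, and averaging over the trees, gives the kind of score described above.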

This method is very useful to get importance scores and go a step further towards model interpretation.

Conclusion


Variable importance is usually followed by variable selection. Whether feature importance is generated before fitting the model (by methods such as correlation scores) or after fitting the model (by methods such as varImp() or Gini importance), it gives insight not only into the features that carry high weight and are used frequently by the model, but also into the features that are slowing the model down.

This is why feature selection is used: it can improve the performance of the model by removing predictors whose influence is due to chance or is negative, and it provides faster and more cost-effective implementations by reducing the number of features going into the model.

To decide on the number of features to choose, one should come up with a number such that neither too few nor too many features are being used in the model.

For a methodology such as correlation, features whose correlation with the dependent variable is weak enough to be due to chance (say, within the range of +/- 0.1 for a particular problem) can be removed.
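A minimal sketch of such a correlation filter in base R is shown below. The simulated data, the seed and the 0.1 threshold are all assumptions for illustration: Y is built to depend on X1 and X2 only, so X3 and X4 are related to it only by chance.

```r
# Simulated data (assumed for illustration): Y depends on X1 and X2 only
set.seed(1)
n <- 200
dat <- data.frame(X1 = rnorm(n), X2 = rnorm(n), X3 = rnorm(n), X4 = rnorm(n))
dat$Y <- 2 * dat$X1 - dat$X2 + rnorm(n)

# Correlation of each feature with Y, as a named vector
cors <- cor(dat[, names(dat) != "Y"], dat$Y)[, 1]

# Keep features whose absolute correlation exceeds the chosen threshold
threshold <- 0.1
kept <- names(cors)[abs(cors) > threshold]
dropped <- setdiff(names(cors), kept)

print(round(cors, 3))
print(kept)
print(dropped)
```

The right threshold depends on the problem and on the sample size, since small samples produce larger chance correlations.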

For other methods, such as scores from the varImp() function or the importance() function of random forests, one should keep features up to the point where the importance scores decline sharply.

In the case of a large number of features (say hundreds or thousands), a simpler approach is a cutoff, such as keeping only the top 20 or top 25 features, or keeping the top features until their combined importance score crosses a threshold of 80% or 90% of the total importance score.
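The cumulative-threshold rule can be sketched in a few lines of base R. The importance scores below are hypothetical values invented for the example; in practice they would come from a function such as varImp() or importance().

```r
# Hypothetical importance scores (values invented for illustration)
scores <- c(f1 = 40, f2 = 25, f3 = 16, f4 = 9, f5 = 6, f6 = 4)

# Rank features and compute each one's cumulative share of total importance
scores <- sort(scores, decreasing = TRUE)
cum_share <- cumsum(scores) / sum(scores)

# Keep the smallest set of top features whose combined share reaches 80%
n_keep <- which(cum_share >= 0.80)[1]
selected <- names(scores)[seq_len(n_keep)]

print(round(cum_share, 2))
print(selected)
```

Changing 0.80 to 0.90 keeps more features, and a fixed top-N cutoff would simply replace n_keep with a constant.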

In the end, variable selection is a trade-off between the loss of information from dropping features and the gain in execution speed, at a level that the project owners are comfortable with.

The methods mentioned in this article are meant to provide an overview of the ways in which variable importance can be calculated for a dataset. There are other variable importance methods, each with its own uses and implementations depending on the situation.

Complete Code
