Feature Selection using sklearn

In this post, we will look at how to perform feature selection using sklearn, covering the following topics:

  • Dropping features which have low variance
    1. Dropping features with zero variance
    2. Dropping features with variance below the threshold variance
  • Univariate feature selection
  • Model based feature selection
  • Feature Selection using pipeline

1) Dropping features which have low variance

If a feature has low variance, it may not contribute much to the model. For example, in the following dataset, the features “Offer” and “Online payment” have zero variance, which means all of their values are the same. These two features can be dropped without any negative impact on the model to be built.

A) Dropping features with zero variance

If a feature has the same value across all observations, we can remove that variable. In the following example, two such features can be removed.

Dataset with two features having zero variance

By default, the variance threshold is zero in the VarianceThreshold option in sklearn.feature_selection.

Default variance threshold is zero

Using the following code, we can retain only the variables with non-zero variance.

VarianceThreshold option drops two features with zero variance
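
Since the code itself is not shown above, here is a minimal sketch of this step. The dataset is made up to match the description earlier in the post: only the column names “Offer” and “Online payment” come from the example; the other columns and all of the values are illustrative.

    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    # Hypothetical dataset: "Offer" and "Online payment" take the same value in every row,
    # so their variance is zero
    df = pd.DataFrame({
        "Age":            [25, 32, 47, 51, 38],
        "Purchases":      [3, 7, 2, 9, 4],
        "Offer":          [1, 1, 1, 1, 1],
        "Online payment": [0, 0, 0, 0, 0],
        "Spend":          [120, 340, 90, 560, 210],
    })

    # Default threshold is zero, so only zero-variance features are removed
    selector = VarianceThreshold()
    reduced = selector.fit_transform(df)

    print(df.columns[selector.get_support()])  # Index(['Age', 'Purchases', 'Spend'], dtype='object')
    print(reduced.shape)                       # (5, 3) -> two features dropped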

B) Dropping features with variance below the threshold variance

In the following example, the dataset contains five features, out of which two, “Referred” and “Repeat”, do not vary much. Since features which contain only the values 0 and 1 are Bernoulli random variables, their variance is given by the formula p(1-p), where p is the proportion of 1s.

Dataset with two features (Referred and Repeat) having low variance

Suppose we want to remove any feature that is either 0 or 1 in more than 80% of the observations. At exactly 80%, the variance of such a Bernoulli feature would be 0.8*(1-0.8) = 0.16, so we can use 0.16 as the threshold.

We can specify VarianceThreshold(threshold=(.8 * (1 - .8))) or, equivalently, VarianceThreshold(threshold=0.16).

Features that contain only 1s or only 0s at least 80% of the time are dropped
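
As a sketch of this step, the snippet below uses a made-up dataset matching the description above; only the column names “Referred” and “Repeat” come from the example, and the remaining columns and all of the values are illustrative.

    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    # Hypothetical dataset: "Referred" is 1 in only 10% of the rows and "Repeat" is 0 in
    # only 10% of the rows, so both have variance 0.1 * (1 - 0.1) = 0.09
    df = pd.DataFrame({
        "Gender":   [0, 1, 0, 1, 1, 0, 1, 0, 1, 0],
        "Referred": [0, 0, 0, 0, 0, 0, 0, 0, 1, 0],
        "Repeat":   [1, 1, 1, 1, 1, 0, 1, 1, 1, 1],
        "Member":   [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
        "Discount": [1, 0, 1, 0, 1, 1, 0, 0, 1, 0],
    })

    # Drop features whose variance does not exceed 0.8 * (1 - 0.8) = 0.16
    selector = VarianceThreshold(threshold=(.8 * (1 - .8)))
    reduced = selector.fit_transform(df)

    print(df.columns[selector.get_support()])  # Index(['Gender', 'Member', 'Discount'], dtype='object')
    print(reduced.shape)                       # (10, 3) -> "Referred" and "Repeat" are dropped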

2) Univariate feature selection

In this type of selection method, a score is computed for each feature to capture its importance. The score can be calculated using different measures such as the chi-square statistic, F value, mutual information, etc.

Some of the options available for univariate feature selection in sklearn are SelectKBest, SelectPercentile, SelectFpr, SelectFdr, SelectFwe, and GenericUnivariateSelect.

Let us use the example provided by sklearn to understand how univariate feature selection works.

In the following example, the original iris dataset contains four predictors.

Original dataset contains four predictors

We want to retain only three predictors based on the chi-square value. The following code selects the top three features.

Best three predictors are retained based on chi-square value
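
A minimal version of that selection step (the scoring function chi2 and k=3 reflect the description above):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2

    X, y = load_iris(return_X_y=True)
    print(X.shape)  # (150, 4) -> four predictors

    # Keep the three predictors with the highest chi-square scores
    X_new = SelectKBest(chi2, k=3).fit_transform(X, y)
    print(X_new.shape)  # (150, 3)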

The scoring functions used for regression problems are f_regression and mutual_info_regression.

For classification problems, chi2, f_classif, and mutual_info_classif are the scoring functions used.
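
Switching the measure only changes the score function passed to the selector; here is a small illustrative sketch for the classification case:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

    X, y = load_iris(return_X_y=True)

    # Same selector, different scoring functions
    X_f = SelectKBest(f_classif, k=3).fit_transform(X, y)
    X_mi = SelectKBest(mutual_info_classif, k=3).fit_transform(X, y)
    print(X_f.shape, X_mi.shape)  # (150, 3) (150, 3)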

3) Model based feature selection

  • Recursive feature elimination (see the sketch after this list)
  • L1-based selection (see the sketch after this list)
  • Tree-based selection (used in the worked example below)
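
The worked example below uses the tree-based approach. For the first two options, here is a brief sketch; the estimators and parameters (LogisticRegression for recursive feature elimination and an L1-penalised LinearSVC for L1-based selection) are illustrative choices, not taken from the original example.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import RFE, SelectFromModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    X, y = load_iris(return_X_y=True)

    # Recursive feature elimination: repeatedly fit a model and drop the weakest feature
    rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2)
    X_rfe = rfe.fit_transform(X, y)

    # L1-based selection: features whose coefficients are shrunk to zero are dropped
    lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
    X_l1 = SelectFromModel(lsvc, prefit=True).transform(X)

    print(X_rfe.shape, X_l1.shape)  # (150, 2) and, typically, (150, 3)
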
In the following example, let us select the best features in the iris dataset using a model. We will use a random forest model to estimate feature importance.

The original dataset has four predictors.

Using Random Forest model to select features

We can get the feature importance from the random forest classifier.

Estimating feature importance

Using the feature importances, the SelectFromModel option retains only two features.

Out of four, two features have been retained

You can see both the original dataset and the feature selected dataset below.

Last two features have been retained
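
Putting the whole sequence together, here is a minimal sketch; the hyperparameters (n_estimators, random_state) are illustrative and not taken from the original code.

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    X, y = load_iris(return_X_y=True)
    print(X.shape)  # (150, 4) -> four predictors

    # Fit a random forest and inspect the impurity-based feature importances
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X, y)
    print(clf.feature_importances_)

    # By default, SelectFromModel keeps the features whose importance exceeds the mean importance
    model = SelectFromModel(clf, prefit=True)
    X_new = model.transform(X)
    print(X_new.shape)  # typically (150, 2): petal length and petal width are retained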

4) Feature Selection using pipeline

Using the Pipeline option, we can combine the step that selects the features (step 1) with the step that trains the model on the selected features (step 2).

Pipeline process: first the features are selected, and then the model is built using the selected features
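
A minimal sketch of such a pipeline; the particular selector (SelectKBest with chi2) and estimator (LogisticRegression) are illustrative choices.

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Step 1: select the top three features; step 2: train the model on the selected features
    pipe = Pipeline([
        ("select", SelectKBest(chi2, k=3)),
        ("model", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X_train, y_train)
    print(pipe.score(X_test, y_test))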

Summary

In this post, we have explored:
  • Dropping features which have low variance
  • Univariate feature selection
  • Model based feature selection
  • Feature Selection using pipeline
