Jay Patel
Oct 29, 2021

AIM:- Data reduction using variance threshold, univariate feature selection, recursive feature elimination, and PCA

This blog shows how to perform data reduction using variance threshold, univariate feature selection, recursive feature elimination, and PCA.

About the dataset:-

For this practical, I have used the “Iris” dataset which is loaded by sklearn.

Fig. Load dataset using sklearn
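
A minimal sketch of this step, assuming scikit-learn is installed. The variable names `iris`, `X`, and `y` are my own choices for illustration.

```python
from sklearn.datasets import load_iris

# Load the Iris dataset that ships with scikit-learn
iris = load_iris()
X, y = iris.data, iris.target

print(X.shape)  # feature matrix: samples x features
print(y.shape)  # one class label per sample
```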

After that, let's look at the dataset information.

Fig. Print information
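
One way to print this information is sketched below. Which attributes to print is my assumption; `feature_names`, `target_names`, and `data` are attributes of the object returned by `load_iris`.

```python
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.feature_names)  # names of the four measurements
print(iris.target_names)   # names of the three species
print(iris.data.shape)     # (number of samples, number of features)
```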

The data has four features. To test the effectiveness of the different feature selection methods, we add some noise features to the dataset.

Fig. feature selection
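
A possible sketch of adding the noise features. The number of noise columns (20), their range (uniform between 0 and 0.1), and the seed are assumptions made for illustration, not values taken from this post.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Assumption: 20 uninformative columns drawn uniformly from [0, 0.1)
rng = np.random.RandomState(0)
noise = rng.uniform(0, 0.1, size=(X.shape[0], 20))

# Append the noise columns to the right of the four real features
X_noisy = np.hstack([X, noise])
print(X_noisy.shape)
```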

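Before applying a feature selection method, we need to split the data first. The reason is that features should be selected using only the information in the training set, not the whole dataset; otherwise, information from the test set leaks into the selection and the test score becomes too optimistic.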

Fig. Split the dataset into test and train set
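
The split could look like the sketch below. The 70/30 ratio, the stratification, and the seed are assumptions; the noise features are rebuilt here so the snippet runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X_noisy = np.hstack([X, rng.uniform(0, 0.1, size=(X.shape[0], 20))])

# Hold out 30% of the samples for testing, keeping class proportions equal
X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
```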

Variance Threshold:-

VarianceThreshold is a feature selector that removes all low-variance features. It looks only at the features (X), not the desired outputs (y), and can thus be used for unsupervised learning. Features whose training-set variance is lower than the chosen threshold are removed. With the default threshold of 0, only features that have the same value in every sample are dropped.

Fig. VarianceThreshold
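
A sketch of how VarianceThreshold might be applied here. The threshold of 0.01 and the noise setup are assumptions chosen so that the low-variance noise columns fall below the threshold while the four real features stay above it.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X_noisy = np.hstack([X, rng.uniform(0, 0.1, size=(X.shape[0], 20))])
X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, stratify=y, random_state=0
)

# Fit on the training set only; y is not needed by this selector
selector = VarianceThreshold(threshold=0.01)
X_train_sel = selector.fit_transform(X_train)
X_test_sel = selector.transform(X_test)  # reuse the same column mask

print(selector.variances_.round(4))  # per-feature training variance
print(selector.get_support())        # True for the columns that are kept
```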

Univariate Feature Selection:-

Univariate feature selection works by selecting the best features based on univariate statistical tests. We compare each feature to the target variable, one at a time, to see whether there is a statistically significant relationship between them. With the F-test (f_classif), this test is a one-way analysis of variance (ANOVA). … Because each feature is evaluated on its own, ignoring the other features, the method is called ‘univariate’.

1. f_classif:-

Fig. F_classif Univariate Feature Selection
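
A sketch using `SelectKBest` with `f_classif`. Keeping `k=4` features and the noise setup are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X_noisy = np.hstack([X, rng.uniform(0, 0.1, size=(X.shape[0], 20))])
X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, stratify=y, random_state=0
)

# Score each feature with the ANOVA F-test and keep the 4 highest scores
selector = SelectKBest(score_func=f_classif, k=4)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

print(selector.scores_.round(2))  # one F-value per feature
print(selector.get_support())     # True for the selected columns
```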

2. chi2:-

Fig. chi2 Univariate Feature Selection
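
The same pattern with `chi2` as the scoring function. Note that `chi2` requires non-negative feature values, which holds for the Iris measurements and for the assumed noise columns; `k=4` is again an assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X_noisy = np.hstack([X, rng.uniform(0, 0.1, size=(X.shape[0], 20))])
X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, stratify=y, random_state=0
)

# Score each (non-negative) feature with the chi-squared statistic
selector = SelectKBest(score_func=chi2, k=4)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

print(selector.scores_.round(2))
print(selector.get_support())
```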

3. mutual_info_classif:-

Fig. mutual_info_classif Univariate Feature Selection
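
A sketch with `mutual_info_classif`. Its estimate involves some randomness, so the seed is fixed through `functools.partial`; the seed, `k=4`, and the noise setup are assumptions.

```python
from functools import partial

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X_noisy = np.hstack([X, rng.uniform(0, 0.1, size=(X.shape[0], 20))])
X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, stratify=y, random_state=0
)

# Fix random_state so the mutual information estimates are reproducible
score_func = partial(mutual_info_classif, random_state=0)
selector = SelectKBest(score_func=score_func, k=4)
X_train_sel = selector.fit_transform(X_train, y_train)
X_test_sel = selector.transform(X_test)

print(selector.scores_.round(3))  # estimated mutual information per feature
print(selector.get_support())
```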

Recursive Feature Elimination:-

Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached. … RFE requires the number of features to keep to be specified in advance; however, it is often not known beforehand how many features are valid.

Fig. Recursive feature elimination
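
A sketch of RFE on the noisy data. The choice of `LogisticRegression` as the estimator, `n_features_to_select=4`, and `step=1` are assumptions; any estimator that exposes `coef_` or `feature_importances_` can be used.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X_noisy = np.hstack([X, rng.uniform(0, 0.1, size=(X.shape[0], 20))])
X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, stratify=y, random_state=0
)

# Refit the model repeatedly, dropping the weakest feature each round
rfe = RFE(
    estimator=LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    step=1,
)
X_train_sel = rfe.fit_transform(X_train, y_train)
X_test_sel = rfe.transform(X_test)

print(rfe.support_)   # True for the features that survived
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier
```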

Differences Between Before and After Using Feature Selection:-

a. Before using Feature Selection

Fig. Before using Feature Selection

b. After using Feature Selection

Fig. After using Feature Selection
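
One way to make this comparison is sketched below: the same classifier is trained once on all features and once behind a feature selector, and the test accuracies are printed. The classifier (`SVC`), the selector (`SelectKBest` with `f_classif`, `k=4`), and the noise setup are assumptions, so the numbers will not necessarily match the figures above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
X_noisy = np.hstack([X, rng.uniform(0, 0.1, size=(X.shape[0], 20))])
X_train, X_test, y_train, y_test = train_test_split(
    X_noisy, y, test_size=0.3, stratify=y, random_state=0
)

# a. Before: train on all 24 columns (4 real + 20 noise)
clf_before = SVC().fit(X_train, y_train)
acc_before = clf_before.score(X_test, y_test)

# b. After: select 4 features inside a pipeline, then train the same model
clf_after = make_pipeline(SelectKBest(f_classif, k=4), SVC())
clf_after.fit(X_train, y_train)
acc_after = clf_after.score(X_test, y_test)

print(f"Accuracy before feature selection: {acc_before:.3f}")
print(f"Accuracy after feature selection:  {acc_after:.3f}")
```

Putting the selector inside the pipeline keeps the selection step fitted on the training data only.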

Principal Component Analysis (PCA):-

The principal components of a collection of points in real coordinate space are a sequence of p unit vectors, where the i-th vector is the direction of a line that best fits the data while being orthogonal to the first i-1 vectors.

Fig. Importing PCA using sklearn
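
To make the definition above concrete, here is a small sketch that imports PCA, fits it on the Iris features, and checks that the components are unit vectors that are orthogonal to each other. Keeping all four components is my choice for the check.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Keep all p = 4 components so the full set of directions can be inspected
pca = PCA()
pca.fit(X)

components = pca.components_        # one principal direction per row
gram = components @ components.T    # identity matrix if rows are orthonormal

print(np.allclose(gram, np.eye(4)))
print(pca.explained_variance_ratio_)  # share of variance per component
```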

PCA Projection to 2D

The original data has 4 columns (sepal length, sepal width, petal length, and petal width). In this section, the code projects the original data which is 4 dimensional into 2 dimensions. The new components are just the two main dimensions of variation.

Fig. PCA Projection to 2D
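
A sketch of the projection. Standardizing the features first with `StandardScaler` is an assumption (a common step before PCA), and the names `principalDf` and the column labels are hypothetical.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Assumption: scale each feature to zero mean and unit variance first
X_scaled = StandardScaler().fit_transform(X)

# Project the 4-dimensional data onto its first two principal components
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X_scaled)

principalDf = pd.DataFrame(
    principal_components,
    columns=["principal component 1", "principal component 2"],
)
print(principalDf.head())
print(pca.explained_variance_ratio_)  # variance captured by each component
```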

Next, we concatenate the DataFrames along axis = 1. finalDf is the final DataFrame before plotting the data.

Fig. Concatenating DataFrame along axis = 1
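
A sketch of the concatenation. `finalDf` is the name used above; `principalDf`, `targetDf`, and the column labels are hypothetical, and the projection is rebuilt here so the snippet runs on its own.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_scaled = StandardScaler().fit_transform(iris.data)
principalDf = pd.DataFrame(
    PCA(n_components=2).fit_transform(X_scaled),
    columns=["principal component 1", "principal component 2"],
)

# One column holding the species name of each sample
targetDf = pd.DataFrame({"target": iris.target_names[iris.target]})

# axis=1 places the DataFrames side by side, aligned on the row index
finalDf = pd.concat([principalDf, targetDf], axis=1)
print(finalDf.head())
```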

Now, let's visualize the data frame by executing the following code:

Fig. Visualization of the data frame
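
A possible plotting sketch with matplotlib. The colors, figure size, output file name, and the `Agg` backend (so that the script also runs without a display) are assumptions; remove the `matplotlib.use` line and call `plt.show()` to open a window instead.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render to a file instead of a window
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_2d = PCA(n_components=2).fit_transform(
    StandardScaler().fit_transform(iris.data)
)

fig, ax = plt.subplots(figsize=(8, 8))
# Draw one scatter layer per species so each gets its own color and label
for label, color in zip(range(3), ["r", "g", "b"]):
    mask = iris.target == label
    ax.scatter(X_2d[mask, 0], X_2d[mask, 1], c=color, s=50,
               label=iris.target_names[label])

ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_title("2 component PCA")
ax.legend()
ax.grid(True)
fig.savefig("pca_2d.png")
```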

Now let's visualize a 3D graph:

Fig. Code for 3D visualization
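
A sketch of a 3D version, assuming the plot shows the first three principal components of the standardized data. The colors, output file name, and `Agg` backend are again assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render to a file instead of a window
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

iris = load_iris()
X_3d = PCA(n_components=3).fit_transform(
    StandardScaler().fit_transform(iris.data)
)

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(projection="3d")  # 3D axes provided by matplotlib
for label, color in zip(range(3), ["r", "g", "b"]):
    mask = iris.target == label
    ax.scatter(X_3d[mask, 0], X_3d[mask, 1], X_3d[mask, 2], c=color, s=40,
               label=iris.target_names[label])

ax.set_xlabel("Principal Component 1")
ax.set_ylabel("Principal Component 2")
ax.set_zlabel("Principal Component 3")
ax.set_title("3 component PCA")
ax.legend()
fig.savefig("pca_3d.png")
```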

Fig. Image of 3D visualization


I hope this helps you understand these techniques.

Github Link:-