In any machine learning process, data preprocessing is the step in which the data is transformed, or encoded, into a state that the machine can easily parse. In other words, the features of the data can then be easily interpreted by the algorithm.
There are a number of data preprocessing methods; we will focus on the five below.
- Data Encoding
- Normalization
- Standardization
- Imputation of missing values
- Discretization
The Iris dataset contains the width and length of the petal and sepal, along with the flower species name.
Information about the dataset
Here, 5 columns are numeric and 1 column holds data of type object. Of the five numeric columns, one is the id and the others are the width and length of the sepal and petal.
Encoding is the process of converting data, or a given sequence of characters, symbols, letters, etc., into a specified format. Here, we assign a unique value to each category of a categorical attribute, for example pass as 1 and fail as 0.
There are two types of encoding:
1. Label encoding
Label encoding converts labels into numeric form so that they are machine-readable. Machine learning algorithms can then better decide how those labels should be handled. It is an important preprocessing step for structured datasets in supervised learning.
As you can see, the ‘Species’ column has 3 categories of flower. After using the label encoder, each category is replaced by a numeric label.
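A minimal sketch of label encoding with scikit-learn's `LabelEncoder`. The small DataFrame is a made-up stand-in for the ‘Species’ column, not the actual Iris file:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for the 'Species' column (values assumed for illustration)
df = pd.DataFrame(
    {"Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]}
)

# LabelEncoder sorts the distinct labels and maps each one to an integer
encoder = LabelEncoder()
df["Species_encoded"] = encoder.fit_transform(df["Species"])

print(list(encoder.classes_))  # the learned categories, in label order
print(df)
```

The integers only identify categories; `encoder.inverse_transform` can map them back to the original names.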
2. One-hot encoding
Though label encoding is straightforward, it has the disadvantage that algorithms can misinterpret the numeric values as having some sort of hierarchy or order. This ordering issue is addressed by a common alternative approach called ‘One-Hot Encoding’. In this strategy, each category value is converted into a new column, and each row is assigned a 1 or 0 (notation for true/false) in that column.
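One-hot encoding can be sketched with scikit-learn's `OneHotEncoder`. As before, the DataFrame is a made-up stand-in for the ‘Species’ column:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for the 'Species' column (values assumed for illustration)
df = pd.DataFrame(
    {"Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]}
)

# The encoder expects a 2-D input, hence the double brackets.
# The default output is a sparse matrix; toarray() makes it dense.
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(df[["Species"]]).toarray()

# One new column per category, in the order stored in categories_
columns = list(encoder.categories_[0])
one_hot_df = pd.DataFrame(one_hot, columns=columns)
print(one_hot_df)
```

Each row contains exactly one 1, so no ordering between the species is implied.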
Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.
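The formula for Min-Max scaling is:
X' = (X - X_min) / (X_max - X_min)
where X_min and X_max are the minimum and maximum values of the feature.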
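As a small sketch, Min-Max scaling can be applied with scikit-learn's `MinMaxScaler` and checked against a direct computation. The sample values are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature with made-up values (one column, four rows)
X = np.array([[1.0], [2.0], [3.0], [5.0]])

# MinMaxScaler rescales each column to the range [0, 1] by default
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# The same result computed by hand: (X - min) / (max - min)
X_manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_scaled.ravel())
print(np.allclose(X_scaled, X_manual))
```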
Standardization is another scaling technique where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero and the resultant distribution has a unit standard deviation.
The formula for standardization is:
X' = (X - μ) / σ
where μ is the mean and σ is the standard deviation of the feature.
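Standardization can be sketched with scikit-learn's `StandardScaler`, again on made-up values. After scaling, the column should have a mean of about 0 and a standard deviation of about 1:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature with made-up values
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# StandardScaler subtracts the column mean and divides by the
# column's (population) standard deviation
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# The same result computed by hand: (X - mean) / std
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.ravel())
print(X_std.mean(axis=0), X_std.std(axis=0))
```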
Imputation of missing values
Missing values are data points that are not available in the dataset. A single value may be missing, or, at the other extreme, only one value may be available while all the others are missing.
Here is an example of simple imputation, in which the missing values are replaced with the mean value.
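A minimal sketch of mean imputation using scikit-learn's `SimpleImputer`. The column name and values are assumptions chosen for illustration, with one value deliberately set to missing:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up measurements with one missing entry (NaN)
df = pd.DataFrame({"SepalLengthCm": [5.1, np.nan, 4.7, 4.6]})

# strategy="mean" replaces each NaN with the mean of the observed
# values in the same column
imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(df[["SepalLengthCm"]])

df["SepalLengthCm_imputed"] = imputed.ravel()
print(df)
```

Other strategies such as `"median"` or `"most_frequent"` can be swapped in when the mean is distorted by outliers or the column is categorical.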
Discretization is the process of transforming continuous variables, models or functions into a discrete form. We do this by creating a set of contiguous intervals (or bins) that span the range of the desired variable, model or function.
There are 3 types of discretization available in scikit-learn:
1. Quantile Discretization Transform
2. Uniform Discretization Transform
3. KMeans Discretization Transform
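All three transforms can be sketched with scikit-learn's `KBinsDiscretizer` by changing its `strategy` parameter. The data, the choice of 3 bins and the ordinal encoding are assumptions made for illustration:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# One slightly skewed feature with made-up values
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0],
              [6.0], [8.0], [10.0], [15.0], [20.0]])

results = {}
for strategy in ("quantile", "uniform", "kmeans"):
    # encode="ordinal" returns the bin index (0, 1, 2) for each value
    discretizer = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    results[strategy] = discretizer.fit_transform(X).ravel().astype(int)
    print(strategy, results[strategy])
```

Quantile binning aims for roughly the same number of samples per bin, uniform binning uses equal-width intervals, and k-means binning places the bin edges between one-dimensional cluster centers.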