Data Science:-Data Preprocessing with Orange Tool

Jay Patel
3 min read · Oct 12, 2021


This blog is about data preprocessing with the Orange tool. I will discuss how you can use the Orange library in Python to perform preprocessing tasks such as discretization, continuization, randomization, and normalization with the help of Orange functions.

First, we will open the Python Script widget in the Orange tool.

Discretization:-

Discretization converts a continuous (numeric) attribute into a discrete (categorical) one. The range of the attribute is split into a small number of intervals, or bins, and each value is replaced by the bin it falls into. Two common strategies are equal-width binning, where every interval spans the same range, and equal-frequency binning, where every interval holds roughly the same number of instances.

Discretization using Python script
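To make the idea concrete, here is a minimal sketch of equal-width and equal-frequency binning in plain Python. It is not Orange's own implementation (Orange provides its preprocessors in the `Orange.preprocess` module), and the function names `equal_width_bins` and `equal_frequency_bins` are hypothetical, chosen only for this illustration.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything lands in bin 0.
        return [0] * len(values)
    # The maximum value would compute to index n_bins, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bins by rank so each bin holds roughly the same number of values.

    Assumption of this sketch: ties are split by rank, so equal values
    may end up in neighbouring bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]
    print(equal_width_bins(data, 3))      # [0, 0, 0, 0, 0, 0, 2]
    print(equal_frequency_bins(data, 3))  # [0, 0, 0, 1, 1, 2, 2]
```

The outlier 100.0 shows the difference between the two strategies: equal-width binning pushes almost everything into the first bin, while equal-frequency binning keeps the bins balanced.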

Continuization:-

Given a data table, continuization returns a new table in which the discrete attributes are replaced with continuous ones or removed.

  • binary variables are transformed into 0.0/1.0 or -1.0/1.0 indicator variables, depending upon the argument zero_based.
  • multinomial variables are treated according to the argument multinomial_treatment.
  • discrete attributes with only one possible value are removed.
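The first and last rules above can be sketched in a few lines of plain Python. This is an illustration only, not Orange's code: the helper names `continuize_binary` and `drop_single_valued` are hypothetical, and the `zero_based` parameter simply mirrors the argument named in the list.

```python
def continuize_binary(column, values, zero_based=True):
    """Map a two-valued column to 0.0/1.0 (zero_based) or -1.0/1.0.

    `values` lists the two possible values in order; the first one
    maps to the low number, the second one to 1.0.
    """
    low = 0.0 if zero_based else -1.0
    return [low if cell == values[0] else 1.0 for cell in column]


def drop_single_valued(domain):
    """Remove discrete attributes that have only one possible value.

    `domain` maps an attribute name to its list of possible values.
    """
    return {name: vals for name, vals in domain.items() if len(vals) > 1}


if __name__ == "__main__":
    survived = ["no", "yes", "yes"]
    print(continuize_binary(survived, ["no", "yes"]))                    # [0.0, 1.0, 1.0]
    print(continuize_binary(survived, ["no", "yes"], zero_based=False))  # [-1.0, 1.0, 1.0]
    print(drop_single_valued({"sex": ["female", "male"], "planet": ["earth"]}))
```

An attribute with a single possible value carries no information for a model, which is why it is dropped rather than converted.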

Continuize_Indicators

The variable is replaced by indicator variables, each corresponding to one value of the original variable. For each value of the original attribute, only the corresponding new attribute will have a value of one, and the others will be zero. This is the default behavior.

For example, the dataset “titanic” has a feature “status” with the values “crew”, “first”, “second” and “third”, in that order. Its value for the 10th row is “first”. Continuization replaces the variable with the variables “status=crew”, “status=first”, “status=second” and “status=third”; for the 10th row, “status=first” is 1 and the other three are 0.

Continuization Script
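The indicator treatment described above is what is often called one-hot encoding. The following sketch reproduces the “status” example in plain Python; it does not call Orange, and the function name `continuize_indicators` is hypothetical.

```python
def continuize_indicators(name, values, column):
    """Replace one discrete column with one indicator column per value.

    Returns one dict per row; exactly one indicator in each row is 1.0.
    """
    return [
        {f"{name}={v}": 1.0 if cell == v else 0.0 for v in values}
        for cell in column
    ]


if __name__ == "__main__":
    status_values = ["crew", "first", "second", "third"]
    rows = continuize_indicators("status", status_values, ["first", "crew"])
    print(rows[0])
    # {'status=crew': 0.0, 'status=first': 1.0, 'status=second': 0.0, 'status=third': 0.0}
```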

Normalization:-

Normalization rescales continuous attributes to a common scale, so that attributes measured in large units do not dominate those measured in small ones. Two common choices are rescaling each attribute to a fixed range such as [0, 1], and standardizing it to zero mean and unit variance. (This is different from database normalization, which is about organizing tables to eliminate redundancy.)

Normalization using Python script
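As a rough illustration of the two common normalization schemes, here is a sketch using only the Python standard library. It is not Orange's implementation: the function names are hypothetical, and the use of the population standard deviation is an assumption of this sketch.

```python
from statistics import mean, pstdev


def normalize_by_span(values):
    """Rescale values linearly so the minimum becomes 0.0 and the maximum 1.0."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column has no spread to rescale.
        return [0.0] * len(values)
    return [(v - lo) / span for v in values]


def normalize_by_sd(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sd = pstdev(values)
    if sd == 0:
        return [0.0] * len(values)
    return [(v - mu) / sd for v in values]


if __name__ == "__main__":
    ages = [2.0, 4.0, 6.0, 10.0]
    print(normalize_by_span(ages))  # [0.0, 0.25, 0.5, 1.0]
    print(normalize_by_sd(ages))
```

Span-based scaling keeps every value inside a known range but is sensitive to outliers; standardization has no fixed range but is usually less affected by a single extreme value.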

Randomization:-

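Randomization is the process of making something random, for example generating a random permutation of a sequence (such as when shuffling cards) or selecting a random sample of a population (important in statistical sampling). In data preprocessing, it typically means shuffling the values of a column, such as the class labels, across the rows, so that the column keeps its distribution but loses its association with the rest of the data.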

Randomization using Python script
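A minimal sketch of this kind of randomization, shuffling one column across rows with the standard library's `random` module. It is an illustration under stated assumptions, not Orange's code: rows are represented as plain dicts, and the function name `shuffle_column` and its `seed` parameter are hypothetical.

```python
import random


def shuffle_column(rows, column, seed=None):
    """Return new rows with the values of one column permuted across rows.

    All other columns stay in place and the input rows are not modified.
    Passing a seed makes the permutation reproducible.
    """
    rng = random.Random(seed)
    values = [row[column] for row in rows]
    rng.shuffle(values)
    return [{**row, column: value} for row, value in zip(rows, values)]


if __name__ == "__main__":
    table = [
        {"age": 22, "survived": "no"},
        {"age": 38, "survived": "yes"},
        {"age": 26, "survived": "yes"},
        {"age": 35, "survived": "no"},
    ]
    for row in shuffle_column(table, "survived", seed=42):
        print(row)
```

Shuffling the class column this way is a common sanity check: a model that still scores well on randomized labels is probably overfitting or leaking information.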

Conclusion:-

I hope this helps you understand these preprocessing techniques.
