Data Preprocessing - Creating Dummy Variables and Converting Ordinal Variables to Numbers with Examples
I) Transforming nominal variables to dummy variables
There are many ways of creating dummy variables in Python. I will be using the pandas get_dummies function in the following example.
Sample data for dummy variable creation
- prefix (which will be used in naming the dummy variables)
I used prefix='City', hence the newly created dummy variables bear names such as City_Mumbai and City_New Delhi.
- drop_first=True
This will create k-1 dummy variables for k categories, to avoid the dummy variable trap in some machine learning models such as regression. In this case k = 4, because there are 4 unique city names: New Delhi, Mumbai, Bengaluru and Xyz. Since we have set drop_first=True, pandas will create k-1 = 4-1 = 3 dummy variables as shown in the picture below. If we don't specify the drop_first option, it will create k dummy variables (i.e. one for each city, which in our case means four dummy variables).
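To make the two options concrete, here is a minimal sketch. The DataFrame below is a hypothetical stand-in for the sample data (only the city names come from the text above), and the variable names all_dummies and reduced_dummies are my own.

```python
import pandas as pd

# Hypothetical sample data; only the four city names come from the text.
df = pd.DataFrame({"City": ["New Delhi", "Mumbai", "Bengaluru", "Xyz", "Mumbai"]})

# Without drop_first: k = 4 dummy columns, one per city.
all_dummies = pd.get_dummies(df["City"], prefix="City")

# With drop_first=True: k-1 = 3 dummy columns.
# The first category in sorted order (Bengaluru) becomes the baseline.
reduced_dummies = pd.get_dummies(df["City"], prefix="City", drop_first=True)

print(list(all_dummies.columns))
print(list(reduced_dummies.columns))
```

A row with zeros in all three remaining columns then stands for the dropped category.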
One thing to note here is that get_dummies creates dummy variables based on the categories present in the data it is given, so we cannot assume it will create the same set of dummy variables for both training and test data. Why? Because the categories present in the training and test data may differ.
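One possible way around this (a sketch, not the only option) is to fix the list of categories from the training data and reuse it when encoding the test data. The train and test frames and the name city_dtype are hypothetical and made up for illustration.

```python
import pandas as pd

# Hypothetical data: the test set contains no "Bengaluru" or "Xyz" rows.
train = pd.DataFrame({"City": ["New Delhi", "Mumbai", "Bengaluru", "Xyz"]})
test = pd.DataFrame({"City": ["Mumbai", "New Delhi", "Mumbai"]})

# Naive encoding: each frame is encoded from its own categories,
# so the resulting columns do not match.
naive_train = pd.get_dummies(train["City"], prefix="City", drop_first=True)
naive_test = pd.get_dummies(test["City"], prefix="City", drop_first=True)
print(list(naive_train.columns), list(naive_test.columns))

# Fix the category list from the training data and use it for both sets.
city_dtype = pd.CategoricalDtype(categories=sorted(train["City"].unique()))
train_dummies = pd.get_dummies(
    train["City"].astype(city_dtype), prefix="City", drop_first=True, dtype=int
)
test_dummies = pd.get_dummies(
    test["City"].astype(city_dtype), prefix="City", drop_first=True, dtype=int
)
print(list(train_dummies.columns), list(test_dummies.columns))
```

With this approach, a city that appears only in the test data is converted to a missing value and ends up with zeros in every dummy column, so it is worth checking for such rows separately.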
II) Converting ordinal data to numbers
There are several ways to convert categories into numbers (like 1, 2, 3). Find and replace is one such option. OrdinalEncoder, available in scikit-learn since v0.20, is another. I have used the function suggested by Chris Albon. We can define a function named category_to_numeric and apply it as shown in the picture below. A new column named ‘Population_num’ is created.
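A minimal sketch of this approach follows. The function name category_to_numeric and the column name Population_num come from the text. The source column name Population, the labels Low, Medium and High, and their ranks are assumptions made up for illustration, and the function body is my own simplification rather than the original one.

```python
import pandas as pd

# Hypothetical data; the labels and the "Population" column name are assumed.
df = pd.DataFrame({"Population": ["High", "Low", "Medium", "Low"]})


def category_to_numeric(value):
    """Return the rank of an ordinal label, or None for an unknown label."""
    ranks = {"Low": 1, "Medium": 2, "High": 3}  # assumed ordering
    return ranks.get(value)


# Apply the function to every value and store the result in a new column.
df["Population_num"] = df["Population"].apply(category_to_numeric)
print(df)
```

Writing the order out explicitly keeps the numbers meaningful: Low < Medium < High maps to 1 < 2 < 3, which an alphabetical encoding would not guarantee.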
Log transformation is useful when data is right skewed
In Python, I have created a sample dataset which is slightly right-skewed. You can see the effect of the log transformation on the skewness of the distribution in the graphs.
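The effect can also be checked numerically. The sketch below does not reproduce the original sample dataset: it draws a hypothetical right-skewed sample from a log-normal distribution (the seed, sigma and sample size are arbitrary choices) and compares the skewness before and after the transformation.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed sample; not the dataset used for the graphs.
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=0.75, size=1000))

# Natural log of each value. All values here are strictly positive;
# np.log1p is a common alternative when the data contains zeros.
log_values = np.log(values)

print("skewness before:", values.skew())
print("skewness after: ", log_values.skew())
```

A skewness close to zero after the transformation indicates a roughly symmetric distribution.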