By **Dr. Shripad Bhat**, Data Scientist

September 4, 2019

## Data Preprocessing - Creating Dummy Variables and Converting Ordinal Variables to Numbers with Examples

- Handling missing values
- Handling outliers
- Transforming nominal variables to dummy variables (discussed in this post)
- Converting ordinal data to numbers (discussed in this post)
- Transformation (discussed in this post)

## I) Transforming nominal variables to dummy variables

There are many ways of creating dummy variables in python. I will be using pandas get_dummies function in the following example.

## Example

Sample data for dummy variable creation

- prefix (which will used in naming the dummy variables)

I used prefix=’City’, hence the newly created dummy variables bear the name City_Mumbai, City_New Delhi.

**drop_first =True**

This will create k-1 dummy variables for k categories (in this case 4, because there are 4 unique city names: New Delhi, Mumbai, Bengaluru and Xyz ) to avoid dummy variable trap in some of the machine learning models such as regression. Since we have set drop_first =True, pandas will create k-1=4-1=3 dummy variables as shown in the picture below. If we don’t specify **drop_first **option, it will create k dummy variables (i.e. one each for each cities, in our case, four dummy variables).

One thing to note here is that, since get_dummies option creates dummy variables depending on number of categories present in given data, we cannot assume it will create same number of dummy variables for both training and test data. Why? Because number of categories present in training and test data may be different.

## II) Converting ordinal data to numbers

There are several ways to convert categories into numbers (like 1, 2, 3). Find and replace is one such option. Ordinal Encoder is another option in scikit-learn v0.20.0. I have used the function suggested by Chris Albon. We can define a function named category_to_numeric and apply it as shown in the picture below. A new column named ‘Population_num’ is created.

## III) Transformation

Log transformation is useful when data is right skewed

## Example

In Python, I have created a sample dataset which is slightly right skewed. You can see the effect of log transformation on the skewness of the distribution in the graphs.

## Summary

- Transforming nominal variables to dummy variables
- Converting ordinal data to numbers
- Transformation