The process of converting categorical data in a dataset into numerical data is known as feature encoding. Because most machine learning models operate only on numerical inputs and cannot use text labels directly, feature encoding is a critical preprocessing step.
Types of Categorical Data:
1. Binary:
Exactly two categories, such as either/or or yes/no.
2. Ordinal:
Groups with a specific order: low/medium/high, cold/hot/lava hot.
3. Nominal:
Unordered groups: cat/dog/tiger, pizza/burger/coke.
Types of Encoding Techniques:
- Label Encoding:
It converts n categories into integers from 0 to n-1.
Example:
Suppose we have a Height column with the values Tall, Medium and Short.
After applying label encoding, each value is replaced by an integer (0 for Tall, 1 for Medium and 2 for Short), so the codes follow the order of the categories.
For ordered data such as the above, label encoding is a good fit. For nominal data with no order, such as Cities (Mumbai, Chennai, Delhi), label encoding might give 0 for Chennai, 1 for Mumbai and 2 for Delhi. The model may then treat Delhi as greater than Mumbai, which is not meaningful here.
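The mapping described above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the helper name `label_encode` and its `order` parameter are assumptions introduced here. It also shows why the category order should be stated explicitly for ordinal data, since codes assigned alphabetically would not follow the Tall/Medium/Short order.

```python
def label_encode(values, order=None):
    """Map each category to an integer code from 0 to n-1.

    If `order` is given, codes follow that order; otherwise they follow
    the sorted (alphabetical) order of the distinct values.
    """
    categories = list(order) if order is not None else sorted(set(values))
    mapping = {category: code for code, category in enumerate(categories)}
    return [mapping[value] for value in values], mapping


heights = ["Tall", "Short", "Medium", "Tall"]

# Explicit order, matching the example in the text: Tall=0, Medium=1, Short=2.
ordered_codes, ordered_mapping = label_encode(
    heights, order=["Tall", "Medium", "Short"]
)
print(ordered_mapping)  # {'Tall': 0, 'Medium': 1, 'Short': 2}
print(ordered_codes)    # [0, 2, 1, 0]

# No order given: alphabetical codes, which no longer follow height order.
alpha_codes, alpha_mapping = label_encode(heights)
print(alpha_mapping)    # {'Medium': 0, 'Short': 1, 'Tall': 2}
```

In practice, scikit-learn's `LabelEncoder` assigns codes by sorted label order, so for ordinal columns an explicit order (for example through `OrdinalEncoder` and its `categories` parameter) is usually the safer choice.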
- One Hot Encoding:
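As seen in the above paragraph, it is a bad idea to use Label Encoding for nominal data, where order has no meaning. This is where One Hot Encoding helps.
In One Hot Encoding, we create one new column per category, so n columns for n categories. Each row gets a 1 in the column of its own category and a 0 in all the others. A common variant, often called dummy encoding, drops one of these columns and keeps n-1, because the dropped category is implied when all remaining columns are 0.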
Let's understand it with an example.
Suppose we have a Fruits column with several categories.
After applying One Hot Encoding, each category gets its own column of 0s and 1s.
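To make this concrete, here is a minimal sketch in plain Python. The fruit values (Apple, Banana, Mango) and the helper name `one_hot` are assumptions chosen for illustration, not values taken from the original example. The sketch builds one 0/1 column per category, and the optional `drop_first` flag (also an assumption of this sketch) shows the variant that keeps only n-1 columns.

```python
def one_hot(values, drop_first=False):
    """Return (column_names, rows) with one 0/1 column per category.

    With drop_first=True the first category's column is omitted, which
    gives the n-1 column variant.
    """
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


fruits = ["Apple", "Mango", "Banana", "Apple"]  # hypothetical sample data

columns, encoded = one_hot(fruits)
print(columns)  # ['Apple', 'Banana', 'Mango']
for fruit, row in zip(fruits, encoded):
    print(fruit, row)
# Apple [1, 0, 0]
# Mango [0, 0, 1]
# Banana [0, 1, 0]
# Apple [1, 0, 0]

# Dropping the first column: an all-zero row now means "Apple".
dummy_columns, dummy_encoded = one_hot(fruits, drop_first=True)
print(dummy_columns)  # ['Banana', 'Mango']
```

With pandas, `pandas.get_dummies` produces the same kind of columns, and its `drop_first` parameter corresponds to the n-1 variant.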
So, for ordinal categorical variables we generally use label encoding, whereas for nominal ones we use One Hot Encoding.