# OneHotEncoder

The `OneHotEncoder()` performs one hot encoding. One hot encoding consists of
replacing a categorical variable with a group of binary variables that take the value 0 or
1 to indicate whether a certain category is present in an observation. The binary variables
are also known as dummy variables.

For example, from the categorical variable “Gender” with categories “female” and
“male”, we can generate the boolean variable “female”, which takes 1 if the
observation is female or 0 otherwise. We can also generate the variable “male”,
which takes 1 if the observation is “male” and 0 otherwise. By default, the
`OneHotEncoder()` will return both binary variables from “Gender”: “female” and
“male”.
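
As a rough illustration of the idea, the snippet below builds the two dummy variables with plain pandas. The toy dataframe and the use of `pd.get_dummies` are assumptions made for this sketch; it shows the concept, not how `OneHotEncoder()` is implemented internally.

```python
import pandas as pd

# Toy data: the values are made up for illustration only.
df = pd.DataFrame({"Gender": ["female", "male", "male", "female"]})

# One binary column per category; dtype=int returns 0/1 instead of booleans.
dummies = pd.get_dummies(df["Gender"], dtype=int)

print(dummies)
```

Each row has a 1 in exactly one of the two columns, the one that matches its original category.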

**Binary variables**

When a categorical variable has only 2 categories, like “Gender” in our previous example,
the second dummy variable created by one hot encoding is completely redundant. We
can automatically drop the last dummy variable for variables that contain only 2
categories by setting the parameter `drop_last_binary=True`. This ensures that for
every binary variable in the dataset, only 1 dummy is created. This is recommended
unless we suspect that the variable could, in principle, take more than 2 values.
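
To see why the second dummy adds nothing for a binary variable, the following sketch checks that “male” is always 1 minus “female”. It uses plain pandas and made-up data, so it illustrates the redundancy rather than the behaviour of `drop_last_binary` itself.

```python
import pandas as pd

# Toy binary variable; the values are made up for illustration only.
gender = pd.Series(["female", "male", "male", "female", "male"], name="Gender")

dummies = pd.get_dummies(gender, dtype=int)

# For a variable with exactly 2 categories, one dummy is the complement of the other.
is_complement = bool((dummies["male"] == 1 - dummies["female"]).all())

# Keeping a single column therefore preserves all the information.
single_dummy = dummies[["female"]]

print(is_complement)
print(single_dummy.shape)
```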

**k vs k-1 dummies**

From a categorical variable with k unique categories, the `OneHotEncoder()` can
create k binary variables, or alternatively k-1 to avoid redundant information. This
behaviour can be specified using the parameter `drop_last`. Only k-1 binary variables
are necessary to encode all of the information in the original variable. However, there
are situations in which we may choose to encode the data into k dummies.
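
The claim that k-1 dummies carry all of the information can be checked with a small pandas sketch. The `colour` variable and its values are hypothetical, and the last column is dropped by hand here instead of through the encoder.

```python
import pandas as pd

# Toy variable with k = 3 categories; the values are made up for illustration.
colour = pd.Series(["red", "green", "blue", "green"], name="colour")

# k dummies: columns are sorted alphabetically (blue, green, red).
k_dummies = pd.get_dummies(colour, dtype=int)

# k-1 dummies: drop the last column by hand.
k_minus_1 = k_dummies.iloc[:, :-1]

# An observation with 0 in every remaining dummy must belong to the dropped category.
is_last_category = (k_minus_1.sum(axis=1) == 0).astype(int)

print(k_minus_1.columns.tolist())
print(is_last_category.tolist())
```

The reconstructed indicator matches the dropped column exactly, which is why the last dummy is redundant.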

Encode into k-1 if training linear models: linear models evaluate all features during fit. Thus, with k-1 dummies they have all the information about the original categorical variable.

Encode into k if training decision trees or performing feature selection: tree-based models and many feature selection algorithms evaluate variables or groups of variables separately. Thus, if encoding into k-1, the last category will not be examined. That is, we lose the information contained in that category.
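
For the linear model case, one way to see the redundancy of the k-th dummy is to look at the rank of the design matrix. The sketch below assumes a linear model with an intercept column, which is an assumption of this example and not something stated above; the data are made up.

```python
import numpy as np
import pandas as pd

# Toy variable with k = 3 categories; the values are made up for illustration.
colour = pd.Series(["red", "green", "blue", "green", "red"])
k_dummies = pd.get_dummies(colour, dtype=float)

# Assumed intercept column of a linear model.
intercept = np.ones((len(colour), 1))

# Design matrix with the intercept and all k dummies (4 columns).
X_k = np.hstack([intercept, k_dummies.to_numpy()])

# Design matrix with the intercept and k-1 dummies (3 columns).
X_k_minus_1 = np.hstack([intercept, k_dummies.iloc[:, :-1].to_numpy()])

rank_k = np.linalg.matrix_rank(X_k)
rank_k_minus_1 = np.linalg.matrix_rank(X_k_minus_1)

print(X_k.shape[1], rank_k)
print(X_k_minus_1.shape[1], rank_k_minus_1)
```

With k dummies the columns sum to the intercept, so the matrix has fewer independent columns than columns. With k-1 dummies every column is independent.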

**Encoding only popular categories**

The encoder can also create binary variables for the n most popular categories, n being
determined by the user. For example, if we encode only the 6 most popular categories by
setting the parameter `top_categories=6`, the transformer will add binary variables only
for the 6 most frequent categories. The most frequent categories are those with the largest
number of observations. The remaining categories will not be encoded into dummies. Thus,
if an observation presents a category other than the most frequent ones, it will have a
0 value in each one of the derived dummies. This behaviour is useful when the categorical
variables have high cardinality, to control the expansion of the feature space.
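
The behaviour described above can be mimicked with a few lines of pandas. This is a sketch under stated assumptions: the `city` variable, its values and the dummy names are hypothetical, and the code imitates the described logic rather than reproducing the encoder's implementation.

```python
import pandas as pd

# Toy variable; the values are made up for illustration only.
city = pd.Series(
    ["London", "Paris", "London", "Rome", "Paris", "London", "Oslo"], name="city"
)

n = 2  # number of most frequent categories to encode

# value_counts() sorts by frequency in descending order.
top = city.value_counts().head(n).index.tolist()

# Create one dummy per frequent category only.
dummies = pd.DataFrame({f"city_{c}": (city == c).astype(int) for c in top})

print(top)
print(dummies)
```

The observations for Rome and Oslo show 0 in both dummies, because neither category is among the 2 most frequent.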

**Note**

We can specify whether to encode into k or k-1 binary variables, where k is the number of unique categories, only when creating binary variables for all categories of the variable (instead of the most popular ones). If we encode only the top n most popular categories, the encoder will create only n binary variables per categorical variable. Observations that do not show any of these popular categories will have 0 in all the binary variables.

Let’s look at an example using the Titanic Dataset. First we load the data and divide it into a train and a test set:

```
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

from feature_engine.encoding import OneHotEncoder

# Load dataset
def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    # keep only the first character of the cabin (missing values become 'n')
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    # assign the result back instead of using inplace=True on a column selection
    data['embarked'] = data['embarked'].fillna('C')
    return data

data = load_titanic()

# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['survived', 'name', 'ticket'], axis=1),
    data['survived'], test_size=0.3, random_state=0)
```

Now, we set up the encoder to encode only the 2 most frequent categories of each of the 3 indicated categorical variables:

```
# set up the encoder
encoder = OneHotEncoder(top_categories=2, variables=['pclass', 'cabin', 'embarked'])
# fit the encoder
encoder.fit(X_train)
```

With `fit()` the encoder will learn the most popular categories of the variables, which
are stored in the attribute `encoder_dict_`.

```
encoder.encoder_dict_
```

```
{'pclass': [3, 1], 'cabin': ['n', 'C'], 'embarked': ['S', 'C']}
```

The `encoder_dict_` contains, for each categorical variable, the categories for which
dummy variables will be created.

With `transform()`, we go ahead and encode the variables. Note that by default, the
`OneHotEncoder()` will drop the original variables.

```
# transform the data
train_t = encoder.transform(X_train)
test_t = encoder.transform(X_test)
```

If you do not want to drop the original variables, consider using the `OneHotEncoder` from Scikit-learn and wrapping it with the `SklearnTransformerWrapper`.

**Feature space and duplication**

If the categorical variables are highly cardinal, we may end up with very big datasets after one hot encoding. In addition, if some of these variables are fairly constant or fairly similar, we may end up with one hot encoded features that are highly correlated if not identical.

Consider checking for this and dropping redundant features with the transformers from the selection module.
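
As a quick manual check, identical dummy columns can be spotted with plain pandas before reaching for a dedicated transformer. The sketch below uses two hypothetical variables, `country` and `currency`, that always move together; it is not the selection module's method, just a simple way to see the duplication.

```python
import pandas as pd

# Toy data: two made-up variables that carry the same information.
df = pd.DataFrame({
    "country": ["UK", "FR", "UK", "FR"],
    "currency": ["GBP", "EUR", "GBP", "EUR"],
})

# One hot encode both variables; dummy names follow the pattern variable_category.
dummies = pd.get_dummies(df, dtype=int)

# Transpose so that columns become rows, then flag rows that repeat an earlier one.
duplicated_cols = dummies.columns[dummies.T.duplicated()].tolist()

# Keep only the first occurrence of each distinct column.
deduplicated = dummies.drop(columns=duplicated_cols)

print(duplicated_cols)
print(deduplicated.columns.tolist())
```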

## More details

For more details on the `OneHotEncoder()`’s functionality, visit:

All notebooks can be found in a dedicated repository.