DropMissingData
API Reference
class feature_engine.imputation.DropMissingData(missing_only=True, variables=None)
DropMissingData deletes rows that contain missing values. It works for both numerical and categorical variables. DropMissingData can automatically select all the variables or, alternatively, only the variables with missing data in the train set. Observations with NA in any of the selected variables are then dropped.
The user can also indicate the variables for which observations with NA should be removed.
- Parameters
- missing_only : bool, default=True
If True, rows are dropped only when they have NA in variables that showed NA in the train set during fit. If False, rows with NA in any of the selected variables are dropped.
- variables : list, default=None
The list of variables to check for NA. If None, the transformer considers all variables in the dataframe, narrowed to those with missing data in the train set when missing_only=True.
- **Note**
The transformer first selects all variables, or all user-entered variables. If `missing_only=True`, it then re-selects from that group only the variables that show missing data during fit, that is, in the train set.
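The two-step selection described in the note can be approximated in plain pandas. This is a minimal sketch of the logic as described here, not the library's source code; `select_variables` and the toy dataframe are hypothetical and introduced only for illustration.

```python
import numpy as np
import pandas as pd


def select_variables(X, variables=None, missing_only=True):
    """Hypothetical helper mimicking the selection described in the note."""
    # Step 1: start from the user-entered variables, or from all columns.
    candidates = list(X.columns) if variables is None else list(variables)
    # Step 2: optionally keep only the candidates that show NA in the train set.
    if missing_only:
        candidates = [var for var in candidates if X[var].isna().any()]
    return candidates


# Toy train set (assumed data): "a" and "c" contain NA, "b" does not.
train = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0],
        "b": ["x", "y", "z"],
        "c": [np.nan, "u", "v"],
    }
)

only_missing = select_variables(train)                        # columns that show NA
all_columns = select_variables(train, missing_only=False)     # every column
user_subset = select_variables(train, variables=["a", "b"])   # NA within the user's list
```

With `missing_only=True`, a variable that has no NA in the train set is left out of the selection, so NA that appears in that variable later would not cause a row to be dropped.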
Attributes
variables_:
List of variables for which the rows with NA will be deleted.
Methods
fit:
Learn the variables for which the rows with NA will be deleted.
transform:
Remove observations with NA.
fit_transform:
Fit to the data, then transform it.
return_na_data:
Returns the rows of the dataframe that contain NA.
fit(X, y=None)
Learn the variables for which the rows with NA will be deleted.
- Parameters
- X : pandas dataframe of shape = [n_samples, n_features]
The training dataset.
- y : pandas Series, default=None
y is not needed by this transformer. You can pass y or None.
- Returns
- self
- Raises
- TypeError
If the input is not a Pandas DataFrame.
return_na_data(X)
Returns the subset of the dataframe that contains the rows with missing values. This method can be useful in production, for example to store the observations that will not be fed into the model.
- Parameters
- X : pandas dataframe of shape = [n_samples, n_features]
The dataset from which the rows containing NA should be retained.
- Returns
- X : pandas dataframe of shape = [obs_with_na, features]
The portion of the dataframe that contains only the rows with missing values.
- rtype
DataFrame
- Raises
- TypeError
If the input is not a Pandas DataFrame.
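The relationship between the rows that are kept and the rows that are set aside can be illustrated with plain pandas. This is a sketch of the behaviour described above, not the library's implementation; the toy dataframe and the `selected` list (standing in for the fitted `variables_`) are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy data (assumed): rows 1 and 2 each contain one missing value.
X = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": ["x", "y", None, "w"],
    }
)
selected = ["a", "b"]  # stands in for the fitted variables_

# Rows with NA in at least one selected variable:
# the subset that return_na_data is described to return.
has_na = X[selected].isna().any(axis=1)
rows_with_na = X[has_na]

# Rows with no NA in the selected variables:
# the complete cases that remain after dropping.
complete_rows = X[~has_na]
```

The two subsets are complementary, so storing `rows_with_na` alongside the complete cases accounts for every observation in the input.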
transform(X)
Remove rows with missing values.
- Parameters
- X : pandas dataframe of shape = [n_samples, n_features]
The dataframe to be transformed.
- Returns
- X_transformed : pandas dataframe
The complete case dataframe for the selected variables, of shape [n_samples - rows_with_na, n_features].
- rtype
DataFrame
Example
DropMissingData() deletes rows with NA values. It works with numerical and categorical variables. The user can pass a list of variables for which to delete rows with NA; otherwise, DropMissingData() defaults to all variables. The transformer can also learn which variables have NA in the train set and then remove observations with NA in only those variables.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from feature_engine.imputation import DropMissingData
# Load dataset
data = pd.read_csv('houseprice.csv')
# Separate into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    data.drop(['Id', 'SalePrice'], axis=1),
    data['SalePrice'],
    test_size=0.3,
    random_state=0)
# set up the imputer
missingdata_imputer = DropMissingData(variables=['LotFrontage', 'MasVnrArea'])
# fit the imputer
missingdata_imputer.fit(X_train)
# transform the data
train_t = missingdata_imputer.transform(X_train)
test_t = missingdata_imputer.transform(X_test)
# Number of NA before the transformation:
X_train['LotFrontage'].isna().sum()
189
# Number of NA after the transformation:
train_t['LotFrontage'].isna().sum()
0
# Number of rows before and after transformation
print(X_train.shape)
print(train_t.shape)
(1022, 79)
(829, 79)