The RandomSampleImputer() replaces missing data with a random sample extracted from the variable. It works with both numerical and categorical variables. A list of variables can be indicated, or the imputer will automatically select all variables in the train set.
A seed can be set to a pre-defined number and all observations will be replaced in batch. Alternatively, a seed can be set using the values of 1 or more numerical variables. In this case, the observations will be imputed individually, one at a time, using the values of the variables as a seed.
For example, if an observation has the values color: np.nan, height: 152, weight: 52, and we set up the imputer as:
RandomSampleImputer(random_state=['height', 'weight'], seed='observation', seeding_method='add')
the observation will be replaced using pandas sample as follows:
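The call that originally followed this sentence is not preserved here, so the block below is only a minimal sketch of the idea. It assumes the seed is the integer sum of the seeding variables (152 + 52 = 204) and that the replacement is drawn from the non-missing training values of the same variable. The `train` dataframe and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical training data: the pool of values to sample from.
train = pd.DataFrame({
    "color": ["red", "blue", "green", "blue", "red"],
    "height": [150, 160, 170, 155, 165],
    "weight": [50, 60, 70, 55, 65],
})

# The observation from the example, with a missing color.
observation = {"color": np.nan, "height": 152, "weight": 52}

# seeding_method='add': combine the seeding variables by addition.
seed = int(observation["height"] + observation["weight"])  # 152 + 52 = 204

# Draw a single value from the training values of the variable with that seed.
replacement = train["color"].dropna().sample(1, random_state=seed).iloc[0]
observation["color"] = replacement
```

Because the seed depends only on the observation's own values, the same observation receives the same replacement every time it is imputed against the same training data.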
More details on how to use the RandomSampleImputer():
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split

    import feature_engine.missing_data_imputers as mdi

    # Load dataset
    data = pd.read_csv('houseprice.csv')

    # Separate into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        data.drop(['Id', 'SalePrice'], axis=1), data['SalePrice'],
        test_size=0.3, random_state=0)

    # set up the imputer
    imputer = mdi.RandomSampleImputer(random_state=['MSSubClass', 'YrSold'],
                                      seed='observation', seeding_method='add')

    # fit the imputer
    imputer.fit(X_train)

    # transform the data
    train_t = imputer.transform(X_train)
    test_t = imputer.transform(X_test)

    # compare the distribution of the variable before and after imputation
    fig = plt.figure()
    ax = fig.add_subplot(111)
    X_train['LotFrontage'].plot(kind='kde', ax=ax)
    train_t['LotFrontage'].plot(kind='kde', ax=ax, color='red')
    lines, labels = ax.get_legend_handles_labels()
    ax.legend(lines, labels, loc='best')
RandomSampleImputer(variables=None, random_state=None, seed='general', seeding_method='add')
The RandomSampleImputer() replaces missing data in each feature with a random sample extracted from the variables in the training set. The RandomSampleImputer() works with both numerical and categorical variables. Note: random samples will vary from execution to execution. This may affect the results of your work. Remember to set a seed before running the RandomSampleImputer().
There are two ways in which the seed can be set with the RandomSampleImputer():
If seed = ‘general’, then the random_state can be either None or an integer. The seed will be used as the random_state and all observations will be imputed in one go. This is equivalent to pandas.sample(n, random_state=seed).
If seed = ‘observation’, then the random_state should be a variable name or a list of variable names. The seed will be calculated, observation per observation, either by adding or multiplying the seeding variable values for that observation, and passed to the random_state. Thus, a value will be extracted using that seed, and used to replace that particular observation. This is the equivalent of pandas.sample(1, random_state=var1+var2) if the ‘seeding_method’ is set to ‘add’ or pandas.sample(1, random_state=var1*var2) if the ‘seeding_method’ is set to ‘multiply’.
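The two seeding modes can be sketched in plain pandas as follows. This is an illustration under stated assumptions, not the library's implementation: the helper names `impute_general` and `impute_per_observation` are hypothetical, the batch draw is assumed to sample with replacement, the dataframe index is assumed to be unique, and the data are made up.

```python
import numpy as np
import pandas as pd

def impute_general(train_col, target_col, random_state=None):
    """seed='general': fill all missing values in one batch with a single seed."""
    missing = target_col.isna()
    n_missing = int(missing.sum())
    filled = target_col.copy()
    if n_missing > 0:
        # Assumption: sample with replacement so the draw never runs out of donors.
        drawn = train_col.dropna().sample(
            n_missing, replace=True, random_state=random_state)
        filled.loc[missing] = drawn.to_numpy()
    return filled

def impute_per_observation(train_col, target, column, seed_vars, seeding_method="add"):
    """seed='observation': one draw per missing row, seeded by that row's values."""
    donors = train_col.dropna()
    filled = target[column].copy()
    for idx in target.index[target[column].isna()]:
        values = target.loc[idx, seed_vars]
        if seeding_method == "add":
            seed = int(values.sum())
        elif seeding_method == "multiply":
            seed = int(values.prod())
        else:
            raise ValueError("seeding_method must be 'add' or 'multiply'")
        filled.loc[idx] = donors.sample(1, random_state=seed).iloc[0]
    return filled

# Hypothetical data.
train = pd.DataFrame({
    "height": [150, 160, 170, 155],
    "weight": [50, 60, 70, 55],
    "age": [30.0, 41.0, 25.0, 38.0],
})
test = pd.DataFrame({
    "height": [152, 168],
    "weight": [52, 66],
    "age": [np.nan, np.nan],
})

batch = impute_general(train["age"], test["age"], random_state=0)
per_row = impute_per_observation(train["age"], test, "age", ["height", "weight"], "add")
```

In the batch mode the result for a given row depends on how many other rows are missing. In the per-observation mode it depends only on that row's seeding values.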
For more details on why this functionality is important refer to the course Feature Engineering for Machine Learning in Udemy: https://www.udemy.com/feature-engineering-for-machine-learning/
Note: if the variables indicated in the random_state list are not numerical, the imputer will raise an error. Note also that the variables used as seed should not contain missing values.
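Both conditions can be checked before fitting. The helper below, `check_seed_variables`, is a hypothetical pre-check written for illustration. The exception types and messages are assumptions and may differ from what the imputer itself raises.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def check_seed_variables(X, seed_vars):
    """Raise if a seeding variable is non-numeric or contains missing values."""
    for var in seed_vars:
        if not is_numeric_dtype(X[var]):
            raise TypeError(f"seeding variable '{var}' is not numerical")
        if X[var].isna().any():
            raise ValueError(f"seeding variable '{var}' contains missing values")

# Hypothetical data: 'height' and 'weight' are valid seeding variables.
df = pd.DataFrame({
    "color": ["red", np.nan, "blue"],
    "height": [150, 152, 160],
    "weight": [50, 52, 60],
})
check_seed_variables(df, ["height", "weight"])  # passes silently
```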
This estimator stores a copy of the training set when the fit() method is called. Therefore, the object can become quite heavy. Also, it may not be GDPR compliant if your training data set contains Personal Information. Please check if this behaviour is allowed within your organisation. The imputer replaces missing data with a random sample from the training set.
random_state (int, str or list, default=None) – The random_state can take an integer to set the seed when extracting the random samples. Alternatively, it can take a variable name or a list of variables, whose values will be used to determine the seed, observation per observation.
seed (str, default='general') –
Indicates whether the seed should be set for each observation with missing values, or if one seed should be used to impute all variables in one go.
general: one seed will be used to impute the entire dataframe. This is equivalent to setting the seed in pandas.sample(random_state).
observation: the seed will be set for each observation using the values of the variables indicated in the random_state for that particular observation.
seeding_method (str, default='add') – If more than one variable is indicated to seed the random sampling per observation, you can choose to combine those values by addition or by multiplication. Can take the values ‘add’ or ‘multiply’.
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will select all variables in the train set.
Makes a copy of the variables to impute in the training dataframe from which it will randomly extract the values to fill the missing data during transform.
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to impute.
y (None) – y is not needed in this imputation. You can pass None or y.
Copy of the training dataframe from which to extract the random samples.
Replaces missing data with random values taken from the train set.
X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.
X_transformed (pandas dataframe of shape = [n_samples, n_features]) – The dataframe without missing values in the transformed variables.
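The fit and transform behaviour described above can be sketched with a small stand-in class. `SketchRandomSampleImputer` is hypothetical and covers only the single-seed case. The attribute names `X_` and `variables_` and the use of sampling with replacement are assumptions made for this sketch, not confirmed details of the library.

```python
import numpy as np
import pandas as pd

class SketchRandomSampleImputer:
    """Hypothetical, simplified stand-in; not the feature_engine implementation."""

    def __init__(self, variables=None, random_state=None):
        self.variables = variables
        self.random_state = random_state

    def fit(self, X, y=None):
        # y is accepted for API compatibility and ignored.
        self.variables_ = list(X.columns) if self.variables is None else list(self.variables)
        # Store a copy of the training values for the variables to impute.
        self.X_ = X[self.variables_].copy()
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's dataframe untouched
        for var in self.variables_:
            missing = X[var].isna()
            n_missing = int(missing.sum())
            if n_missing > 0:
                # Assumption: draw with replacement from the stored training values.
                drawn = self.X_[var].dropna().sample(
                    n_missing, replace=True, random_state=self.random_state)
                X.loc[missing, var] = drawn.to_numpy()
        return X

# Hypothetical data.
train = pd.DataFrame({
    "color": ["red", "blue", np.nan, "green"],
    "height": [150.0, np.nan, 170.0, 155.0],
})
test = pd.DataFrame({
    "color": [np.nan, "blue"],
    "height": [np.nan, 160.0],
})

imputer = SketchRandomSampleImputer(random_state=0).fit(train)
test_t = imputer.transform(test)
```

Storing `X_` is what makes the fitted object as large as the training columns it covers, which is the memory and data-privacy concern raised earlier.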