ImputationKernel Class
- class miceforest.imputation_kernel.ImputationKernel(data: DataFrame, num_datasets: int = 1, variable_schema: List[str] | Dict[str, List[str]] | None = None, imputation_order: Literal['ascending', 'descending', 'roman', 'latin'] = 'ascending', mean_match_candidates: int | Dict[str, int] = 5, mean_match_strategy: str | Dict[str, str] | None = 'normal', data_subset: int | Dict[str, int] = 0, initialize_empty: bool = False, save_all_iterations_data: bool = True, copy_data: bool = True, random_state: int | RandomState | None = None)
Bases: ImputedData
Creates a kernel dataset. This dataset can perform MICE on itself, and impute new data from models obtained during MICE.
- Parameters:
data (pandas.DataFrame) – The data to be imputed.
variable_schema (None or List[str] or Dict[str, List[str]], default=None) –
Specifies the feature-target relationships used to train models. This parameter also controls which models are built. Models can be built even if a variable contains no missing values, or is not being imputed.
If None, all columns with missing values will have models trained, and all columns will be used as features in these models.
If List[str], all columns in data are used to impute the variables in the list.
If Dict[str, List[str]], the values will be used to impute the keys.
No models will be trained for variables not specified by variable_schema (either by None, a list, or in dict keys).
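The three accepted forms can be pictured as resolving to a mapping from each modeled variable to its predictor columns. The sketch below is a hypothetical helper written for illustration (`resolve_schema` is not part of miceforest), and it assumes that a variable is never used as its own predictor.

```python
from typing import Dict, List, Optional, Union


def resolve_schema(
    columns: List[str],
    columns_with_missing: List[str],
    variable_schema: Optional[Union[List[str], Dict[str, List[str]]]] = None,
) -> Dict[str, List[str]]:
    """Hypothetical helper: map each modeled variable to its predictors."""
    if variable_schema is None:
        # Model every column that has missing values.
        targets = list(columns_with_missing)
    elif isinstance(variable_schema, list):
        # Model only the listed variables.
        targets = list(variable_schema)
    else:
        # Dict form: keys are targets, values are their predictors.
        return {t: list(features) for t, features in variable_schema.items()}
    # None and list forms use every other column as a predictor
    # (assumption: a variable does not predict itself).
    return {t: [c for c in columns if c != t] for t in targets}


columns = ["age", "income", "city"]
schema_none = resolve_schema(columns, ["age", "city"])
schema_list = resolve_schema(columns, ["age", "city"], ["income"])
schema_dict = resolve_schema(columns, ["age", "city"], {"age": ["income"]})
```

In the list form, `income` gets a model even though it has no missing values, which matches the note that models can be built for variables that are not being imputed.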
imputation_order (str, default="ascending") –
The order the imputations should occur in:
ascending: variables are imputed from least to most missing
descending: most to least missing
roman: from left to right in the dataset
arabic: from right to left in the dataset
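As a rough illustration of the four orderings, the sketch below sorts variable names by their missing-value counts. `order_variables` is a hypothetical helper, not a miceforest function, and the handling of ties (original column order is kept) is an assumption.

```python
from typing import Dict, List


def order_variables(missing_counts: Dict[str, int], imputation_order: str) -> List[str]:
    """Hypothetical helper: order variables given their missing-value counts.

    The dict is assumed to list variables in left-to-right dataset order.
    """
    names = list(missing_counts)
    if imputation_order == "ascending":
        return sorted(names, key=lambda n: missing_counts[n])
    if imputation_order == "descending":
        return sorted(names, key=lambda n: missing_counts[n], reverse=True)
    if imputation_order == "roman":
        return names
    if imputation_order == "arabic":
        return names[::-1]
    raise ValueError(f"unknown imputation_order: {imputation_order!r}")


counts = {"a": 10, "b": 2, "c": 7}
ascending = order_variables(counts, "ascending")
descending = order_variables(counts, "descending")
```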
data_subset (None or int or Dict[str, int], default=0) –
Subsets the data used to train the model for each variable, which can save a significant amount of time. The number of rows used for model training and mean matching (candidates) is (# rows in raw data) - (# missing variable values) for each variable. data_subset takes a random sample from these candidates.
If int, must be >= 0. Interpreted as the number of candidates.
If 0, no subsetting is done.
If Dict[str, int], keys must be variable names, and values must follow the two rules above.
This can also help with memory consumption, as the candidate data must be copied to make a feature dataset for lightgbm. It is recommended to carefully select this value for each variable if dealing with very large data that barely fits into memory.
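The candidate arithmetic above can be checked with a small sketch. `candidates_used` is a hypothetical helper, and the behavior when data_subset exceeds the number of available candidates (falling back to the whole pool) is an assumption rather than documented behavior.

```python
def candidates_used(n_rows: int, n_missing: int, data_subset: int = 0) -> int:
    """Hypothetical helper: rows used for training and mean matching."""
    if data_subset < 0:
        raise ValueError("data_subset must be >= 0")
    available = n_rows - n_missing  # rows where the variable is observed
    if data_subset == 0:
        return available  # no subsetting
    # Assumption: a subset larger than the pool falls back to the whole pool.
    return min(data_subset, available)


full_pool = candidates_used(n_rows=1000, n_missing=150)
subset = candidates_used(n_rows=1000, n_missing=150, data_subset=200)
```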
mean_match_strategy (str or Dict[str, str], default="normal") –
There are 3 mean matching strategies included in miceforest:
normal - this is the default. For all predictions, K-nearest-neighbors is performed on the candidate predictions and bachelor predictions. A value from the top MMC (mean_match_candidates) closest candidate values is chosen at random.
fast - Only available for categorical and binary columns. A value is selected at random weighted by the class probabilities.
shap - Similar to “normal” but more robust. A K-nearest-neighbors search is performed on the shap values of the candidate predictions and the bachelor predictions. A value from the top MMC closest candidate values is chosen at random.
A dict of strategies by variable can be passed as well. Any unmentioned variables will be set to the default, “normal”.
mean_match_strategy = {
    'column_1': 'fast',
    'column_2': 'shap',
}
Special rules are enacted when mean_match_candidates == 0 for a variable. See the mean_match_candidates parameter for more information.
mean_match_candidates (int or Dict[str, int], default=5) –
The number of nearest neighbors to choose an imputation value from randomly when mean matching.
Special rules apply when this value is set to 0. This will skip mean matching entirely. The algorithm that applies depends on the objective type:
Regression: The bachelor predictions are used as the imputation values.
Binary: The class with the higher probability is chosen.
Multiclass: The class with the highest probability is chosen.
Setting mean_match_candidates to 0 will result in much faster processing times, but has a few downsides:
Imputation values for regression variables might no longer be valid values. Mean matching ensures that the imputed values have been realized in the data before.
Random variability from mean matching is often desired to get a more accurate view of the variability in imputed “confidence”.
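To make the trade-off concrete, here is a minimal sketch of "normal" mean matching for a single regression value, including the shortcut taken when the number of candidates is 0. It is a simplified illustration using a linear scan rather than miceforest's own nearest-neighbor search, and `mean_match` is a hypothetical name.

```python
import random
from typing import Sequence


def mean_match(
    bachelor_pred: float,
    candidate_preds: Sequence[float],
    candidate_values: Sequence[float],
    mean_match_candidates: int,
    rng: random.Random,
) -> float:
    """Sketch of "normal" mean matching for one missing (bachelor) value."""
    if mean_match_candidates == 0:
        # Mean matching skipped: regression uses the prediction directly.
        return bachelor_pred
    # Rank candidates by how close their prediction is to the bachelor's.
    ranked = sorted(
        range(len(candidate_preds)),
        key=lambda i: abs(candidate_preds[i] - bachelor_pred),
    )
    nearest = ranked[:mean_match_candidates]
    # Pick one of the nearest candidates at random and reuse its observed value.
    return candidate_values[rng.choice(nearest)]


rng = random.Random(0)
candidate_preds = [1.0, 2.1, 2.9, 4.2, 5.0]
candidate_values = [1.2, 1.9, 3.3, 4.0, 5.5]
imputed = mean_match(3.0, candidate_preds, candidate_values, 3, rng)
raw = mean_match(3.0, candidate_preds, candidate_values, 0, rng)
```

With three candidates, `imputed` is always one of the observed values 3.3, 1.9 or 4.0, whereas `raw` is the prediction 3.0, which never occurred in the data.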
initialize_empty (bool, default=False) – If True, missing data is not filled in randomly before model training starts.
save_all_iterations_data (bool, default=True) – Setting to False will cause the process to not store the models and candidate values obtained at each iteration. This can save significant amounts of memory, but it means impute_new_data() will not be callable.
copy_data (bool, default=True) – Should the dataset be copied? If False, the dataset is referenced directly and will be altered in place. If a copy is created, it is saved in self.working_data. There are different ways in which the dataset can be altered.
random_state (None, int, or numpy.random.RandomState) – The random_state ensures script reproducibility. It only ensures reproducible results if the same script is called multiple times. It does not guarantee reproducible results at the record level if a record is imputed multiple different times. If reproducible record-level results are desired, a seed must be passed for each record in the random_seed_array parameter of impute_new_data().
- complete_data(dataset: int = 0, iteration: int = -1, inplace: bool = False, variables: List[str] | None = None)
Return dataset with missing values imputed.
- Parameters:
dataset (int) – The dataset to complete.
iteration (int) – Impute data with values obtained at this iteration. If -1, returns the most up-to-date iterations, even if different between variables. If not -1, iteration must have been saved in imputed values.
inplace (bool) – Should the data be completed in place? If True, self.working_data is imputed, and nothing is returned. This is useful if the dataset is very large. If False, a copy of the data is returned, with missing values imputed.
- Return type:
The completed data, with values imputed for specified variables.
- fit(X, y, **fit_params)
Method for fitting a kernel when used in a sklearn pipeline. Should not be called by the user directly.
- get_feature_importance(dataset: int = 0, iteration: int = -1, importance_type: str = 'split', normalize: bool = True) → DataFrame
Return a matrix of feature importance. The cells represent the normalized feature importance of the columns to impute the rows. This is calculated internally by lightgbm.Booster.feature_importance().
- Parameters:
dataset (int) – The dataset to get the feature importance for.
iteration (int) – The iteration to return the feature importance for. The model must be saved to return importance. Use -1 to specify the latest iteration.
importance_type (str) – Passed to lgb.feature_importance()
normalize (bool) – Whether to normalize the values within each modeled variable to sum to 1.
- Return type:
pandas.DataFrame of importance values. Rows are imputed variables, and columns are predictor variables.
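The normalization described for normalize=True amounts to dividing each row (one modeled variable) by its total. The sketch below shows this on plain dicts with made-up importance values; it is an illustration of the arithmetic, not the library's implementation.

```python
from typing import Dict


def normalize_rows(importance: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Scale each modeled variable's importances so they sum to 1."""
    normalized = {}
    for target, row in importance.items():
        total = sum(row.values())
        # Rows summing to zero map to 0.0 to avoid dividing by zero.
        normalized[target] = {
            feature: (value / total if total else 0.0)
            for feature, value in row.items()
        }
    return normalized


# Hypothetical raw importances: rows are imputed variables, columns are predictors.
raw_importance = {
    "age": {"income": 30.0, "city": 10.0},
    "city": {"age": 5.0, "income": 15.0},
}
normalized = normalize_rows(raw_importance)
```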
- get_model(variable: str, dataset: int, iteration: int = -1)
Returns the model trained for the specified variable, dataset, iteration. Model must have been saved.
- Parameters:
variable (str) – The variable
dataset (int) – The dataset
iteration (int) – The iteration. Use -1 for the latest.
- impute_new_data(new_data: DataFrame, datasets: List[int] | None = None, iterations: int | None = None, save_all_iterations_data: bool = True, copy_data: bool = True, random_state: int | RandomState | None = None, random_seed_array: ndarray | None = None, verbose: bool = False) → ImputedData
Impute a new dataset
Uses the models obtained while running MICE to impute new data, without fitting new models. Pulls mean matching candidates from the original data.
save_models must be > 0. If save_models == 1, the last model obtained in mice is used for every iteration. If save_models > 1, the model obtained at each iteration is used to impute the new data for that iteration. If specified iterations is greater than the number of iterations run so far using mice, the last model is used for each additional iteration.
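The rule for iterations beyond those already run can be sketched as follows, assuming a model was saved at every iteration and that iterations are counted from 1. `model_iteration_for` is a hypothetical helper used only to illustrate the mapping.

```python
def model_iteration_for(requested_iteration: int, iterations_run: int) -> int:
    """Hypothetical helper: which saved model imputes a given iteration.

    Iterations are 1-based. Iterations beyond those run with mice()
    reuse the last model obtained.
    """
    if requested_iteration < 1 or iterations_run < 1:
        raise ValueError("iterations are 1-based and at least one must have been run")
    return min(requested_iteration, iterations_run)


# A kernel that ran 3 iterations of mice(), asked to impute new data for 5.
plan = [model_iteration_for(i, 3) for i in range(1, 6)]
```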
Type checking is not done. It is up to the user to ensure that the kernel data matches the new data being imputed.
- Parameters:
new_data (pandas.DataFrame) – The new data to impute
datasets (int or List[int], default=None) – The datasets from the kernel to use to impute the new data. If None, all datasets from the kernel are used.
iterations (int, default=None) – The number of iterations to run. If None, the same number of iterations run so far in mice is used.
save_all_iterations_data (bool, default=True) – Should the imputation values of all iterations be archived? If False, only the latest imputation values are saved.
copy_data (bool, default=True) – Should the dataset be copied? If False, the dataset is referenced directly, which will cause it to be altered in place.
random_state (None or int or np.random.RandomState, default=None) – The random state of the process. Ensures reproducibility. If None, the random state of the kernel is used. Beware, this permanently alters the random state of the kernel and ensures non-reproducible results, unless the entire process up to this point is re-run.
random_seed_array (None or np.ndarray[uint32, int32, uint64]) –
Record-level seeds.
Ensures deterministic imputations at the record level. random_seed_array causes deterministic imputations for each record no matter what dataset each record is imputed with, assuming the same number of iterations and datasets are used. If random_seed_array is passed, random_state must also be passed.
- Record-level imputations are deterministic if the following conditions are met:
The associated value in random_seed_array is the same.
The same kernel is used.
The same number of iterations are run.
The same number of datasets are run.
Note: Using this parameter may change the global numpy seed by calling np.random.seed().
verbose (bool, default=False) – Should information about the process be printed?
- Return type:
miceforest.ImputedData
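The idea behind record-level seeds can be illustrated without the library: if every record draws from its own seeded generator, its result does not depend on which other records share the batch. This is a conceptual sketch using the standard library's random module, not a description of how miceforest seeds its draws.

```python
import random
from typing import List, Sequence


def impute_records(candidates: Sequence[float], record_seeds: Sequence[int]) -> List[float]:
    """Draw one candidate value per record, seeded by that record's own seed."""
    return [random.Random(seed).choice(candidates) for seed in record_seeds]


candidates = [1.9, 3.3, 4.0, 5.5]
batch_a = impute_records(candidates, [11, 22, 33])
batch_b = impute_records(candidates, [33, 11])  # same records, different batch
```

The records seeded with 11 and 33 receive the same value in both batches, regardless of their position or the other records present.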
- iteration_count(dataset: slice | int = slice(None, None, None), variable: slice | str = slice(None, None, None))
Grabs the iteration count for specified variables, datasets. If the iteration count is not consistent across the provided datasets/variables, an error will be thrown. Providing None will use all datasets/variables.
This is to ensure the process is in a consistent state when the iteration count is needed.
- Parameters:
dataset (None or int) – The dataset to check the iteration count for. If None, all datasets are assumed (and assured) to have the same iteration count, otherwise error.
variable (str or None) – The variable to check the iteration count for. If None, all variables are assumed (and assured) to have the same iteration count, otherwise error.
- Return type:
An integer representing the iteration count.
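The consistency check can be pictured as collapsing all per-dataset, per-variable counts into a single value and failing if more than one distinct value exists. The helper and the (dataset, variable) key layout below are hypothetical.

```python
from typing import Dict, Tuple


def consistent_iteration_count(counts: Dict[Tuple[int, str], int]) -> int:
    """Return the shared iteration count, or raise if the counts disagree.

    Keys are hypothetical (dataset, variable) pairs.
    """
    distinct = set(counts.values())
    if len(distinct) != 1:
        raise ValueError(f"inconsistent iteration counts: {sorted(distinct)}")
    return distinct.pop()


counts = {(0, "age"): 3, (0, "city"): 3, (1, "age"): 3, (1, "city"): 3}
shared = consistent_iteration_count(counts)
```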
- mice(iterations: int, verbose: bool = False, variable_parameters: Dict[str, Any] = {}, **kwlgb)
Perform MICE on a given dataset.
Multiple Imputation by Chained Equations (MICE) is an iterative method which fills in (imputes) missing data points in a dataset by modeling each column using the other columns, and then inferring the missing data.
For more information on MICE, and missing data in general, see Stef van Buuren’s excellent online book: https://stefvanbuuren.name/fimd/ch-introduction.html
For detailed usage information, see this project’s README on the github repository: https://github.com/AnotherSamWilson/miceforest
- Parameters:
iterations (int) – The number of iterations to run.
verbose (bool) – Should information about the process be printed?
variable_parameters (None or dict) –
Model parameters can be specified by variable here. Keys should be variable names or indices, and values should be a dict of parameters which should apply to that variable only.
variable_parameters = {
    'column': {
        'min_sum_hessian_in_leaf': 25.0,
        'extra_trees': True,
    }
}
kwlgb – Additional parameters to pass to lightgbm. Applied to all models.
- plot_feature_importance(dataset, importance_type: str = 'split', normalize: bool = True, iteration: int = -1)
Plot the feature importance. See get_feature_importance() for more details.
- Parameters:
dataset (int) – The dataset to plot the feature importance for.
importance_type (str) – Passed to lgb.feature_importance()
normalize (bool) – Should the values be normalized from 0-1? If False, values are raw from Booster.feature_importance()
kw_plot – Additional arguments sent to sns.heatmap()
- plot_imputed_distributions(variables: List[str] | None = None, iteration: int = -1)
Plot the imputed value distributions. Red lines are the distribution of the original data. Black lines are the distribution of the imputed values.
- Parameters:
datasets (None, int, list[int])
variables (None, list[str]) – The variables to plot. If None, all numeric variables are plotted.
iteration (int) – The iteration to plot the distribution for. If -1, the latest iteration is plotted. save_all_iterations_data must be True if specifying an iteration.
adj_args – Additional arguments passed to plt.subplots_adjust()
- plot_mean_convergence(variables: List[str] | None = None)
Plots the average value and standard deviation of imputations over each iteration. The lines show the average imputation value for a dataset over the iteration. The bars show the average standard deviation of the imputation values within datasets.
- Parameters:
variables (Optional[List[str]], default=None) – The variables to plot. By default, all numeric, imputed variables are plotted.
- transform(X, y=None)
Method for calling a kernel when used in a sklearn pipeline. Should not be called by the user directly.
- tune_parameters(dataset: int = 0, variables: List[str] | None = None, variable_parameters: Dict[str, Any] = {}, parameter_sampling_method: Literal['random'] = 'random', max_reattempts: int = 5, use_gbdt: bool = True, nfold: int = 10, optimization_steps: int = 5, random_state: int | RandomState | None = None, verbose: bool = False, **kwargs)
Perform hyperparameter tuning on models at the current iteration. This method is not meant to be robust, but to get a decent set of parameters to help with imputation. A few notes:
The parameters are tuned on the data that would currently be returned by complete_data(dataset). It is usually a good idea to run at least 1 iteration of mice with the default parameters to get a more accurate idea of the real optimal parameters, since Missing At Random (MAR) data imputations tend to converge over time.
num_iterations is treated as the maximum number of boosting rounds to run in lightgbm.cv. It is NEVER optimized. The num_iterations that is returned is the best_iteration returned by lightgbm.cv. num_iterations can be passed to limit the boosting rounds, but the returned value will always be obtained from best_iteration.
- lightgbm parameters are chosen in the following order of priority:
Anything specified in variable_parameters
Parameters specified globally in **kwbounds
Default tuning space (miceforest.default_lightgbm_parameters)
Default parameters (miceforest.default_lightgbm_parameters.default_parameters)
See examples for a detailed run-through. See https://github.com/AnotherSamWilson/miceforest#Tuning-Parameters for even more detailed examples.
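The priority order listed above behaves like a layered lookup in which the first source that defines a parameter wins. The sketch below models it with collections.ChainMap; all four dictionaries hold made-up values standing in for the real sources.

```python
from collections import ChainMap

# Hypothetical values standing in for each source of parameters.
default_parameters = {"learning_rate": 0.1, "num_leaves": 31, "extra_trees": False}
default_tuning_space = {"num_leaves": (2, 25)}
global_kwargs = {"learning_rate": 0.05}
per_variable = {"num_leaves": 8}

# ChainMap looks keys up left to right, so earlier mappings win.
resolved = dict(
    ChainMap(per_variable, global_kwargs, default_tuning_space, default_parameters)
)
```

Here `num_leaves` comes from the per-variable setting, `learning_rate` from the global keyword arguments, and `extra_trees` from the defaults.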
- Parameters:
dataset (int (required)) – The dataset to run parameter tuning on. Tuning parameters on 1 dataset usually results in acceptable parameters for all datasets. However, tuning results are still stored separately for each dataset.
variables (None or List[str]) –
If None, default hyper-parameter spaces are selected based on kernel data, and all variables with missing values are tuned.
If list, must either be indexes or variable names corresponding to the variables that are to be tuned.
variable_parameters (None or dict) –
Defines the tuning space. Dict keys must be variable names or indices, and a subset of the variables parameter. Values must be a dict with lightgbm parameter names as keys, and values that abide by the following rules:
scalar: If a single value is passed, that parameter will be used to build the model, and will not be tuned.
tuple: If a tuple is passed, it must have length = 2 and will be interpreted as the bounds to search within for that parameter.
list: If a list is passed, values will be randomly selected from the list.
example: If you wish to tune the imputation model for the variable 'column' with specific bounds and parameters, you could pass:
variable_parameters = {
    'column': {
        'learning_rate': 0.01,
        'min_sum_hessian_in_leaf': (0.1, 10),
        'extra_trees': [True, False]
    }
}
All models for variable ‘column’ will have a learning_rate = 0.01. The process will randomly search within the bounds (0.1, 10) for min_sum_hessian_in_leaf, and extra_trees will be randomly selected from the list. Also note, the index of the variable could be passed as the key instead of its name. All other variables will be tuned with the default search space, unless **kwbounds are passed.
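The scalar, tuple and list rules can be sketched as a single random sampling step. `sample_parameters` is a hypothetical helper, and uniform sampling within the tuple bounds is an assumption made for the illustration.

```python
import random
from typing import Any, Dict


def sample_parameters(space: Dict[str, Any], rng: random.Random) -> Dict[str, Any]:
    """Draw one parameter set from a tuning space using the three rules."""
    sampled = {}
    for name, spec in space.items():
        if isinstance(spec, tuple):
            low, high = spec  # bounds to search within
            # Assumption: values are drawn uniformly between the bounds.
            sampled[name] = rng.uniform(low, high)
        elif isinstance(spec, list):
            sampled[name] = rng.choice(spec)  # pick one listed value
        else:
            sampled[name] = spec  # scalar: fixed, never tuned
    return sampled


space = {
    "learning_rate": 0.01,
    "min_sum_hessian_in_leaf": (0.1, 10),
    "extra_trees": [True, False],
}
draw = sample_parameters(space, random.Random(0))
```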
parameter_sampling_method (str) – If random, parameters are randomly selected. Other methods will be added in future releases.
max_reattempts (int) – The maximum number of failures (or non-learners) before the process stops, and moves to the next variable. Failures can be caused by bad parameters passed to lightgbm. Non-learners occur when trees cannot possibly be built (i.e. if min_data_in_leaf > dataset.shape[0]).
use_gbdt (bool) – Whether the models should use gradient boosting instead of random forests. If True, the optimal number of iterations will be found in lgb.cv, along with the other parameters.
nfold (int) – The number of folds to perform cross validation with. More folds take longer, but give a more accurate distribution of the error metric.
optimization_steps (int) – How many steps to run the process for.
random_state (int or np.random.RandomState or None (default=None)) – The random state of the process. Ensures reproducibility. If None, the random state of the kernel is used. Beware, this permanently alters the random state of the kernel and ensures non-reproducible results, unless the entire process up to this point is re-run.
verbose (bool) – Whether to print progress.
kwbounds – Any additional arguments that you want to apply globally to every variable. For example, if you want to limit the number of iterations, you could pass num_iterations = x to this function, and it would apply globally. Custom bounds can also be passed.
- Returns:
optimal_parameters – A dict of the optimal parameters found for each variable. This can be passed directly to the variable_parameters parameter in mice().
- Return type:
dict