miceforest.ImputationKernel

class miceforest.ImputationKernel(data, datasets=1, variable_schema=None, imputation_order='ascending', train_nonmissing=False, mean_match_scheme=None, data_subset=None, categorical_feature='auto', initialization='random', save_all_iterations=True, save_models=1, copy_data=True, save_loggers=False, random_state=None)[source]

Bases: miceforest.ImputedData.ImputedData

Creates a kernel dataset. This dataset can perform MICE on itself, and impute new data from models obtained during MICE.

Parameters
  • data (np.ndarray or pandas DataFrame) –

    The data to be imputed.

  • datasets (int, default=1) –

    The number of datasets to create. Each dataset is an independent set of
    imputation values for the missing data.

  • variable_schema (None or list or dict, default=None) –

    Specifies the feature-target relationships used to train models.
    This parameter also controls which models are built. Models can be built
    even if a variable contains no missing values, or is not being imputed
    (train_nonmissing must be set to True).
    
        - If None, all columns will be used as features in the training of each model.
        - If list, all columns in data are used to impute the variables in the list.
        - If dict, the values will be used to impute the keys. Can be either column
            indices or names (if data is a pd.DataFrame).
    
    No models will be trained for variables not specified by variable_schema
    (either by None, a list, or in dict keys).
    

  • imputation_order (str, list[str], list[int], default="ascending") –

    The order the imputations should occur in. If a string from the
    items below, all variables specified by variable_schema with
    missing data are imputed:
        ascending: variables are imputed from least to most missing
        descending: most to least missing
        roman: from left to right in the dataset
        arabic: from right to left in the dataset.
    If a list is provided:
        - the variables will be imputed in that order.
        - only variables with missing values should be included in the list.
        - must be a subset of variables specified by variable_schema.
    If a variable with missing values is in variable_schema, but not in
    imputation_order, then models to impute that variable will be trained,
    but the actual values will not be imputed. See examples for details.
    

  • train_nonmissing (boolean) –

    Should models be trained for variables with no missing values? Useful if you
    expect you will need to impute new data which will have missing values, but
    the training data is fully observed.
    
    If True, parameters are interpreted like so:
        - models are run for all variables specified by variable_schema
        - if variable_schema is None, models are run for all variables
        - each iteration, models built for fully observed variables are
            always trained after the models trained during mice.
        - imputation_order does not have any effect on fully observed
            variable model training.
    
    WARNING: Setting this to True without specifying a variable schema will build
    models for all variables in the dataset, whether they have missing values or
    not. This may or may not be what you want.
    

  • data_subset (None or int or float or dict.) –

    Subsets the data used in each iteration, which can save a significant amount of time.
    This can also help with memory consumption, as the candidate data must be copied to
    make a feature dataset for lightgbm.
    
    The number of candidate rows available for each variable is
    (# rows in raw data) - (# missing values for that variable). data_subset takes
    a random sample of this.
    
    If float, must satisfy 0.0 < data_subset <= 1.0. Interpreted as a percentage of available candidates.
    If int, must satisfy data_subset >= 0. Interpreted as the number of candidates.
    If 0, no subsetting is done.
    If dict, keys must be variable names, and values must follow the two rules above.
    
    It is recommended to carefully select this value for each variable if dealing
    with very large data that barely fits into memory.
    

  • mean_match_scheme (miceforest.MeanMatchScheme, default=None) –

    An instance of the miceforest.MeanMatchScheme class.
    
    If None is passed, a sensible default scheme is used. There are multiple helpful
    schemes that can be accessed from miceforest.builtin_mean_match_schemes, or
    you can build your own.
    
    A description of the defaults:
    - mean_match_default (default, if mean_match_scheme is None)
        This scheme has medium speed and accuracy for most data.
    
        Categorical:
            If mmc = 0, the class with the highest probability is chosen.
            If mmc > 0, get N nearest neighbors from class probabilities.
                Select 1 at random.
        Numeric:
            If mmc = 0, the predicted value is used
            If mmc > 0, obtain the mmc closest candidate
                predictions and collect the associated
                real candidate values. Choose 1 randomly.
    
    - mean_match_shap
        This scheme is the most accurate, but takes the longest.
        It works the same as mean_match_default, except all nearest
        neighbor searches are performed on the shap values of the
        predictions, instead of the predictions themselves.
    
    - mean_match_scheme_fast_cat:
        This scheme is faster for categorical variables,
        but may be less accurate as well.
    
        Categorical:
            If mmc = 0, the class with the highest probability is chosen.
            If mmc > 0, return class based on random draw weighted by
                class probability for each sample.
        Numeric or binary:
            If mmc = 0, the predicted value is used
            If mmc > 0, obtain the mmc closest candidate
                predictions and collect the associated
                real candidate values. Choose 1 randomly.
    

  • categorical_feature (str or list, default="auto") –

    The categorical features in the dataset. Handling depends on the class of the data:
    
        pandas DataFrame:
            - "auto": categorical information is inferred from any columns with
                datatype category or object.
            - list of column names (or indices): Useful if all categorical columns
                have already been cast to numeric encodings of some type, otherwise you
                should just use "auto". Will throw an error if a list is provided AND
                categorical dtypes exist in data. If a list is provided, values in the
                columns must be consecutive integers starting at 0, as required by lightgbm.
    
        numpy ndarray:
            - "auto": no categorical information is stored.
            - list of column indices: Specified columns are treated as categorical. Column
                values must be consecutive integers starting at 0, as required by lightgbm.
    

  • initialization (str) –

    "random" - missing values will be filled in randomly from existing values.
    "empty" - lightgbm will start MICE without initial imputation
    

  • save_all_iterations (boolean, optional(default=True)) –

    Save all the imputation values from all iterations, or just
    the latest. Saving all iterations allows for additional
    plotting, but may take more memory.
    

  • save_models (int) –

    Which models should be saved:
        = 0: no models are saved. Cannot get feature importance or
            impute new data.
        = 1: only the last model iteration is saved. Can only get
            feature importance of last iteration. New data is
            imputed using the last model for all specified iterations.
            This is only an issue if data is heavily Missing At Random.
        = 2: all model iterations are saved. Can get feature importance
            for any iteration. When imputing new data, each iteration is
            imputed using the model obtained at that iteration in mice.
            This allows for imputations that most closely resemble those
            that would have been obtained in mice.
    

  • copy_data (boolean (default = True)) –

    Should the data be copied before imputation? If False, the original dataset is
    referenced directly by self.working_data and will be altered in place. If True,
    a copy is created and saved in self.working_data. There are different ways in
    which the dataset can be altered:
    
    1) complete_data() will fill in missing values
    2) To save space, mice() references and manipulates self.working_data directly.
        If self.working_data is a reference to the original dataset, the original
        dataset will undergo these manipulations during the mice process.
        At the end of the mice process, missing values will be set back to np.NaN
        where they were originally missing.
    

  • save_loggers (boolean (default = False)) –

    A logger is created each time mice() or impute_new_data() is called.
    If True, the loggers are stored in a list ImputationKernel.loggers.
    If you wish to start saving logs, call ImputationKernel.start_logging().
    If you wish to stop saving logs, call ImputationKernel.stop_logging().
    

  • random_state (None, int, or numpy.random.RandomState) –

    The random_state ensures script reproducibility. It only ensures reproducible
    results if the same script is called multiple times. It does not guarantee
    reproducible results at the record level if a record is imputed multiple
    different times. If reproducible record-level results are desired, a seed must
    be passed for each record in the random_seed_array parameter of impute_new_data().
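
A minimal usage sketch tying the parameters above together. The toy data, column names, and missingness pattern are invented for illustration; only the ImputationKernel arguments shown are taken from the signature above.

    import numpy as np
    import pandas as pd
    import miceforest as mf

    # Toy data with roughly 20% of the values set to missing (names are illustrative).
    rng = np.random.default_rng(0)
    n = 200
    df = pd.DataFrame({
        "age": rng.normal(40, 10, size=n),
        "income": rng.normal(50_000, 10_000, size=n),
        "group": pd.Categorical(rng.choice(["a", "b", "c"], size=n)),
    })
    for col in df.columns:
        df.loc[rng.random(n) < 0.2, col] = np.nan

    kernel = mf.ImputationKernel(
        df,
        datasets=4,                    # four independently imputed datasets
        imputation_order="ascending",  # least-missing variables are imputed first
        data_subset=0,                 # 0 means no candidate subsetting
        save_all_iterations=True,
        random_state=1,
    )
    kernel.mice(3)                        # run 3 iterations of MICE on every dataset
    completed = kernel.complete_data(0)   # DataFrame with the missing values filled in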
    

__init__(data, datasets=1, variable_schema=None, imputation_order='ascending', train_nonmissing=False, mean_match_scheme=None, data_subset=None, categorical_feature='auto', initialization='random', save_all_iterations=True, save_models=1, copy_data=True, save_loggers=False, random_state=None)[source]

Methods

__init__(data[, datasets, variable_schema, ...])

append(imputation_kernel)

Combine two imputation kernels together.

compile_candidate_preds()

Candidate predictions can be pre-generated before imputing new data.

complete_data([dataset, iteration, inplace, ...])

Return dataset with missing values imputed.

dataset_count()

Return the number of datasets.

delete_candidate_preds()

Deletes the pre-computed candidate predictions.

fit(X, y, **fit_params)

Method for fitting a kernel when used in a sklearn pipeline.

get_correlations(datasets, variables)

Return the correlations between datasets for the specified variables.

get_feature_importance(dataset[, iteration])

Return a matrix of feature importance.

get_means(datasets[, variables])

Return a dict containing the average imputation value for specified variables at each iteration.

get_model(dataset, variable[, iteration])

Return the model for a specific dataset, variable, iteration.

get_raw_prediction(variable[, imp_dataset, ...])

Get the raw model output for a specific variable.

impute_new_data(new_data[, datasets, ...])

Impute a new dataset

iteration_count([datasets, variables])

Grabs the iteration count for specified variables, datasets.

mice([iterations, verbose, ...])

Perform mice on a given dataset.

plot_correlations([datasets, variables])

Plot the correlations between datasets.

plot_feature_importance(dataset[, ...])

Plot the feature importance.

plot_imputed_distributions([datasets, ...])

Plot the imputed value distributions.

plot_mean_convergence([datasets, variables])

Plots the average value of imputations over each iteration.

save_kernel(filepath[, clevel, cname, ...])

Compresses and saves the kernel to a file.

start_logging()

Start saving loggers to self.loggers

stop_logging()

Stop saving loggers to self.loggers

transform(X[, y])

Method for calling a kernel when used in a sklearn pipeline.

tune_parameters(dataset[, variables, ...])

Perform hyperparameter tuning on models at the current iteration.

append(imputation_kernel)[source]

Combine two imputation kernels together. For compatibility, the following attributes of each must be equal:

  • working_data

  • iteration_count

  • categorical_feature

  • mean_match_scheme

  • variable_schema

  • imputation_order

  • save_models

  • save_all_iterations

Only cursory checks are done to ensure working_data is equal. Appending a kernel with different working_data could ruin this kernel.

Parameters

imputation_kernel (ImputationKernel) – The kernel to merge.
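
For example, kernels built on the same data can be run separately (perhaps in different processes) and merged afterwards. A hedged sketch, reusing the toy df from the earlier example:

    import miceforest as mf

    # Two kernels over identical data, differing only in random_state.
    k1 = mf.ImputationKernel(df, datasets=2, random_state=1)
    k2 = mf.ImputationKernel(df, datasets=2, random_state=2)
    k1.mice(3)
    k2.mice(3)

    # Merge: k1 now holds 4 datasets, each at iteration 3.
    k1.append(k2)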

compile_candidate_preds()[source]

Candidate predictions can be pre-generated before imputing new data. This can save a substantial amount of time, especially if save_models == 1.
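
A brief sketch of the intended call order; kernel and new_df stand in for a fitted kernel and a frame to impute:

    kernel.compile_candidate_preds()          # pre-compute candidate predictions once
    imputed = kernel.impute_new_data(new_df)  # later imputations reuse the compiled predictions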

complete_data(dataset=0, iteration=None, inplace=False, variables=None)

Return dataset with missing values imputed.

Parameters
  • dataset (int) – The dataset to complete.

  • iteration (int) – Impute data with values obtained at this iteration. If None, returns the most up-to-date iterations, even if different between variables. If not None, the iteration must have been saved in the imputed values.

  • inplace (bool) – Should the data be completed in place? If True, self.working_data is imputed, and nothing is returned. This is useful if the dataset is very large. If False, a copy of the data is returned, with missing values imputed.

Returns

The completed data, with values imputed for specified variables.
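
For instance, given a kernel that has already run mice() with save_all_iterations=True, a brief sketch:

    # Latest iteration of dataset 0, returned as a copy.
    latest = kernel.complete_data(dataset=0)

    # A specific saved iteration, written into self.working_data instead of copied.
    kernel.complete_data(dataset=1, iteration=2, inplace=True)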

dataset_count()

Return the number of datasets. Datasets are defined by how many different sets of imputation values we have accumulated.

delete_candidate_preds()[source]

Deletes the pre-computed candidate predictions.

fit(X, y, **fit_params)[source]

Method for fitting a kernel when used in a sklearn pipeline. Should not be called by the user directly.

get_correlations(datasets, variables)

Return the correlations between datasets for the specified variables.

Parameters

variables (list[str], list[int]) – The variables to return the correlations for.

Returns

The correlations at each iteration for the specified variables.

Return type

dict

get_feature_importance(dataset, iteration=None)[source]

Return a matrix of feature importance. The cells represent the normalized feature importance of the columns to impute the rows. This is calculated internally by lightgbm.Booster.feature_importance().

Parameters
  • dataset (int) – The dataset to get the feature importance for.

  • iteration (int) – The iteration to return the feature importance for. Right now, the model must be saved to return importance.

Returns

np.ndarray of importance values. Rows are imputed variables, and columns are predictor variables.
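
A short sketch; kernel is a fitted ImputationKernel and the model for the requested iteration must have been saved:

    # Importance of each predictor (columns) for each imputed variable (rows).
    importance = kernel.get_feature_importance(dataset=0)
    print(importance.shape)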

get_means(datasets, variables=None)

Return a dict containing the average imputation value for specified variables at each iteration.

get_model(dataset, variable, iteration=None)[source]

Return the model for a specific dataset, variable, iteration.

Parameters
  • dataset (int) – The dataset to return the model for.

  • variable (str) – The variable that was imputed.

  • iteration (int) – The model iteration to return. Keep in mind that if save_models == 1, only the model from the latest iteration was saved. If None is provided, the latest model is returned.

Returns

lightgbm.Booster – The model used to impute this specific variable and iteration.
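
A brief sketch; the column name "age" is hypothetical:

    # lightgbm Booster used to impute "age" in dataset 0 at the latest saved iteration.
    booster = kernel.get_model(dataset=0, variable="age")
    print(booster.num_trees())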

get_raw_prediction(variable, imp_dataset=0, imp_iteration=None, model_dataset=None, model_iteration=None, dtype=None)[source]

Get the raw model output for a specific variable.

The data is pulled from the imp_dataset dataset, at the imp_iteration iteration. The model is pulled from model_dataset dataset, at the model_iteration iteration.

So, for example, it is possible to get predictions using the imputed values for dataset 3, at iteration 2, using the model obtained from dataset 10, at iteration 6. This is assuming desired iterations and models have been saved.

Parameters
  • variable (int or str) – The variable to get the raw predictions for. Can be an index or variable name.

  • imp_dataset (int) – The imputation dataset to use when creating the feature dataset.

  • imp_iteration (int) – The iteration from which to draw the imputation values when creating the feature dataset. If None, the latest iteration is used.

  • model_dataset (int) – The dataset from which to pull the trained model for this variable. If None, it is selected to be the same as imp_dataset.

  • model_iteration (int) – The iteration from which to pull the trained model for this variable. If None, it is selected to be the same as imp_iteration.

  • dtype (str, np.dtype) – The datatype to cast the raw prediction as. Passed to MeanMatchScheme.model_predict().

Returns

np.ndarray of raw predictions.
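
Mirroring the scenario described above, a hedged sketch; the variable name and index values are illustrative and assume the relevant datasets, iterations, and models were saved:

    preds = kernel.get_raw_prediction(
        variable="age",     # hypothetical column name
        imp_dataset=3,      # features built from the imputed values of dataset 3 ...
        imp_iteration=2,    # ... at iteration 2
        model_dataset=10,   # model pulled from dataset 10 ...
        model_iteration=6,  # ... at iteration 6
    )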

impute_new_data(new_data, datasets=None, iterations=None, save_all_iterations=True, copy_data=True, random_state=None, random_seed_array=None, verbose=False)[source]

Impute a new dataset

Uses the models obtained while running MICE to impute new data, without fitting new models. Pulls mean matching candidates from the original data.

save_models must be > 0. If save_models == 1, the last model obtained in mice is used for every iteration. If save_models > 1, the model obtained at each iteration is used to impute the new data for that iteration. If specified iterations is greater than the number of iterations run so far using mice, the last model is used for each additional iteration.

Type checking is not done. It is up to the user to ensure that the kernel data matches the new data being imputed.

Parameters
  • new_data (pandas DataFrame or numpy ndarray) – The new data to impute

  • datasets (int or List[int] (default = None)) – The datasets from the kernel to use to impute the new data. If None, all datasets from the kernel are used.

  • iterations (int) – The number of iterations to run. If None, the same number of iterations run so far in mice is used.

  • save_all_iterations (bool) – Should the imputation values of all iterations be archived? If False, only the latest imputation values are saved.

  • copy_data (boolean) –

    Should the data be copied before imputation? If False, the dataset is referenced directly and will be altered in place. If a copy is created, it is saved in self.working_data. There are different ways in which the dataset can be altered:

    1. complete_data() will fill in missing values

    2. mice() references and manipulates self.working_data directly.

  • random_state (int or np.random.RandomState or None (default=None)) – The random state of the process. Ensures reproducibility. If None, the random state of the kernel is used. Beware, this permanently alters the random state of the kernel and results in non-reproducible imputations unless the entire process up to this point is re-run.

  • random_seed_array (None or np.ndarray (int32)) –

    Record-level seeds.
    
    Ensures deterministic imputations at the record level. random_seed_array causes
    deterministic imputations for each record no matter which dataset each record is
    imputed with, assuming the same number of iterations and datasets are used.
    If random_seed_array is passed, random_state must also be passed.
    
    Record-level imputations are deterministic if the following conditions are met:
        1) The associated seed is the same.
        2) The same kernel is used.
        3) The same number of iterations are run.
        4) The same number of datasets are run.
    
    Notes:
        a) This will slightly slow down the imputation process, because random
        number generation in numpy can no longer be vectorized. If you don't have a
        specific need for deterministic imputations at the record level, it is better to
        keep this parameter as None.
    
        b) Using this parameter may change the global numpy seed by calling np.random.seed().
    
        c) Internally, these seeds are hashed each time they are used, in order
        to obtain different results for each dataset / iteration.
    

  • verbose (boolean) – Should information about the process be printed?

Returns

miceforest.ImputedData
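
A hedged sketch, assuming new_df has the same columns and dtypes as the kernel data and the kernel holds at least two datasets:

    import numpy as np

    seeds = np.arange(len(new_df), dtype="int32")   # one seed per record

    imputed = kernel.impute_new_data(
        new_data=new_df,
        datasets=[0, 1],          # only use the first two kernel datasets
        iterations=None,          # reuse the iteration count reached in mice()
        random_state=4,           # required when random_seed_array is passed
        random_seed_array=seeds,  # deterministic record-level imputations
    )
    new_completed = imputed.complete_data(dataset=0)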

iteration_count(datasets=None, variables=None)

Grabs the iteration count for specified variables, datasets. If the iteration count is not consistent across the provided datasets/variables, an error will be thrown. Providing None will use all datasets/variables.

This is to ensure the process is in a consistent state when the iteration count is needed.

Parameters
  • datasets (int or list[int]) – The datasets to check the iteration count for.

  • variables (int, str, list[int], or list[str]) – The variables to check the iteration count for. Variables can be specified by their names or indices.

Returns

An integer representing the iteration count.

mice(iterations=2, verbose=False, variable_parameters=None, compile_candidates=False, **kwlgb)[source]

Perform mice on a given dataset.

Multiple Imputation by Chained Equations (MICE) is an iterative method which fills in (imputes) missing data points in a dataset by modeling each column using the other columns, and then inferring the missing data.

For more information on MICE, and missing data in general, see Stef van Buuren’s excellent online book: https://stefvanbuuren.name/fimd/ch-introduction.html

For detailed usage information, see this project’s README on the github repository: https://github.com/AnotherSamWilson/miceforest

Parameters
  • iterations (int) – The number of iterations to run.

  • verbose (bool) – Should information about the process be printed?

  • variable_parameters (None or dict) – Model parameters can be specified by variable here. Keys should be variable names or indices, and values should be a dict of parameters which should apply to that variable only.

  • compile_candidates (bool) – Candidate predictions can be stored as they are created while performing mice. This prevents kernel.compile_candidate_preds() from having to be called separately, and can save a significant amount of time if compiled candidate predictions are desired.

  • kwlgb – Additional arguments to pass to lightgbm. Applied to all models.
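
A short sketch; "income" is a hypothetical column name, and min_sum_hessian_in_leaf and num_leaves are ordinary lightgbm parameters chosen only for illustration:

    kernel.mice(
        iterations=3,
        verbose=True,
        variable_parameters={"income": {"min_sum_hessian_in_leaf": 0.5}},
        compile_candidates=True,
        num_leaves=16,            # **kwlgb: passed to lightgbm for every model
    )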

plot_correlations(datasets=None, variables=None, **adj_args)

Plot the correlations between datasets. See get_correlations() for more details.

Parameters
  • datasets (None or list[int]) – The datasets to plot.

  • variables (None, list) – The variables to plot.

  • adj_args – Additional arguments passed to plt.subplots_adjust()

plot_feature_importance(dataset, normalize=True, iteration=None, **kw_plot)[source]

Plot the feature importance. See get_feature_importance() for more details.

Parameters
  • dataset (int) – The dataset to plot the feature importance for.

  • iteration (int) – The iteration to plot the feature importance of.

  • normalize (bool) – Should the values be normalized between 0 and 1? If False, values are raw from Booster.feature_importance().

  • kw_plot – Additional arguments sent to sns.heatmap()

plot_imputed_distributions(datasets=None, variables=None, iteration=None, **adj_args)

Plot the imputed value distributions. Red lines are the distribution of the original data. Black lines are the distribution of the imputed values.

Parameters
  • datasets (None, int, list[int]) – The datasets to plot.

  • variables (None, str, int, list[str], or list[int]) – The variables to plot. If None, all numeric variables are plotted.

  • iteration (None, int) – The iteration to plot the distribution for. If None, the latest iteration is plotted. save_all_iterations must be True if specifying an iteration.

  • adj_args – Additional arguments passed to plt.subplots_adjust()

plot_mean_convergence(datasets=None, variables=None, **adj_args)

Plots the average value of imputations over each iteration.

Parameters
  • variables (None or list) – The variables to plot. Must be numeric.

  • adj_args – Passed to matplotlib.pyplot.subplots_adjust()

save_kernel(filepath, clevel=None, cname=None, n_threads=None, copy_while_saving=True)[source]

Compresses and saves the kernel to a file.

Parameters
  • filepath (str) – The file to save to.

  • clevel (int) – The compression level, sent to clevel argument in blosc.compress()

  • cname (str) – The compression algorithm used. Sent to cname argument in blosc.compress. If None is specified, the default is lz4hc.

  • n_threads (int) – The number of threads to use for compression. By default, all threads are used.

  • copy_while_saving (boolean) – Should the kernel be copied while saving? Copying is safer, but may take more memory.
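
A minimal sketch; the filename is arbitrary:

    # Compress and write the kernel to disk using the default compression settings.
    kernel.save_kernel("kernel.mf", n_threads=2)

A matching loader (load_kernel, typically found in miceforest.utils) exists in recent releases, but it is not documented on this page, so treat its exact location as an assumption.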

start_logging()[source]

Start saving loggers to self.loggers

stop_logging()[source]

Stop saving loggers to self.loggers

transform(X, y=None)[source]

Method for calling a kernel when used in a sklearn pipeline. Should not be called by the user directly.

tune_parameters(dataset, variables=None, variable_parameters=None, parameter_sampling_method='random', nfold=10, optimization_steps=5, random_state=None, verbose=False, **kwbounds)[source]

Perform hyperparameter tuning on models at the current iteration.

A few notes:
- Underlying models will now be gradient boosted trees by default (or any
    other boosting type compatible with lightgbm.cv).
- The parameters are tuned on the data that would currently be returned by
    complete_data(dataset). It is usually a good idea to run at least 1 iteration
    of mice with the default parameters to get a more accurate idea of the
    real optimal parameters, since Missing At Random (MAR) data imputations
    tend to converge over time.
- num_iterations is treated as the maximum number of boosting rounds to run
    in lightgbm.cv. It is NEVER optimized. The num_iterations that is returned
    is the best_iteration returned by lightgbm.cv. num_iterations can be passed to
    limit the boosting rounds, but the returned value will always be obtained
    from best_iteration.
- lightgbm parameters are chosen in the following order of priority:
    1) Anything specified in variable_parameters
    2) Parameters specified globally in **kwbounds
    3) Default tuning space (miceforest.default_lightgbm_parameters.make_default_tuning_space)
    4) Default parameters (miceforest.default_lightgbm_parameters.default_parameters)
- See examples for a detailed run-through. See
    https://github.com/AnotherSamWilson/miceforest#Tuning-Parameters
    for even more detailed examples.

Parameters
  • dataset (int (required)) –

    The dataset to run parameter tuning on. Tuning parameters on 1 dataset usually results
    in acceptable parameters for all datasets. However, tuning results are still stored
    separately for each dataset.
    

  • variables (None or list) –

    - If None, default hyper-parameter spaces are selected based on kernel data, and
    all variables with missing values are tuned.
    - If list, must either be indexes or variable names corresponding to the variables
    that are to be tuned.
    

  • variable_parameters (None or dict) –

    Defines the tuning space. Dict keys must be variable names or indices, and a subset
    of the variables parameter. Values must be a dict with lightgbm parameter names as
    keys, and values that abide by the following rules:
        scalar: If a single value is passed, that parameter will be used to build the
            model, and will not be tuned.
        tuple: If a tuple is passed, it must have length = 2 and will be interpreted as
            the bounds to search within for that parameter.
        list: If a list is passed, values will be randomly selected from the list.
            NOTE: This is only possible with method = 'random'.
    
    example: If you wish to tune the imputation model for the 4th variable with specific
    bounds and parameters, you could pass:
        variable_parameters = {
            4: {
                'learning_rate': 0.01,
                'min_sum_hessian_in_leaf': (0.1, 10),
                'extra_trees': [True, False]
            }
        }
    All models for variable 4 will have a learning_rate = 0.01. The process will randomly
    search within the bounds (0.1, 10) for min_sum_hessian_in_leaf, and extra_trees will
    be randomly selected from the list. Also note, the variable name for the 4th column
    could also be passed instead of the integer 4. All other variables will be tuned with
    the default search space, unless **kwbounds are passed.
    

  • parameter_sampling_method (str) –

    If 'random', parameters are randomly selected.
    Other methods will be added in future releases.
    

  • nfold (int) –

    The number of folds to perform cross validation with. More folds take longer, but
    give a more accurate distribution of the error metric.
    

  • optimization_steps

    How many steps to run the process for.
    

  • random_state (int or np.random.RandomState or None (default=None)) –

    The random state of the process. Ensures reproducibility. If None, the random state
    of the kernel is used. Beware, this permanently alters the random state of the kernel
    and results in non-reproducible results unless the entire process up to this point
    is re-run.
    

  • kwbounds

    Any additional arguments that you want to apply globally to every variable.
    For example, if you want to limit the number of iterations, you could pass
    num_iterations = x to this function, and it would apply globally. Custom
    bounds can also be passed.
    

Returns

2 dicts: (optimal_parameters, optimal_parameter_losses)

  • optimal_parameters (dict) – A dict of the optimal parameters found for each variable. This can be passed directly to the variable_parameters parameter in mice().

    {variable: {parameter_name: parameter_value}}

  • optimal_parameter_losses (dict) – The average out-of-fold cv loss obtained directly from lightgbm.cv(), associated with the optimal parameter set.

    {variable: loss}
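
A hedged end-to-end sketch of the workflow described above; num_iterations is passed through **kwbounds only to cap the boosting rounds used by lightgbm.cv:

    # Tune on dataset 0, then feed the winning parameters back into mice().
    optimal_parameters, optimal_parameter_losses = kernel.tune_parameters(
        dataset=0,
        optimization_steps=10,
        random_state=5,
        num_iterations=200,    # **kwbounds: applied globally to every variable
    )
    kernel.mice(2, variable_parameters=optimal_parameters)
    print(optimal_parameter_losses)   # {variable: best cv loss per variable}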