miceforest.ImputedData

class miceforest.ImputedData(impute_data, datasets=5, variable_schema=None, imputation_order='ascending', train_nonmissing=False, categorical_feature='auto', save_all_iterations=True, copy_data=True)

Bases: object

Imputed Data

This class should not be instantiated directly. Instead, an instance is returned when ImputationKernel.impute_new_data() is called. For descriptions of the parameters, see the ImputationKernel documentation.

__init__(impute_data, datasets=5, variable_schema=None, imputation_order='ascending', train_nonmissing=False, categorical_feature='auto', save_all_iterations=True, copy_data=True)

Methods

__init__(impute_data[, datasets, ...])

complete_data([dataset, iteration, inplace, ...])

Return dataset with missing values imputed.

dataset_count()

Return the number of datasets.

get_correlations(datasets, variables)

Return the correlations between datasets for the specified variables.

get_means(datasets[, variables])

Return a dict containing the average imputation value for specified variables at each iteration.

iteration_count([datasets, variables])

Return the iteration count for the specified variables and datasets.

plot_correlations([datasets, variables])

Plot the correlations between datasets.

plot_imputed_distributions([datasets, ...])

Plot the imputed value distributions.

plot_mean_convergence([datasets, variables])

Plots the average value of imputations over each iteration.

complete_data(dataset=0, iteration=None, inplace=False, variables=None)

Return dataset with missing values imputed.

Parameters
  • dataset (int) – The dataset to complete.

  • iteration (int) – Impute data with the values obtained at this iteration. If None, the most recent iteration of each variable is used, even if that iteration differs between variables. If not None, the iteration must have been saved in the imputed values.

  • inplace (bool) – Whether to complete the data in place. If True, self.working_data is imputed and nothing is returned, which is useful when the dataset is very large. If False, a copy of the data is returned with missing values imputed.

Returns

The completed data, with values imputed for specified variables.
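The difference between inplace=True and inplace=False can be illustrated with a minimal standalone sketch. This is not miceforest's implementation: the function below, the dict-of-lists data layout, and the imputed_values mapping are hypothetical stand-ins chosen only to show the copy-versus-mutate behaviour described above.

```python
import copy


def complete_data(working_data, imputed_values, inplace=False):
    """Fill missing entries (None) with stored imputation values.

    working_data:   dict mapping column name -> list of values (None = missing)
    imputed_values: dict mapping column name -> {row index: imputed value}
    Both structures are illustrative assumptions, not the library's own.
    """
    # Either write into the caller's data or into a deep copy of it.
    target = working_data if inplace else copy.deepcopy(working_data)
    for column, by_row in imputed_values.items():
        for row, value in by_row.items():
            target[column][row] = value
    # Mirror the documented contract: nothing is returned when inplace=True.
    return None if inplace else target


data = {"age": [31, None, 45], "income": [None, 52000, 61000]}
imputations = {"age": {1: 38}, "income": {0: 48000}}

completed = complete_data(data, imputations)              # a filled copy
original_untouched = data["age"][1] is None               # source still has the gap
result = complete_data(data, imputations, inplace=True)   # mutates data itself
```

With inplace=False the caller pays for a full copy, which is why the in-place mode is suggested for very large datasets.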

dataset_count()

Return the number of datasets. Datasets are defined by how many different sets of imputation values we have accumulated.

get_correlations(datasets, variables)

Return the correlations between datasets for the specified variables.

Parameters

variables (list[str], list[int]) – The variables to return the correlations for.

Returns

The correlations at each iteration for the specified variables.

Return type

dict

get_means(datasets, variables=None)

Return a dict containing the average imputation value for specified variables at each iteration.
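As a rough illustration of what "average imputation value at each iteration" means, the sketch below averages the imputed values stored per variable and per iteration. It is a standalone approximation, not the library's code: the imputation_history layout and the function body are assumptions, and the real method also takes a datasets argument that this sketch omits.

```python
from statistics import mean


def get_means(imputation_history, variables=None):
    """Average the imputed values of each variable at each iteration.

    imputation_history: {variable name: {iteration: [imputed values]}}
    (a hypothetical layout for a single dataset).
    """
    names = list(imputation_history) if variables is None else variables
    return {
        name: {
            iteration: mean(values)
            for iteration, values in imputation_history[name].items()
        }
        for name in names
    }


history = {
    "age": {0: [30, 40], 1: [34, 38]},
    "income": {0: [50000, 54000], 1: [51000, 53000]},
}

all_means = get_means(history)
age_means = get_means(history, variables=["age"])
```

Tracking these per-iteration means is what plot_mean_convergence() visualises: values that stop drifting from one iteration to the next suggest the imputations have stabilised.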

iteration_count(datasets=None, variables=None)

Return the iteration count for the specified datasets and variables. If the iteration count is not consistent across the provided datasets and variables, an error is raised. Passing None uses all datasets or all variables.

This is to ensure the process is in a consistent state when the iteration count is needed.

Parameters
  • datasets (int or list[int]) – The datasets to check the iteration count for.

  • variables (int, str, list[int] or list[str]) – The variables to check the iteration count for. Variables can be specified by their names or indexes.

Returns

An integer representing the iteration count.
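The consistency check described above can be sketched as follows. This is a simplified standalone illustration under stated assumptions, not miceforest's implementation: the iterations mapping is a hypothetical structure, ValueError is an assumed error type, and variables are addressed by name only (the real method also accepts indexes).

```python
def iteration_count(iterations, datasets=None, variables=None):
    """Return the single iteration count shared by the selection.

    iterations: {dataset index: {variable name: iterations completed}}
    (a hypothetical layout used only for this sketch).
    """
    # None means "use everything"; scalars are wrapped into lists.
    if datasets is None:
        datasets = list(iterations)
    elif isinstance(datasets, int):
        datasets = [datasets]
    if isinstance(variables, str):
        variables = [variables]

    counts = set()
    for ds in datasets:
        per_variable = iterations[ds]
        names = list(per_variable) if variables is None else variables
        counts.update(per_variable[name] for name in names)

    # Exactly one distinct count means the selection is in a consistent state.
    if len(counts) != 1:
        raise ValueError(f"Inconsistent iteration counts: {sorted(counts)}")
    return counts.pop()


iterations = {0: {"age": 3, "income": 3}, 1: {"age": 3, "income": 2}}

count_ds0 = iteration_count(iterations, datasets=0)
count_age = iteration_count(iterations, variables="age")
try:
    iteration_count(iterations)  # "income" differs between datasets 0 and 1
    inconsistent = False
except ValueError:
    inconsistent = True
```

Failing loudly on a mismatch, rather than returning the minimum or maximum, keeps callers from silently mixing imputations taken from different iterations.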

plot_correlations(datasets=None, variables=None, **adj_args)

Plot the correlations between datasets. See get_correlations() for more details.

Parameters
  • datasets (None or list[int]) – The datasets to plot.

  • variables (None or list) – The variables to plot.

  • adj_args – Additional arguments passed to plt.subplots_adjust()

plot_imputed_distributions(datasets=None, variables=None, iteration=None, **adj_args)

Plot the imputed value distributions. Red lines are the distribution of the original data. Black lines are the distribution of the imputed values.

Parameters
  • datasets (None, int, list[int]) – The datasets to plot.

  • variables (None, str, int, list[str], or list[int]) – The variables to plot. If None, all numeric variables are plotted.

  • iteration (None, int) – The iteration to plot the distribution for. If None, the latest iteration is plotted. save_all_iterations must be True if specifying an iteration.

  • adj_args – Additional arguments passed to plt.subplots_adjust()

plot_mean_convergence(datasets=None, variables=None, **adj_args)

Plots the average value of imputations over each iteration.

Parameters
  • variables (None or list) – The variables to plot. Must be numeric.

  • adj_args – Passed to matplotlib.pyplot.subplots_adjust()