miceforest.ImputedData

class miceforest.ImputedData(impute_data, datasets=5, variable_schema=None, imputation_order='ascending', train_nonmissing=False, categorical_feature='auto', save_all_iterations=True, copy_data=True)[source]

Bases: object

Imputed Data

This class should not be instantiated directly. Instead, it is returned when ImputationKernel.impute_new_data() is called. For parameter arguments, see ImputationKernel documentation.

__init__(impute_data, datasets=5, variable_schema=None, imputation_order='ascending', train_nonmissing=False, categorical_feature='auto', save_all_iterations=True, copy_data=True)[source]

Methods

`__init__`(impute_data[, datasets, ...])
`complete_data`([dataset, iteration, inplace, ...])	Return dataset with missing values imputed.
`dataset_count`()	Return the number of datasets.
`get_correlations`(datasets, variables)	Return the correlations between datasets for the specified variables.
`get_means`(datasets[, variables])	Return a dict containing the average imputation value for specified variables at each iteration.
`iteration_count`([datasets, variables])	Grabs the iteration count for specified variables, datasets.
`plot_correlations`([datasets, variables])	Plot the correlations between datasets.
`plot_imputed_distributions`([datasets, ...])	Plot the imputed value distributions.
`plot_mean_convergence`([datasets, variables])	Plots the average value of imputations over each iteration.

complete_data(dataset=0, iteration=None, inplace=False, variables=None)[source]

Return dataset with missing values imputed.

Parameters

dataset (int) – The dataset to complete.
iteration (int) – Impute data with values obtained at this iteration. If None, the most recent iteration is used for each variable, even if it differs between variables. If not None, the iteration must have been saved in the imputed values.
inplace (bool) – Whether to complete the data in place. If True, self.working_data is imputed and nothing is returned, which is useful if the dataset is very large. If False, a copy of the data is returned, with missing values imputed.

Returns

The completed data, with values imputed for specified variables.
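The difference between the two inplace modes can be illustrated with a small standard-library sketch. It does not use miceforest: complete_rows, rows, and imputed are hypothetical names, and missing cells are represented as None purely for illustration.

```python
import copy


def complete_rows(rows, imputed, inplace=False):
    """Fill missing cells (None) in a list of dict rows.

    `imputed` maps (row_index, column_name) to the value to fill in.
    Hypothetical helper for illustration; not part of miceforest.
    """
    # Work on the caller's object when inplace=True, otherwise on a copy.
    target = rows if inplace else copy.deepcopy(rows)
    for (row_index, column), value in imputed.items():
        if target[row_index][column] is None:
            target[row_index][column] = value
    # Mirror the documented contract: nothing is returned when inplace=True.
    return None if inplace else target


rows = [{"age": 31, "bmi": None}, {"age": None, "bmi": 22.5}]
imputed = {(0, "bmi"): 24.1, (1, "age"): 40}

completed = complete_rows(rows, imputed)              # copy returned
still_missing = rows[0]["bmi"] is None                # original untouched so far
result = complete_rows(rows, imputed, inplace=True)   # rows modified in place
```

With inplace=False the caller keeps the original data with its missing values; with inplace=True no second copy is made, which is the memory saving the parameter description refers to.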

dataset_count()[source]: Return the number of datasets, that is, the number of distinct sets of imputation values that have been accumulated.

get_correlations(datasets, variables)[source]

Return the correlations between datasets for the specified variables.

Parameters: variables (list[str] or list[int]) – The variables to return the correlations for.
Returns: The correlations at each iteration for the specified variables.
Return type: dict
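As a rough illustration of what "correlations between datasets at each iteration" can mean, the sketch below computes pairwise correlations between the imputed values of one variable across datasets. It is not miceforest's implementation: the function names and the nested-dict input layout are hypothetical, and Pearson correlation is an assumption, since the text above does not say which measure is used.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlations_by_iteration(imputations):
    """Map each iteration to the correlations of every pair of datasets.

    `imputations[dataset][iteration]` is the list of imputed values for a
    single variable (hypothetical layout, chosen for this sketch).
    """
    datasets = sorted(imputations)
    iterations = sorted(imputations[datasets[0]])
    return {
        it: [
            pearson(imputations[a][it], imputations[b][it])
            for a, b in combinations(datasets, 2)
        ]
        for it in iterations
    }


toy = {
    0: {1: [1.0, 2.0, 3.0], 2: [1.0, 2.0, 4.0]},
    1: {1: [2.0, 4.0, 6.0], 2: [1.5, 2.5, 3.0]},
    2: {1: [3.0, 2.0, 1.0], 2: [1.0, 2.2, 3.9]},
}
corr = correlations_by_iteration(toy)
```

Each iteration yields one value per pair of datasets, so three datasets give three correlations per iteration.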

get_means(datasets, variables=None)[source]: Return a dict containing the average imputation value for specified variables at each iteration.
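The per-iteration averages can be pictured with a minimal standard-library sketch. The function name and the input layout are hypothetical, and the shape of the dict returned by the real method may differ.

```python
from statistics import mean


def means_by_iteration(imputations):
    """Average the imputed values of each variable at each iteration.

    `imputations[variable][iteration]` is a list of imputed values
    (hypothetical layout, chosen for this sketch).
    """
    return {
        variable: {it: mean(values) for it, values in by_iteration.items()}
        for variable, by_iteration in imputations.items()
    }


toy = {
    "bmi": {0: [20.0, 30.0], 1: [24.0, 26.0]},
    "age": {0: [40.0, 50.0], 1: [44.0, 48.0]},
}
means = means_by_iteration(toy)
```

Tracking these averages over iterations is the kind of data a mean-convergence plot is drawn from.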

iteration_count(datasets=None, variables=None)[source]

Grabs the iteration count for specified variables, datasets. If the iteration count is not consistent across the provided datasets and variables, an error is raised. Passing None uses all datasets or all variables.

This is to ensure the process is in a consistent state when the iteration count is needed.

Parameters

datasets (int or list[int]) – The datasets to check the iteration count for.
variables (int, str, list[int] or list[str]) – The variables to check the iteration count for. Variables can be specified by their names or indexes.

Returns

An integer representing the iteration count.
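The consistency check described above can be sketched as follows. This is an illustration under assumptions, not the library's code: consistent_iteration_count and the counts layout are hypothetical, and ValueError stands in for whichever exception the library actually raises.

```python
def consistent_iteration_count(counts, datasets=None, variables=None):
    """Return the single iteration count shared by the selection.

    `counts[dataset][variable]` is the number of completed iterations
    (hypothetical layout). None selects all datasets or all variables.
    """
    if datasets is None:
        datasets = list(counts)
    elif isinstance(datasets, int):
        datasets = [datasets]

    if variables is None:
        variables = sorted({v for ds in datasets for v in counts[ds]})
    elif isinstance(variables, (int, str)):
        variables = [variables]

    seen = {counts[ds][var] for ds in datasets for var in variables}
    if len(seen) != 1:
        # Assumed exception type; the text only says an error is raised.
        raise ValueError(f"Inconsistent iteration counts: {sorted(seen)}")
    return seen.pop()


counts = {0: {"age": 3, "bmi": 3}, 1: {"age": 3, "bmi": 2}}

only_first = consistent_iteration_count(counts, datasets=0)
only_age = consistent_iteration_count(counts, variables="age")
try:
    consistent_iteration_count(counts)
    raised = False
except ValueError:
    raised = True
```

Narrowing the selection to one dataset or one variable succeeds here, while asking for everything fails because "bmi" in dataset 1 is one iteration behind.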

plot_correlations(datasets=None, variables=None, **adj_args)[source]

Plot the correlations between datasets. See get_correlations() for more details.

Parameters

datasets (None or list[int]) – The datasets to plot.
variables (None or list) – The variables to plot.
adj_args – Additional arguments passed to plt.subplots_adjust()

plot_imputed_distributions(datasets=None, variables=None, iteration=None, **adj_args)[source]

Plot the imputed value distributions. Red lines are the distribution of the original data. Black lines are the distribution of the imputed values.

Parameters

datasets (None, int, list[int]) – The datasets to plot.
variables (None, str, int, list[str], or list[int]) – The variables to plot. If None, all numeric variables are plotted.
iteration (None, int) – The iteration to plot the distribution for. If None, the latest iteration is plotted. save_all_iterations must be True if specifying an iteration.
adj_args – Additional arguments passed to plt.subplots_adjust()

plot_mean_convergence(datasets=None, variables=None, **adj_args)[source]

Plots the average value of imputations over each iteration.

Parameters

variables (None or list) – The variables to plot. Must be numeric.
adj_args – Passed to matplotlib.pyplot.subplots_adjust()