Data Imputation

The process of replacing missing or incomplete data in a dataset with substituted values to maintain the dataset's integrity and usability.

Data imputation is a critical data-preprocessing technique for addressing missing values. Imputation methods range from simple strategies, such as mean, median, or mode substitution, to more sophisticated techniques such as regression imputation, multiple imputation, and machine-learning-based imputation. These methods help maintain the statistical properties of the data, improve the performance of machine learning models, and enable more accurate analyses. Effective imputation preserves the underlying structure and relationships within the data, mitigating bias and maintaining the validity of inferences drawn from the dataset.
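As a minimal sketch of these strategies, the example below applies mean substitution, a machine-learning-based (k-nearest-neighbors) imputer, and a multiple-imputation-style procedure to a toy dataset. It assumes scikit-learn and NumPy are installed; the specific classes used (SimpleImputer, KNNImputer, IterativeImputer) are one possible toolkit, not something prescribed by the text above.

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer
# IterativeImputer is experimental in scikit-learn and must be
# explicitly enabled before it can be imported.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy dataset; np.nan marks the missing entries.
X = np.array([
    [7.0, 2.0, 3.0],
    [4.0, np.nan, 6.0],
    [10.0, 5.0, 9.0],
    [np.nan, 8.0, 1.0],
])

# Simple strategy: replace each missing value with its column mean
# (strategy can also be "median" or "most_frequent" for the mode).
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# Machine-learning-based strategy: estimate each missing value from
# the k rows most similar on the observed features.
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Multiple-imputation flavor: with sample_posterior=True, each run
# draws imputations from a predictive distribution, so different
# seeds yield different completed datasets that can be analyzed
# separately and then pooled in Rubin's framework.
imputations = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(3)
]

print("Mean-imputed:\n", X_mean)
print("KNN-imputed:\n", X_knn)
print("First of 3 multiple imputations:\n", imputations[0])
```

In a real analysis, the estimates obtained from the several imputed datasets would be combined using Rubin's pooling rules rather than inspected individually.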

The concept of data imputation has been around since the early days of statistics, but it gained significant traction in the 1970s with the development of more advanced statistical methods. Donald B. Rubin's formalization of multiple imputation, culminating in his 1987 book on the subject, marked a pivotal moment, providing a robust framework for handling missing data in a statistically sound manner.

Donald B. Rubin is a prominent figure in the field of data imputation, especially known for developing the multiple imputation technique. Other notable contributors include Paul D. Allison, who has extensively written about statistical methods for handling missing data, and Roderick J. Little, known for his work on missing data theory and methods.
