Categorical data imputation using non-parametric or semi-parametric imputation methods

Date

2016-05-11

Authors

Khosa, Floyd Vukosi

Abstract

Researchers and data analysts often run into problems when analysing data with missing values. Methods for imputing continuous data are well developed in the literature, but methods for imputing categorical data are not well established. This research report focuses on categorical data imputation using non-parametric and semi-parametric methods. The aims of the study are to compare different imputation methods for categorical data and to assess the quality of the imputation. Three imputation methods are compared: multiple imputation, hot deck imputation and random forest imputation. Missing values are created in a complete data set under the missing completely at random (MCAR) mechanism. Each imputed data set is compared with the original complete data set, and the imputed values that match the values in the original data set are counted. The analysis shows that hot deck imputation is more precise than random forest imputation and multiple imputation. A logistic regression model is fitted to each imputed data set and to the original data set, and the resulting models are compared. This comparison shows that multiple imputation negatively affects the fit of the logistic regression model.
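
The evaluation procedure described in the abstract (delete values completely at random, impute them, then count how many imputed values match the originals) can be illustrated with a minimal Python sketch. The sketch is not the report's implementation. It assumes a single categorical variable and a simple random hot deck in which each missing value is replaced by a randomly chosen observed value. The function names, the 20% missingness rate, the three category labels and the sample size are all hypothetical and chosen only for illustration.

```python
import random


def make_mcar(values, missing_rate, rng):
    """Copy `values`, setting each entry to None independently with
    probability `missing_rate` (missing completely at random)."""
    return [None if rng.random() < missing_rate else v for v in values]


def random_hot_deck(values, rng):
    """Fill each None with a value drawn at random from the observed
    entries (the donors). This is an assumed, simplified hot deck variant."""
    donors = [v for v in values if v is not None]
    if not donors:
        raise ValueError("no observed values available as donors")
    return [rng.choice(donors) if v is None else v for v in values]


def imputation_accuracy(original, masked, imputed):
    """Proportion of deleted entries whose imputed value equals the
    original value. Returns None if nothing was deleted."""
    missing_idx = [i for i, v in enumerate(masked) if v is None]
    if not missing_idx:
        return None
    hits = sum(imputed[i] == original[i] for i in missing_idx)
    return hits / len(missing_idx)


# Hypothetical complete data set: one categorical variable, three levels.
rng = random.Random(0)
original = [rng.choice(["a", "b", "c"]) for _ in range(200)]

masked = make_mcar(original, 0.2, rng)   # create MCAR missingness
imputed = random_hot_deck(masked, rng)   # impute the deleted values
accuracy = imputation_accuracy(original, masked, imputed)
print("proportion of imputed values matching the original:", accuracy)
```

The same `imputation_accuracy` function could be applied to the output of any other imputation method, which is what makes the match count usable as a common yardstick across methods.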

Description

A research report submitted to the Faculty of Science, University of the Witwatersrand, for the degree of Master of Science by Coursework and Research Report.
