Abstract:
Missing data are common in medical research. Unfortunately,
they are often neglected or not properly handled during analytic procedures, and
this may substantially bias the results of the study, reduce the study power, and
lead to invalid conclusions. In this study, we introduce key concepts regarding
missing data in survey data analysis, provide a conceptual framework for
approaching missing data in this setting, describe typical mechanisms of missing
data, and use a theoretical model for handling such data. We consider a case
where the variable of interest (the response variable) is binary and some of its observations
are missing, and we assume that all the covariates are fully observed. With
binary data, the statistic of interest is in most cases the prevalence.
We develop a two-stage approach to improve the prevalence estimates: in
the first stage, we use a logistic regression model to predict the missing binary
observations, and in the second stage we recalculate the prevalence using
the observed binary data and the imputed values. Finally, we study the
asymptotic properties of the prevalence estimator. Such a model would be of
great interest in research studies involving HIV, in which people usually refuse
to donate blood for testing yet are willing to provide other covariates. The
prevalence estimation method is illustrated using simulated data and applied to
HIV/AIDS data from the Kenya AIDS Indicator Survey, 2007.
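
The two-stage idea described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the estimator as implemented in the paper: the function names (fit_logistic, two_stage_prevalence), the gradient-ascent fitting routine, the simulated data, and the choice to impute each missing response with its fitted probability (rather than a thresholded or randomly drawn 0/1 value) are all assumptions made for illustration.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(beta, x):
    """P(y = 1 | x) under the model; beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


def fit_logistic(X, y, lr=0.5, n_iter=300):
    """Fit logistic regression by batch gradient ascent on the mean log-likelihood."""
    n, p = len(X), len(X[0])
    beta = [0.0] * (p + 1)
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for x, yi in zip(X, y):
            resid = yi - predict_proba(beta, x)
            grad[0] += resid
            for j, v in enumerate(x):
                grad[j + 1] += resid * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


def two_stage_prevalence(X, y):
    """Two-stage estimate; y holds 0, 1, or None (missing), X is fully observed."""
    observed = [(x, yi) for x, yi in zip(X, y) if yi is not None]
    # Stage 1: fit the logistic model on the respondents only.
    beta = fit_logistic([x for x, _ in observed], [yi for _, yi in observed])
    # Stage 2: average observed responses together with imputed probabilities
    # (assumption: the fitted probability serves as the imputed value).
    total = sum(
        yi if yi is not None else predict_proba(beta, x) for x, yi in zip(X, y)
    )
    return total / len(y)


# Hypothetical simulated data: one covariate, response missing more often
# when the covariate is large (missingness depends only on the covariate).
random.seed(0)
n = 2000
X = [[random.gauss(0.0, 1.0)] for _ in range(n)]
y_full = [1 if random.random() < sigmoid(-1.0 + 1.5 * x[0]) else 0 for x in X]
y_obs = [
    None if random.random() < sigmoid(-0.5 + x[0]) else yi
    for x, yi in zip(X, y_full)
]

full_prev = sum(y_full) / n                      # prevalence if nothing were missing
responders = [yi for yi in y_obs if yi is not None]
complete_case = sum(responders) / len(responders)  # ignores the missing responses
estimate = two_stage_prevalence(X, y_obs)          # two-stage estimate

print(f"full data: {full_prev:.3f}")
print(f"complete case: {complete_case:.3f}")
print(f"two-stage: {estimate:.3f}")
```

In this simulated setting the missingness depends only on the fully observed covariate, so the complete-case prevalence is biased while the two-stage estimate uses the covariates of the non-respondents to correct for it. In practice an established logistic regression routine would replace the hand-written fit.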