Discrete Applied Mathematics 159 (2011) 1040–1047
A new imputation method for incomplete binary data
Munevver Mine Subasi a, Ersoy Subasi b,∗, Martin Anthony c, Peter L. Hammer 1
a Department of Mathematical Sciences, Florida Institute of Technology, 150 W. University Blvd., Melbourne, FL 32901, USA
b RUTCOR, Rutgers Center for Operations Research, 640 Bartholomew Road, Piscataway, NJ 08854, USA
c Department of Mathematics, London School of Economics and Political Sciences, Houghton Street, London WC2A 2AE, UK
Article info
Article history:
Received 17 October 2009
Received in revised form 28 August 2010
Accepted 31 January 2011
Available online 21 March 2011
Keywords:
Imputation
Boolean similarity measure
Abstract
In data analysis problems where the data are represented by vectors of real numbers, it is
often the case that some of the data-points will have ‘‘missing values’’, meaning that one or
more of the entries of the vector that describes the data-point is not observed. In this paper,
we propose a new approach to the imputation of missing binary values. The technique we
introduce employs a ‘‘similarity measure’’ introduced by Anthony and Hammer (2006) [1].
We compare experimentally the performance of our technique with ones based on the
usual Hamming distance measure and multiple imputation.
© 2011 Elsevier B.V. All rights reserved.
1. Introduction
In practical machine learning or data analysis problems in which the data to be analyzed consist of vectors of real
numbers, it is often the case that some of the data-points will have ‘‘missing values’’, meaning that one or more of the
entries of the vector that describes the data-point are not known. It is natural to try to ‘‘fill in’’ or impute these missing values
so that one then has complete data to work from. This may be necessary, for instance, so that statistical or machine learning
techniques can be applied to the data. This is a classical statistical and machine learning problem, and many
techniques have been employed.
Since in real-life applications missing data are a nuisance rather than the primary focus, an imputation method with good
properties can be preferable to one that is complicated to implement and more efficient, but problem-specific.
Some approaches to handling missing data simply ignore or delete points that are incomplete. Classical approaches of this
type are list-wise deletion (LD) and pairwise deletion (PD). Because of their simplicity, they are widely used (see, e.g., [15])
and tend to be the default for most statistical packages. However, the application of these techniques may lead to a large
loss of observations, which may result in data-sets that are too small if the fraction of missing values is high, and particularly
if the original data-set is itself small.
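
The loss of observations under list-wise deletion can be illustrated with a minimal sketch. The toy data-set, the use of `None` to mark a missing entry, and the helper names `listwise_delete` and `missing_fraction` are assumptions made for illustration only; they do not come from the paper.

```python
from typing import List, Optional

# A data-point is a list of binary entries; None marks a missing value
# (an encoding chosen for this illustration).
Row = List[Optional[int]]


def listwise_delete(rows: List[Row]) -> List[Row]:
    """Keep only the data-points that have no missing entry."""
    return [r for r in rows if all(v is not None for v in r)]


def missing_fraction(rows: List[Row]) -> float:
    """Fraction of all entries in the data-set that are missing."""
    total = sum(len(r) for r in rows)
    missing = sum(v is None for r in rows for v in r)
    return missing / total


# Hypothetical toy data-set: 5 data-points, 4 binary attributes.
data: List[Row] = [
    [1, 0, 1, 1],
    [0, None, 1, 0],
    [1, 1, None, 0],
    [0, 0, 0, 1],
    [None, 1, 1, 1],
]

complete = listwise_delete(data)
print(f"missing entries: {missing_fraction(data):.0%}")
print(f"data-points kept: {len(complete)} of {len(data)}")
```

In this toy case only 3 of the 20 entries are missing, yet list-wise deletion discards 3 of the 5 data-points, because each missing entry happens to fall in a different data-point.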
One of the most challenging decisions confronting researchers is choosing the most appropriate method to handle
missing data during analysis. Little and Rubin [13] suggest that naive or unprincipled imputation methods may create
more problems than they solve. The most common data imputation techniques are mean imputation (also referred to
as unconditional mean imputation), regression imputation (RI, also referred to as conditional mean imputation), hot-deck
imputation (HDI) and multiple imputation (MI). We remark that mean imputation and similar approaches are not proper
in the sense of Rubin [16] and hence are not recommended. In most situations, simple techniques for handling missing data
(such as the complete case analysis methods LD and PD, overall mean imputation, and the missing-indicator method) produce biased results,
as documented in [5,12,16,18,21]. A more sophisticated technique, MI, gives much better results [5,12,16,18,21].
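
For concreteness, unconditional mean imputation on binary data might be sketched as follows. This is an illustrative assumption rather than a procedure taken from the paper: missing entries are encoded as `None`, each column mean is computed over the observed entries only, and the mean is rounded to 0 or 1 with a hypothetical `threshold` parameter so that the filled-in value stays binary.

```python
from typing import List, Optional

Row = List[Optional[int]]  # None marks a missing entry (illustrative encoding)


def column_means(rows: List[Row]) -> List[float]:
    """Mean of the observed entries in each column.

    Assumes every column has at least one observed entry.
    """
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return means


def mean_impute_binary(rows: List[Row], threshold: float = 0.5) -> List[List[int]]:
    """Replace each missing entry by its column mean rounded to 0 or 1.

    The rounding step is an assumption made to keep the data binary.
    """
    means = column_means(rows)
    return [
        [v if v is not None else int(means[j] >= threshold) for j, v in enumerate(r)]
        for r in rows
    ]


# Hypothetical toy data-set: 5 data-points, 3 binary attributes.
data: List[Row] = [
    [1, 0, 1],
    [1, None, 0],
    [None, 0, 1],
    [1, 1, None],
    [0, 0, 1],
]

imputed = mean_impute_binary(data)
print(column_means(data))
print(imputed)
```

Every missing entry in a given column receives the same value regardless of the other entries of its data-point, which is one way to see why such single-value fills can distort the relationships between attributes.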
∗ Corresponding author. Tel.: +1 321 674 7486; fax: +1 321 674 7412.
E-mail addresses: msubasi@fit.edu (M.M. Subasi), esub@rutcor.rutgers.edu (E. Subasi), m.anthony@lse.ac.uk (M. Anthony).
1 Deceased.
0166-218X/$ – see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.dam.2011.01.024