
Knowledge-Based Systems 53 (2013) 51–65


Missing value imputation using decision trees and decision forests by splitting and merging records: Two novel techniques

Md. Geaur Rahman, Md Zahidul Islam ⇑
Center for Research in Complex Systems, School of Computing and Mathematics, Charles Sturt University, Bathurst, NSW 2795, Australia

⇑ Corresponding author. Tel.: +61 2 633 84214; fax: +61 2 633 84649. E-mail addresses: [email protected] (M.G. Rahman), [email protected] (M.Z. Islam).

0950-7051/$ - see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.knosys.2013.08.023

Article info

Article history:
Received 11 April 2013
Received in revised form 6 August 2013
Accepted 7 August 2013
Available online 4 September 2013

Keywords:
Data pre-processing
Data cleansing
Missing value imputation
Decision tree algorithm
Decision forest algorithm
EM algorithm

Abstract

We present two novel techniques for the imputation of both categorical and numerical missing values. The techniques use decision trees and forests to identify horizontal segments of a data set where the records belonging to a segment have higher similarity and attribute correlations. Using the similarity and correlations, missing values are then imputed. To achieve a higher quality of imputation some segments are merged together using a novel approach. We use nine publicly available data sets to experimentally compare our techniques with a few existing ones in terms of four commonly used evaluation criteria. The experimental results indicate a clear superiority of our techniques based on statistical analyses such as confidence interval.

© 2013 Elsevier B.V. All rights reserved.

1. Introduction

Organisations are extremely dependent nowadays on data collection, storage and analysis for various decision-making purposes. Collected data often have incorrect and missing values [1–3]. For imputing a missing value rij (i.e. the jth attribute value of the ith record ri) in a data set DF, an existing technique kNNI [4] first finds the k most similar records of ri from DF by using the Euclidean distance measure. If Aj ∈ A is a categorical attribute, the technique imputes rij by using the most frequent value of Aj (the jth attribute) within the k-Nearest Neighbor (k-NN) records. If Aj is numerical, the technique utilizes the mean value of Aj for the k-NN records in order to impute rij. While simplicity is an advantage of kNNI, a major disadvantage of the technique is that for each record having missing value/s it needs to search the whole data set in order to find the k nearest neighbors. Therefore, the technique can be expensive for a large data set [4]. Moreover, kNNI requires user input for k as it does not find the best k value automatically. It can be difficult for a user to estimate a suitable k value.
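A minimal sketch of the kNNI idea described above is given below. It is not the implementation from [4]; it assumes a numeric numpy array in which categorical values are label-encoded and missing cells are NaN.

```python
import numpy as np

def knni_impute(data, i, j, k, categorical_cols):
    """Impute data[i, j] from the k records most similar to record i (kNNI idea).

    data: 2D float array; missing entries are np.nan. Illustrative sketch only,
    not the kNNI implementation evaluated in the paper.
    """
    target = data[i]
    candidates = []
    for m, row in enumerate(data):
        if m == i or np.isnan(row[j]):
            continue
        mask = ~np.isnan(target) & ~np.isnan(row)
        mask[j] = False                       # do not use the attribute being imputed
        if not mask.any():
            continue
        dist = np.sqrt(np.sum((target[mask] - row[mask]) ** 2))
        candidates.append((dist, row[j]))
    candidates.sort(key=lambda t: t[0])
    neighbour_vals = np.array([v for _, v in candidates[:k]])
    if j in categorical_cols:                 # most frequent value among the k-NN
        vals, counts = np.unique(neighbour_vals, return_counts=True)
        return vals[np.argmax(counts)]
    return neighbour_vals.mean()              # mean value among the k-NN
```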

Therefore, a suitable k value is often estimated automatically in some existing techniques such as Local Weighted Linear Approximation Imputation (LWLA) [5]. In order to estimate k automatically, LWLA first artificially creates a missing value ril for a record ri that has a real missing value rij. Note that the original value of ril is known. The technique uses all possible k values, and for each k value it finds the set of k-NN records of ri. For each set of k-NN records, it then imputes the missing value ril using the mean of the lth attribute of all records belonging to the set of k-NN records. Based on the imputed value and the actual value of ril, LWLA calculates the normalized Root Mean Squared Error (RMSE). RMSE is calculated for all sets of k-NN records. The best k value is estimated from the set of k-NN records that produces the minimum RMSE value. Finally LWLA imputes a real missing value rij by using each record rm belonging to the set of k-NN records (using the best k) of ri, the similarity between ri and rm, and the jth attribute value of rm.

LWLA uses the most relevant horizontal segment of a data set through the use of the best k-NN records. However, it does not pay attention to the relevance of the attributes and thus it does not divide a data set vertically. A recent technique called "Iterative Bi-Cluster based Local Least Square Imputation" (IBLLS) [6] divides a data set in both horizontal and vertical segments for imputing a numerical missing value, rij, of a record ri.

Similar to LWLA, it first automatically identifies the k most similar records for ri. Based on the k most similar records it then calculates the correlation matrix R for the attributes with available values in ri and the attributes whose values are missing in ri. It next re-calculates a new set of k records (for the same k that was automatically identified) using R and a weighted Euclidean distance (WED) [6]. For the WED calculation the attributes having high correlation with the jth attribute (i.e. the attribute having a missing value for rij) are taken more seriously than the attributes having low correlations.
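The weighting idea behind the WED can be illustrated with the simplified sketch below. The weights here are plain absolute correlations with the attribute being imputed; the actual weighting used by IBLLS [6] is more involved.

```python
import numpy as np

def weighted_euclidean(x, y, weights):
    """Distance in which better-correlated attributes count more.

    weights: one non-negative weight per attribute, e.g. the absolute correlation
    of each attribute with the attribute being imputed (a simplification of the
    WED used by IBLLS [6], not its exact definition).
    """
    x, y, w = np.asarray(x, float), np.asarray(y, float), np.asarray(weights, float)
    return np.sqrt(np.sum(w * (x - y) ** 2))

# Example: attribute 0 is highly correlated with the missing attribute,
# so differences in attribute 0 dominate the distance.
d = weighted_euclidean([1.0, 5.0], [2.0, 9.0], weights=[0.9, 0.1])
```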


IBLLS further divides the k-NN records vertically by considering only the attributes having high correlations with the jth attribute. Within this partition it then applies an imputation technique called the Local Least Square framework [7] in order to impute rij. This procedure is repeated for imputing each missing value of ri. Similarly, all other records of the data set having missing values are imputed.

Another group of techniques [8–10] uses classifiers such as a neural network for imputation. For example, a technique [10] uses a three layered perceptron network in which the numbers of neurons in both the input and output layers are equal to the number of attributes of the data set DF. The method first generates a perturbed data set Dp from DF by considering some available values as missing. Using the values of Dp in the input layer and the values of DF in the output layer, it then trains the network. It finally imputes the missing values of DF using the trained neural network. The performance of a neural network can be improved by using a Genetic Algorithm (GA) in the training process of the network [8].

A Genetic Algorithm (GA) can also be used to estimate a suitable set of parameters of a Fuzzy C-means algorithm (FCM), which can then be used to impute missing values [11]. In an existing technique [11], a missing value is first imputed separately using Support Vector Regression (SVR) and an FCM with user defined parameters. The imputed values are then compared to test their mutual agreement. If the imputed values are not similar then a GA technique is applied to re-estimate the parameters of FCM. The new parameters are used in FCM to impute the missing value again. The process continues until the FCM and SVR imputed values are similar to each other. GA can also be used to estimate suitable parameters for FCM by first artificially creating missing values, and then imputing the artificial missing values by FCM. Once the artificially created missing values are imputed, the difference between the original and imputed values can be used in GA as a fitness measure for further improvement of the imputation [11]. Often FCM based imputation techniques are more tolerant of imprecision and uncertainty [12].

Unlike the family of k-NN imputation techniques, EMI [13] takes a global approach in the sense that it imputes missing values using the whole data set, instead of a horizontal segment of it. For imputing a missing value EMI relies on the correlations among the attributes. The techniques using a localized approach, such as the family of k-NN imputation methods, have the advantage of imputing a missing value based on the k nearest neighbors (i.e. the k most similar records) instead of all records. However, k-NN techniques find the k nearest neighbors based only on similarity and without considering correlations of attributes within the k nearest neighbors. Since EMI relies on correlations it is more likely to perform better on a data set having high correlations for the attributes. Correlations of the attributes are natural properties of a data set and they cannot be improved or modified for the data set. However, we realize that it is often possible to have horizontal segments (within a data set) where there are higher correlations than the correlations over the whole data set. Hence, the identification of the horizontal segments having high correlations and the application of the EMI algorithm within the segments is expected to produce a better imputation result.
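This observation can be reproduced with a small synthetic experiment (toy data, not from the paper): the correlation computed inside a well-chosen horizontal segment can exceed the correlation computed over the whole data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two attributes that are only weakly correlated over most of the records...
whole = rng.normal(size=(500, 2))
# ...but strongly correlated inside one horizontal segment (e.g. one leaf)
seg_x = rng.normal(size=(60, 1))
segment = np.hstack([seg_x, 0.9 * seg_x + rng.normal(scale=0.1, size=(60, 1))])
data = np.vstack([whole, segment])

corr_whole = np.corrcoef(data, rowvar=False)[0, 1]       # correlation over all records
corr_segment = np.corrcoef(segment, rowvar=False)[0, 1]  # correlation within the segment
print(abs(corr_segment) > abs(corr_whole))               # typically True here
```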

In this paper, we propose two novel imputation techniques called DMI [14] and SiMI. DMI makes use of any existing decision tree algorithm such as C4.5 [15], and an Expectation Maximisation (EM) based imputation technique called EMI [13]. SiMI is an extension of DMI. It uses any existing decision forest algorithm (such as SysFor [16]) and EMI, along with our novel splitting and merging approach. DMI and SiMI can impute both numerical and categorical attributes. We evaluate our techniques on nine data sets based on four evaluation criteria, namely the co-efficient of determination (R2), the index of agreement (d2), the root mean squared error (RMSE) and the mean absolute error (MAE) [17,18]. Our experimental results indicate that DMI and SiMI perform significantly better than EMI [13] and IBLLS [6].
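For reference, one common formulation of the four criteria is sketched below; the exact definitions used in [17,18] may differ in normalisation, so this is only an illustrative helper.

```python
import numpy as np

def imputation_scores(actual, imputed):
    """R2, d2 (index of agreement), RMSE and MAE between original and imputed
    values. One common formulation; not necessarily identical to [17,18]."""
    a, p = np.asarray(actual, float), np.asarray(imputed, float)
    rmse = np.sqrt(np.mean((p - a) ** 2))
    mae = np.mean(np.abs(p - a))
    r2 = np.corrcoef(a, p)[0, 1] ** 2                        # coefficient of determination
    d2 = 1 - np.sum((p - a) ** 2) / np.sum(
        (np.abs(p - a.mean()) + np.abs(a - a.mean())) ** 2)  # index of agreement
    return {"R2": r2, "d2": d2, "RMSE": rmse, "MAE": mae}
```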

The organization of the paper is as follows. Section 2 presents our techniques (DMI and SiMI). Experimental results are presented in Section 3. Section 4 gives concluding remarks.

2. Our novel imputation techniques

2.1. Basic concept

EMI based imputation techniques [13] rely on the correlations of the attributes of a data set. We argue that their imputation accuracies can be high for a data set having high correlations for the attributes. We also realize that correlations among the attributes can be higher within some horizontal partitions of a data set than within the whole data set. Therefore, we focus on exploring such partitions for higher imputation accuracy. We use a decision tree or a decision forest to identify such natural partitions. A decision tree divides a data set into a number of leaves having sets of mutually exclusive records. A decision forest builds a number of decision trees.

Let us assume that we build a forest having two trees Tx and Ty (Fig. 1b and c) from a sample data set (Fig. 1a). Suppose a leaf L1 belonging to Tx has five records, L1 = {R1, R2, R3, R4, R5} (Fig. 1b and d), and a leaf L4 belonging to Ty has five records, L4 = {R1, R2, R3, R6, R7} (Fig. 1c and e). Records belonging to any leaf are considered to be similar to each other since they share the same or very similar values for the test attributes [19,20]. For example, records belonging to the leaf L1 (in Tx) can be considered to be similar to each other since they share similar values for Ai and Aj. Similarly, records belonging to L4 (in Ty) are also similar to each other. If we take the intersection of L1 and L4 then we get their common records in the intersection, i.e. L1 ∩ L4 = {R1, R2, R3} (Fig. 1f). Here, R1, R2 and R3 share the same or similar values for the test attributes (Ai, Aj, Al and Am) in both Tx and Ty, whereas records belonging to the leaf L1 only share the same or similar values for the test attributes (Ai and Aj) in the single tree Tx. Therefore, we expect that records belonging to a leaf have a strong similarity with high correlations for the attributes, and records belonging to an intersection have even stronger similarity and higher correlations.

Fig. 1. Basic concept of SiMI: (a) a sample data set; (b) Tx; (c) Ty; (d) leaves of Tx; (e) leaves of Ty; (f) intersections; (g) merging.

We now empirically analyse the concepts on a real data set. Applying the C4.5 algorithm [15] on the Credit Approval (CA) data set [21] we build a decision tree, which happens to have seven leaves. We then compute the correlations among the attributes for all records of Credit Approval, and for the records belonging to each leaf separately. That is, first we compute the correlation C^W_ij between two attributes, say Ai and Aj, for the whole data set (i.e. for all records). We then compute the correlation C^L_ij between the same attributes Ai and Aj using only the records within a leaf L, instead of all records in the data set. If the absolute value of C^L_ij is greater than the absolute value of C^W_ij then we consider C^L_ij a better correlation than C^W_ij.

For the six numerical attributes of the CA data set there are 15 possible pairs (6 choose 2) among the attributes, and therefore 15 possible correlations. Column B of Table 1 shows the number of pairs of attributes Ai and Aj for a leaf L (shown in Column A) where we get C^L_ij > C^W_ij. For example, for Leaf 1 (as shown in Column A) we get 11 pairs of attributes (out of 15 pairs) that have higher correlations within the leaf than over the whole data set. The first three columns of Table 1 show that for all leaves (except Leaf 2 and Leaf 3), at least 11 out of 15 pairs have higher correlations within a leaf. Leaf 2 and Leaf 3 have a small number of records, as shown in Column C.
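The quantity in Column B can be computed with a small helper mirroring the comparison just described (a sketch, assuming plain numpy arrays of the numerical attributes).

```python
import numpy as np
from itertools import combinations

def better_correlation_pairs(whole, leaf):
    """Count attribute pairs whose absolute correlation within the leaf records
    exceeds the absolute correlation over the whole data set (Column B / E of
    Table 1). Both arguments: 2D arrays of numerical attribute values."""
    cw = np.corrcoef(whole, rowvar=False)
    cl = np.corrcoef(leaf, rowvar=False)
    n_attr = whole.shape[1]
    return sum(abs(cl[i, j]) > abs(cw[i, j])
               for i, j in combinations(range(n_attr), 2))
```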

A decision forest can be used for identifying records with even better similarity (among the records) and higher correlations (among the attributes). To investigate the concept, we apply an existing forest algorithm called SysFor [16] on the CA data set. For the purpose of demonstrating the basic concepts, we build a forest of only two trees from CA and then take the intersections of the leaves. Note that the leaves of a single tree do not contain any common records. Therefore, we only take the intersection of two leaves Li and Lj belonging to two different trees Tx and Ty. Columns D, E and F of Table 1 are derived from the intersections of the leaves of two different trees that are not shown due to space limitations. They show higher correlations within the intersections. For example, for Intersection 1 (shown in Column D) there are 13 pairs of attributes (shown in Column E), out of 15 pairs, having higher correlations within the intersection than the correlations within the whole data set. A number of intersections such as Intersection 2 (not shown in the table) have only 1 or no records in them, for which correlation calculation is not possible.

Table 1. Correlation analysis on the Credit Approval (CA) data set. Columns: (A) leaf no.; (B) number of better correlations within the leaf, out of 15; (C) number of records in the leaf; (D) intersection no.; (E) number of better correlations within the intersection, out of 15; (F) number of records in the intersection.

A   B    C      D    E    F
1   11   218    1    13   9
2   3    31     3    12   108
3   6    23     8    11   151
4   11   52     11   15   2
5   12   223    16   12   139
6   12   85     20   13   61
7   14   21     21   13   23
                26   13   30
                33   15   3
                41   12   105
                49   13   20

We also calculate the similarity among all records of the CA data set, all records within each leaf separately and all records within each intersection separately. Fig. 2 indicates that the average similarity of the records within an intersection is higher than the average similarity of the records within a leaf. Additionally, the records within a leaf have higher average similarity than the similarity of the records within the whole data set. Similarity calculation steps are introduced in Step 4 of Section 2.3.

Often the intersections can have a small number of records. We realize that the EMI based imputation can be less effective on such a small sized intersection. In order to explore the impact of the intersection size on the imputation accuracy, we apply EMI on different numbers of records of the Credit Approval and CMC data sets [21]. We first artificially create a missing value in a record, and then create a sub-data set of size s by taking the s nearest neighbors (s-NNs) of the record having a missing value. We then apply EMI on each sub-data set to impute the missing value. Fig. 3 shows that both CMC and Credit Approval achieve high accuracy for s = 25 in terms of the RMSE evaluation criterion. The accuracy then does not increase drastically with any further increase of the size.

In order to avoid applying EMI on a small sized intersection, another basic concept of our technique is to merge a small sized intersection with a suitable target intersection in such a way that the merged intersection produces the best possible correlations and average similarity among the records. The merging process is shown in Fig. 1f and g, where the record R10 is merged with the records R8 and R9 in order to increase the similarity and correlations. The merging process is presented in detail in Step 4 of Section 2.3.

Fig. 2. Similarity analysis on the Credit Approval data set: average similarity of the records for the whole data set, within the leaves of a single tree, and within the intersections of double trees.

Fig. 3. Justification of the minimum size of a leaf: RMSE (lower value is better) against data set size τ for the Credit Approval and CMC data sets.


We argue that the accuracy of EMI may depend on three factors: (a) the size of an intersection, (b) the similarity (i.e. inverse distance) among the records of the intersection and (c) the correlations among the attributes within the intersection. A suitable size for an intersection can be 25 records, as shown in Fig. 3. While merging a small sized intersection with a target intersection we aim to increase both similarity and correlations. In order to explore a balance between similarity and correlations we carry out another initial test, where we first introduce a Similarity–Correlation threshold λ, the value of which varies between 0 and 1. Now, λ = 1 indicates that the target intersection is selected based only on the influence of similarity, and λ = 0 indicates that it is selected by considering only the correlations. We informally test the impact of various λ values on Credit Approval [21] by imputing missing values using our algorithm called SiMI, which we introduce in detail in Section 2.3, with different λ values. Fig. 4 shows that we generally get the best imputation accuracy at λ = 0.7 for the R2, d2, RMSE and MAE evaluation criteria. Since λ is a proportion between similarity and correlation we call it the Similarity–Correlation threshold.

Please note that in the above example, we use a forest of two trees only to demonstrate the basic concepts. However, as shown in Algorithm 2, by default we use 5 trees unless a user defines it otherwise. In all our experiments forests of 5 trees are used. If we use more trees then they will produce a bigger number of intersections, where many of them will have a very low number of records, resulting in a higher number of intersections requiring merging. Both the creation of a bigger number of intersections and the merging of many intersections will cause higher computational complexity. Additionally, since many intersections will be merged there is no point in having a huge number of intersections. We also carry out an empirical analysis on the CA and CMC data sets to evaluate the impact of the number of trees, k, in terms of accuracy of imputation (RMSE), number of intersections (NOI) and execution time (in ms). Fig. 5 shows that k = 5 gives an overall good result on the Credit Approval (CA) and CMC data sets.

Fig. 4. Analysis of a suitable value for the Similarity–Correlation threshold λ on the Credit Approval data set (R2 and d2: higher value is better; RMSE and MAE: lower value is better).

Generally s = 25, k = 5, λ = 0.7 and l = 100 can be a good option. However, these values can differ from data set to data set. A possible solution for a user to make a sensible assumption on these values can be to try different values and compare the imputation accuracies. For example, if a user has a data set DF with some missing values then he/she can first produce a pure data set DC by removing the records having missing values. The user can then randomly select attribute values of DC in order to artificially create missing values where the original values are known to him/her. He/she can then use different values of s, k, λ and l and impute the artificially created missing values. The values of s, k, λ and l that produce the best imputation accuracy can then be used in imputing the real missing values in DF. A user may run the evaluation many times. For example, he/she can run it 30 times to create 30 data sets with missing values. A similar approach was taken in the literature to find a suitable k value for k-NN based imputation [22].
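The tuning procedure just described can be sketched as a simple grid search. The functions simi_impute and evaluate_rmse below are hypothetical stand-ins for the SiMI implementation and the RMSE on the artificially removed cells; they are not part of the paper's code, and the grid values are only examples.

```python
import numpy as np
from itertools import product

def tune_parameters(d_c, simi_impute, evaluate_rmse, n_trials=30, rng=None):
    """Grid-search s, k, lambda and l on artificially created missing values.

    d_c: a complete (pure) numeric numpy array obtained from DF by dropping the
    records with missing values. A rough sketch of the procedure described above.
    """
    rng = rng or np.random.default_rng(0)
    best, best_rmse = None, np.inf
    for s, k, lam, l in product([15, 25, 50], [3, 5, 7], [0.5, 0.7, 0.9], [50, 100, 200]):
        rmses = []
        for _ in range(n_trials):
            corrupted = d_c.astype(float).copy()
            mask = rng.random(corrupted.shape) < 0.01   # knock out ~1% of the cells
            corrupted[mask] = np.nan
            imputed = simi_impute(corrupted, s=s, k=k, lam=lam, l=l)
            rmses.append(evaluate_rmse(d_c[mask], imputed[mask]))
        if np.mean(rmses) < best_rmse:
            best, best_rmse = (s, k, lam, l), np.mean(rmses)
    return best
```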

2.2. DMI

We first mention the main steps of DMI as follows and then explain each of them in detail.

Step-1: Divide a full data set DF into two sub data sets DC and DI.
Step-2: Build a set of decision trees on DC.
Step-3: Assign each record of DI to the leaf where it falls in. Impute categorical missing values.
Step-4: Impute numerical missing values using the EMI algorithm within the leaves.
Step-5: Combine records to form a completed data set (D′F) without any missing values.



Fig. 5. Analysis of a suitable value for k (the number of decision trees) in terms of RMSE, number of intersections (NOI) and execution time (ms), on the Credit Approval and CMC data sets.


Algorithm 1. DMI


Step-1: Divide a full data set DF into two sub data sets DC and DI.
To impute missing values in a data set, we first divide the data set DF into two sub data sets DC and DI, where DC contains records having no missing values and DI contains records having missing values (see Step-1 of Algorithm 1).

Step-2: Build a set of decision trees on DC.
In this step we first identify the attributes Ai (1 ≤ i ≤ M), where M is the total number of attributes in DI, having missing values. We make a temporary copy of DC into D′C. For each attribute Ai we build a decision tree (considering Ai as the class attribute) from the sub data set D′C, using any existing algorithm (such as C4.5 [15]). If Ai is a numerical attribute, we first generalize Ai of D′C into NC categories, where NC is the square root of the domain size of Ai (see Step-2 of Algorithm 1). In Step 2 of Algorithm 1, we also introduce a boolean variable Ij with initial value "FALSE" for a data set dj belonging to a leaf, for each leaf of all trees. The boolean variable is used in Step 4 to avoid unnecessary multiple imputation on a data set.

Step-3: Assign each record of DI to the leaf where it falls in. Impute categorical missing values.
Each record rk from the sub data set DI has missing value/s. If rk has a missing value for the attribute Ai then we use the tree Ti (that considers Ai as the class attribute) in order to identify the leaf Rj where rk falls in. Note that rk falls in Rj if the test attribute values of Rj match the attribute values of rk. We add rk to the segment dj representing Rj of Ti (see Step 3 of Algorithm 1). If Ai is a categorical attribute then the missing value of Ai in rk is imputed by the majority class value of Ai in dj. Otherwise, we use Step 4 to impute numerical missing values. While identifying the leaf of Ti where rk falls in, if we encounter a test attribute the value of which is also missing in rk then we will not be able to identify the exact leaf in Ti for rk. In that case we consider any leaf that belongs to the sub-tree of the test attribute the value of which is missing in rk.

Step-4: Impute numerical missing values using the EMI algorithm within the leaves.
We now impute numerical missing values for all records in DI one by one. For a record rk we identify a numerical attribute Ai having a missing value. We also identify the data set dj where rk has been added (in Step 3) for the imputation of a missing value in the numerical attribute Ai. If dj has not been imputed before (i.e. if Ij is FALSE) then we apply the EMI algorithm [13] for imputation and thereby impute the values of Ai in dj (see Step-4 of Algorithm 1). There are two problematic cases where the EMI algorithm needs to be used carefully. First, the EMI algorithm does not work if all records have the same value for a numerical attribute. Second, the EMI algorithm is also not useful when, for a record, all numerical values are missing. DMI initially ignores the attribute having the same value for all records. It also ignores the records having all numerical values missing. DMI then imputes all other values as usual. Finally, it uses the mean value of an attribute to impute a numerical missing value for a record having all numerical values missing. It also uses the mean value of an attribute to impute a missing value belonging to an attribute having the same value for all records.
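EMI itself is an EM-based method and its listing is not reproduced here. The sketch below shows only the conditional-mean imputation of one record under a Gaussian model, which is in the spirit of what is applied within a leaf; it is not the authors' EMI implementation (EMI additionally re-estimates the mean and covariance with EM).

```python
import numpy as np

def conditional_mean_impute(record, mu, cov):
    """Impute the missing entries of one record from the mean vector and
    covariance matrix of its segment, using the Gaussian conditional mean
    x_mis = mu_mis + C_mo C_oo^{-1} (x_obs - mu_obs). A sketch only."""
    x = np.asarray(record, float)
    mis = np.isnan(x)
    obs = ~mis
    if not mis.any():
        return x
    c_oo = cov[np.ix_(obs, obs)]            # covariance of the observed attributes
    c_mo = cov[np.ix_(mis, obs)]            # cross-covariance missing/observed
    x_imp = x.copy()
    x_imp[mis] = mu[mis] + c_mo @ np.linalg.solve(c_oo, x[obs] - mu[obs])
    return x_imp
```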

Step-5: Combine records to form a completed data set (D′F).
We finally combine DC and DI in order to form D′F, which is the imputed data set.
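The listing of Algorithm 1 appears only as a caption in this copy, so the following is a rough, simplified sketch of the steps just described for a single numerical attribute. It uses scikit-learn's DecisionTreeClassifier as a stand-in for C4.5 and replaces the EMI step with a leaf mean; it is not the authors' implementation.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def dmi_sketch(df, target_col, min_leaf=20):
    """Simplified DMI-style imputation for one numerical attribute.

    df: pandas DataFrame with numerical columns only; NaN marks missing values.
    Each missing value of target_col is replaced by the mean of its leaf,
    whereas DMI applies EMI [13] inside every leaf.
    """
    d_c = df.dropna()                              # records with no missing values (DC)
    d_i = df[df[target_col].isna()]                # records missing the target attribute
    features = [c for c in df.columns if c != target_col]

    # Generalise the numerical class attribute into ~sqrt(domain size) categories
    n_bins = max(2, int(np.sqrt(d_c[target_col].nunique())))
    y = pd.cut(d_c[target_col], bins=n_bins, labels=False)

    tree = DecisionTreeClassifier(min_samples_leaf=min_leaf).fit(d_c[features], y)
    leaf_of_dc = tree.apply(d_c[features])
    # crude routing of incomplete records through the tree (missing tests mean-filled)
    leaf_of_di = tree.apply(d_i[features].fillna(d_c[features].mean()))

    imputed = df.copy()
    for leaf_id, idx in zip(leaf_of_di, d_i.index):
        leaf_records = d_c[leaf_of_dc == leaf_id]
        imputed.loc[idx, target_col] = leaf_records[target_col].mean()
    return imputed
```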

2.3. SiMI

We first introduce the main steps of SiMI as follows, and then explain each of them in detail.

Step-1: Divide a full data set (DF) into two sub data sets DC and DI.
Step-2: Build a decision forest using DC.
Step-3: Find the intersections of the records belonging to the leaves of the forest.
Step-4: Merge a small sized intersection with the most suitable intersection.
Step-5: Impute numerical missing values using the EMI algorithm and categorical missing values using the most frequent values.
Step-6: Combine records to form a completed data set (D′F) without any missing values.

Step-1: Divide a full data set (DF) into two sub data sets DC and DI.
Similar to DMI, we first divide a data set DF into two sub data sets DC and DI (see Step-1 of Algorithm 2). In this step, we also use an existing algorithm [23] to estimate the similarity between a pair of categorical values belonging to an attribute, for all pairs and for all categorical attributes. The similarity of a pair of values can be anything between 0 and 1, where 0 means no similarity and 1 means a perfect similarity.

Step-2: Build a decision forest on DC.
We build a set of k decision trees T = {T1, T2, …, Tk} on DC using a decision forest algorithm such as SysFor [16], as shown in Step-2 of Algorithm 2. k is a user defined number.

Step-3: Find the intersections of the records belonging to the leaves of the forest.
Let the numbers of leaves for T1, T2, …, Tk be L1, L2, …, Lk. The total number of possible intersections of all trees is L1 × L2 × ⋯ × Lk. We then ignore the intersections that do not have any records in them. See Algorithm 2 and Procedure Splitting (Algorithm 3).
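The non-empty intersections can be obtained by grouping records on the tuple of leaf identifiers they receive from the k trees. The sketch below assumes ordinary fitted scikit-learn trees standing in for a SysFor forest [16]; it is not the Splitting procedure of Algorithm 3.

```python
import numpy as np
from collections import defaultdict

def leaf_intersections(trees, X):
    """Group the records of X (a 2D array with no missing values, i.e. DC) by the
    tuple of leaf ids assigned by each tree; every non-empty group is one
    intersection in the sense of Step 3."""
    leaf_ids = np.column_stack([t.apply(X) for t in trees])   # shape (n_records, k)
    groups = defaultdict(list)
    for row_idx, key in enumerate(map(tuple, leaf_ids)):
        groups[key].append(row_idx)
    return list(groups.values())   # lists of record indices; empty intersections never appear
```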


Table 2. Data sets at a glance.

Data set          Records   Num. attr.   Cat. attr.   Missing   Pure records
Adult             32,561    6            9            Yes       30,162
Chess             28,056    3            4            No        28,056
Yeast             1,484     8            1            No        1,484
CMC               1,473     2            8            No        1,473
GermanCA          1,000     7            14           No        1,000
Pima              768       8            1            No        768
Credit Approval   690       6            10           Yes       653
Housing           506       11           3            No        506
Autompg           398       5            3            Yes       392

Table 3. Average performance on the Chess, Yeast, Adult and GermanCA data sets. In each cell the four values are SiMI / DMI / EMI / IBLLS; higher values are better for R2 and d2, lower values are better for RMSE and MAE.

Chess
  Missing ratio 1%:   R2 0.196/0.149/0.097/0.066   d2 0.586/0.493/0.432/0.431   RMSE 0.271/0.309/0.318/0.490   MAE 0.227/0.262/0.272/0.387
  Missing ratio 3%:   R2 0.180/0.150/0.097/0.065   d2 0.544/0.494/0.432/0.428   RMSE 0.273/0.308/0.318/0.489   MAE 0.230/0.262/0.272/0.386
  Missing ratio 5%:   R2 0.181/0.151/0.096/0.065   d2 0.530/0.496/0.432/0.427   RMSE 0.266/0.308/0.318/0.491   MAE 0.224/0.261/0.272/0.389
  Missing ratio 10%:  R2 0.166/0.150/0.097/0.065   d2 0.507/0.495/0.432/0.429   RMSE 0.275/0.308/0.318/0.486   MAE 0.231/0.261/0.272/0.384
  Missing model Overall:  R2 0.178/0.149/0.096/0.065   d2 0.541/0.494/0.432/0.428   RMSE 0.266/0.308/0.318/0.489   MAE 0.224/0.261/0.272/0.387
  Missing model UD:       R2 0.184/0.150/0.098/0.065   d2 0.543/0.495/0.432/0.429   RMSE 0.277/0.309/0.318/0.488   MAE 0.232/0.262/0.272/0.386
  Missing pattern Simple:   R2 0.202/0.167/0.100/0.064   d2 0.582/0.518/0.436/0.440   RMSE 0.248/0.305/0.318/0.542   MAE 0.208/0.262/0.272/0.425
  Missing pattern Medium:   R2 0.180/0.153/0.098/0.064   d2 0.535/0.498/0.433/0.424   RMSE 0.272/0.308/0.318/0.483   MAE 0.230/0.262/0.272/0.382
  Missing pattern Complex:  R2 0.164/0.130/0.095/0.067   d2 0.509/0.469/0.428/0.424   RMSE 0.286/0.312/0.318/0.441   MAE 0.241/0.261/0.272/0.352
  Missing pattern Blended:  R2 0.178/0.149/0.095/0.065   d2 0.541/0.493/0.430/0.426   RMSE 0.279/0.308/0.318/0.490   MAE 0.233/0.261/0.272/0.388

Yeast
  Missing ratio 1%:   R2 0.794/0.772/0.750/0.747   d2 0.949/0.933/0.910/0.919   RMSE 0.108/0.115/0.122/0.121   MAE 0.065/0.069/0.076/0.081
  Missing ratio 3%:   R2 0.796/0.790/0.754/0.752   d2 0.945/0.940/0.912/0.920   RMSE 0.106/0.111/0.116/0.122   MAE 0.065/0.068/0.074/0.087
  Missing ratio 5%:   R2 0.785/0.779/0.751/0.702   d2 0.941/0.936/0.910/0.898   RMSE 0.108/0.114/0.118/0.134   MAE 0.066/0.070/0.075/0.095
  Missing ratio 10%:  R2 0.780/0.780/0.752/0.681   d2 0.937/0.936/0.909/0.883   RMSE 0.111/0.113/0.118/0.140   MAE 0.067/0.070/0.075/0.103
  Missing model Overall:  R2 0.791/0.783/0.753/0.726   d2 0.943/0.937/0.911/0.906   RMSE 0.108/0.113/0.118/0.128   MAE 0.066/0.069/0.075/0.091
  Missing model UD:       R2 0.787/0.777/0.750/0.715   d2 0.943/0.935/0.909/0.904   RMSE 0.108/0.114/0.119/0.130   MAE 0.066/0.069/0.075/0.092
  Missing pattern Simple:   R2 0.807/0.789/0.757/0.739   d2 0.949/0.939/0.913/0.911   RMSE 0.103/0.110/0.115/0.124   MAE 0.062/0.069/0.074/0.089
  Missing pattern Medium:   R2 0.786/0.783/0.754/0.744   d2 0.943/0.937/0.911/0.915   RMSE 0.110/0.113/0.118/0.124   MAE 0.066/0.068/0.074/0.088
  Missing pattern Complex:  R2 0.762/0.769/0.746/0.692   d2 0.933/0.933/0.907/0.895   RMSE 0.117/0.116/0.121/0.136   MAE 0.071/0.070/0.077/0.095
  Missing pattern Blended:  R2 0.800/0.779/0.750/0.707   d2 0.948/0.936/0.910/0.899   RMSE 0.103/0.114/0.119/0.133   MAE 0.064/0.070/0.075/0.094

Adult
  Missing ratio 1%:   R2 0.803/0.753/0.750/0.520   d2 0.948/0.923/0.887/0.840   RMSE 0.092/0.125/0.126/0.184   MAE 0.066/0.082/0.082/0.131
  Missing ratio 3%:   R2 0.792/0.753/0.751/0.391   d2 0.947/0.921/0.887/0.757   RMSE 0.099/0.125/0.126/0.215   MAE 0.070/0.085/0.085/0.164
  Missing ratio 5%:   R2 0.785/0.749/0.748/0.341   d2 0.953/0.919/0.885/0.741   RMSE 0.099/0.127/0.127/0.219   MAE 0.067/0.088/0.086/0.167
  Missing ratio 10%:  R2 0.785/0.751/0.749/0.250   d2 0.942/0.918/0.885/0.678   RMSE 0.099/0.127/0.127/0.236   MAE 0.069/0.089/0.087/0.181
  Missing model Overall:  R2 0.796/0.752/0.749/0.390   d2 0.951/0.921/0.886/0.768   RMSE 0.093/0.126/0.126/0.213   MAE 0.066/0.085/0.085/0.159
  Missing model UD:       R2 0.787/0.751/0.750/0.362   d2 0.944/0.920/0.886/0.740   RMSE 0.101/0.126/0.126/0.214   MAE 0.070/0.086/0.085/0.161
  Missing pattern Simple:   R2 0.839/0.755/0.753/0.527   d2 0.969/0.921/0.887/0.811   RMSE 0.074/0.126/0.126/0.190   MAE 0.048/0.082/0.086/0.149
  Missing pattern Medium:   R2 0.770/0.752/0.750/0.392   d2 0.936/0.920/0.886/0.762   RMSE 0.108/0.126/0.126/0.212   MAE 0.077/0.085/0.086/0.158
  Missing pattern Complex:  R2 0.759/0.749/0.747/0.224   d2 0.922/0.921/0.885/0.703   RMSE 0.135/0.125/0.127/0.239   MAE 0.089/0.088/0.084/0.174
  Missing pattern Blended:  R2 0.798/0.751/0.749/0.359   d2 0.963/0.920/0.886/0.740   RMSE 0.081/0.126/0.126/0.214   MAE 0.058/0.089/0.086/0.161

GermanCA
  Missing ratio 1%:   R2 0.443/0.431/0.366/0.283   d2 0.797/0.769/0.732/0.699   RMSE 0.242/0.259/0.287/0.304   MAE 0.179/0.190/0.229/0.232
  Missing ratio 3%:   R2 0.421/0.402/0.361/0.249   d2 0.772/0.759/0.722/0.678   RMSE 0.247/0.269/0.286/0.317   MAE 0.187/0.200/0.228/0.247
  Missing ratio 5%:   R2 0.436/0.401/0.359/0.211   d2 0.777/0.759/0.721/0.656   RMSE 0.240/0.268/0.286/0.326   MAE 0.180/0.198/0.229/0.255
  Missing ratio 10%:  R2 0.442/0.392/0.359/0.216   d2 0.772/0.749/0.723/0.651   RMSE 0.239/0.270/0.284/0.322   MAE 0.180/0.201/0.228/0.255
  Missing model Overall:  R2 0.433/0.406/0.362/0.241   d2 0.778/0.758/0.723/0.671   RMSE 0.243/0.268/0.287/0.318   MAE 0.183/0.198/0.229/0.248
  Missing model UD:       R2 0.438/0.407/0.360/0.239   d2 0.781/0.760/0.726/0.672   RMSE 0.240/0.266/0.285/0.316   MAE 0.180/0.197/0.227/0.247
  Missing pattern Simple:   R2 0.502/0.419/0.369/0.350   d2 0.808/0.772/0.738/0.726   RMSE 0.223/0.263/0.280/0.278   MAE 0.160/0.190/0.222/0.210
  Missing pattern Medium:   R2 0.403/0.426/0.364/0.258   d2 0.768/0.768/0.729/0.681   RMSE 0.251/0.262/0.285/0.311   MAE 0.193/0.200/0.228/0.245
  Missing pattern Complex:  R2 0.386/0.384/0.353/0.126   d2 0.755/0.741/0.706/0.616   RMSE 0.260/0.272/0.291/0.358   MAE 0.201/0.198/0.235/0.281
  Missing pattern Blended:  R2 0.452/0.397/0.359/0.226   d2 0.788/0.756/0.724/0.662   RMSE 0.233/0.270/0.286/0.321   MAE 0.172/0.201/0.229/0.253


Algorithm 2. SiMI


Table 4. Average performance on the Pima, Housing, Autompg, Credit Approval and CMC data sets (same layout as Table 3: for each data set, results are aggregated by missing ratio, missing model and missing pattern, with SiMI/DMI/EMI/IBLLS values under R2, d2, RMSE and MAE).


Table 5. Number of times a technique performs the best, out of 288 cases (32 missing combinations per data set × 9 data sets).

            R2                      d2                      RMSE                    MAE
            SiMI  DMI  EMI  IBLLS   SiMI  DMI  EMI  IBLLS   SiMI  DMI  EMI  IBLLS   SiMI  DMI  EMI  IBLLS
Total       243   45   0    0       246   42   0    0       263   24   1    0       248   37   3    0


Algorithm 3. Procedure Splitting ()

Algorithm 4. Procedure Merging ()

Step-4: Merge a small sized intersection with the most suitable intersection.
We first normalize the numerical values of the whole data set D into D′, and thereby convert the domain of each numerical attribute to [0, 1]. The CreateIntersections() function (see Algorithm 4) re-creates all intersections Y′i, ∀i, from D′. If the size of the smallest intersection is smaller than a user defined intersection size threshold s then we merge the smallest intersection with the most suitable intersection. The most suitable intersection is the one that maximizes a metric Sj (see Algorithm 4) based on the similarity among the records and the correlations among the attributes within the merged intersection.


Fig. 6. 95% Confidence interval of d2 values on nine data sets in terms of the "Simple" missing pattern.


Similarity Wj between two intersections is calculated using the average distance of the pairs of records, where the records of a pair belong to the two intersections. Distance between two records is calculated using the Euclidean distance for numerical attributes and a similarity based distance for categorical attributes. If the similarity between two categorical values i and j is sij then their distance is dij = 1 − sij. Similarity of categorical values belonging to an attribute varies between 0 and 1, and is estimated based on an existing technique [23]. Once two intersections are merged, we repeat the same process if the number of records in the smallest intersection is smaller than s. The L2 norm [24] of the correlation matrix of a merged intersection is calculated by

$\sqrt{\sum_{i=1}^{r}\sum_{j=1}^{c} x_{i,j}^{2}} \;/\; (|A| \times |A|)$

where r is the number of rows of the correlation matrix, c is the number of columns, x_{i,j} is the value in the ith row and jth column of the correlation matrix, and |A| is the number of attributes of the data set.
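The two ingredients just described (average similarity among the merged records and the L2 norm of their correlation matrix) can be combined into a λ-weighted score as sketched below. This is only an illustration of the idea behind the metric Sj; the exact form of Sj is defined in Algorithm 4, which is not reproduced in this copy, and the similarity term here handles numerical attributes only.

```python
import numpy as np

def merge_score(candidate, smallest, lam=0.7):
    """Score a candidate intersection for merging with the smallest one.

    Both arguments: 2D arrays of normalised numerical attribute values.
    A sketch combining average similarity (inverse distance) and the L2 norm
    of the correlation matrix, weighted by the Similarity-Correlation threshold.
    """
    merged = np.vstack([candidate, smallest])
    diffs = merged[:, None, :] - merged[None, :, :]
    avg_dist = np.sqrt((diffs ** 2).sum(-1)).mean()          # average pairwise distance
    similarity = 1.0 / (1.0 + avg_dist)
    corr = np.corrcoef(merged, rowvar=False)
    n_attr = merged.shape[1]
    corr_norm = np.sqrt(np.sum(corr ** 2)) / (n_attr * n_attr)  # L2 norm as defined above
    return lam * similarity + (1.0 - lam) * corr_norm
```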

Step-5: Impute numerical missing values using the EMI algorithm and categorical missing values using the most frequent values.
Missing values of all records in DI are imputed one by one. For a record rk ∈ DI, we first identify the intersection Yj ∈ Y where rk belongs. If a missing attribute Ai is numerical then we apply the EMI algorithm [13] for imputing all numerical missing values in Yj. If the attribute Ai is categorical, we use the most frequent value of Ai in Yj as the imputed value of Ai.

Step-6: Combine records to form a completed data set (D′F).
We finally combine DC and DI in order to form a completed data set D′F without any missing values.

2.4. Complexity analysis

We now analyse the complexity of DMI, SiMI, EMI and IBLLS. We consider that we have a data set with n records and m attributes. We also consider that there are m′ attributes with missing values over the whole data set, nI records with one or more missing values, and nc records (nc = n − nI) with no missing values. DMI and SiMI use the C4.5 algorithm [25,15] to build decision trees, which requires a user input on the minimum number of records in a leaf. We consider this to be l.

First we analyse the complexity of DMI. Step 2 of the DMI algorithm uses the C4.5 algorithm, which has a complexity O(nc·m²) [26] if it is applied on nc records and m attributes. Step 4 uses the EMI algorithm, which has the complexity O(nm² + m³) if EMI is applied on n records and m attributes. However, in Step 4 EMI is applied repeatedly on different data segments representing the leaves or logic rules. In the worst case scenario, for the ith tree we can have at most nc/l leaves, where the minimum number of records for a leaf is l. If nI ≤ nc/l then EMI is applied at most nI times, else if nI > nc/l then it is applied at most nc/l times. Therefore, the maximum number of times EMI can be applied is nc/l, resulting in a maximum complexity for EMI of O((nc/l)(lm² + m³)). Considering that there are m′ attributes with missing values, resulting in m′ trees (following the for loops of Step 4), the complexity of the step is O(m′(nc/l)(lm² + m³)).

Generally, l is chosen to be a small number which is significantly smaller than nc. Therefore, we can ignore l from the complexity analysis. Moreover, we assume that nI ≪ nc, and nc ≈ n. Therefore the overall complexity of Algorithm 1 can be estimated as O(n²mm′ + nm³m′). For typical data sets (such as those used in the experiments of this study) having n ≫ m the complexity is O(n²). However, for data sets having m ≥ n the complexity is O(m³) if m′ ≪ m or O(m⁴) if m′ ≈ m.

We now analyse the complexity of the SiMI algorithm (see Algorithm 2). The complexity of Step 1 for preparing DI and DC is O(nm). However, the complexity of the similarity calculation [23] in Step 1 is O(nm²d² + m³d⁵), where we consider that the domain size of each attribute is d.

The overall complexity of Step 2 is O(nc·m²·k + nc²·m·k/l). Step 3 (see Algorithm 3) finds the non-empty intersections of the leaves of the k trees. However, note that the number of non-empty intersections can only be at most n in the worst case scenario, and therefore L (in the algorithm) can be at most equal to n. We consider that each leaf has the minimum number of records (i.e. l) in order to allow us to consider the worst case scenario where the number of leaves is the maximum, i.e. n/l. Therefore, the complexity of this step is O(n²lk).

Step 4 (see Algorithm 4) can have two extreme situations: initially we can have n intersections, each having 1 record; after merging the records we may eventually end up in another extreme scenario where we have only two intersections, one having s records and the other one having (n − s) records. Considering both situations in two iterations of the while loop, we find that the worst case complexity of the step is O(n²m + nm²).

In Step 5 we can have at most n/s intersections altogether. Therefore, the EMI algorithm may need to be applied at most n/s times, resulting in a complexity for the step of O((n/s)(sm² + m³)). Considering s to be very small, the complexity of the step is O(nm³). The complexity of Step 6 is O(nm).

Typically, the d, k, l and s values are very small, especially compared to n. Besides, we can also consider nI to be very small and therefore nc ≈ n. Hence, the overall complexity of SiMI is O(n²m + nm³). Moreover, for low dimensional data sets such as those used in this study the complexity is O(n²). We estimate the complexities of EMI and IBLLS (i.e. the techniques that we use in the experiments of this study) as O(nm² + m³) and O(n³m² + nm⁴), respectively. This is also reflected in the execution time complexity analysis in the next section (see Table 6).

3. Experimental results

We implement our imputation techniques, i.e. DMI and SiMI, and two existing techniques, namely EMI [13] and IBLLS [6]. The existing techniques were shown in the literature to be better than many other techniques including Bayesian principal component analysis (BPCA) [27], LLSI [7], and ILLSI [28].

3.1. Data sets

We apply the techniques on nine real life data sets as shown in Table 2. The data sets are publicly available in the UCI Machine Learning Repository [21]. Some data sets already contain missing values, as indicated by the column called "Missing" in Table 2. For the purpose of our experiments, we first need a data set without any missing values to start with. Therefore, we remove the records having any missing values in order to prepare a data set without any missing values. We then artificially create missing values in the data set, which are imputed by the different techniques. Since the original values of the artificially created missing data are known to us, we can easily evaluate the accuracy/performance of the imputation techniques.

3.2. Simulation of missing values

The performance of an imputation technique typically depends on both the characteristics/patterns of missing data and the amount of missing data [17]. Therefore, we use various types of missing values [14,17,29] and different amounts of missing values as explained below.

We consider the simple, medium, complex and blended missing patterns. In a simple pattern we consider that a record can have at most one missing value, whereas in a medium pattern, if a record has any missing values then it has a minimum of 2 attributes with missing values and can have up to 50% of the attributes with missing values. Similarly, a record having missing values in a complex pattern has a minimum of 50% and a maximum of 80% of the attributes with missing values. In a blended pattern we have a mixture of records from all three other patterns. A blended pattern contains 25%, 50%, and 25% of the records having missing values in the simple pattern, medium pattern and complex pattern, respectively [17]. Additionally, for each of the missing patterns, we use different missing ratios (1%, 3%, 5% and 10%), where an x% missing ratio means x% of the total attribute values (not records) of a data set are missing.

We also use two types of missing models, namely Overall and Uniformly Distributed (UD) [14]. In the overall model, missing values are not necessarily equally spread out among the attributes, and in the worst case scenario all missing values of a data set can even belong to a single attribute. However, in the UD model each attribute has an equal number of missing values.

Note that there are 32 combinations (4 missing patterns × 4 missing ratios × 2 missing models) of the types of missing values. For each combination, missing values are created randomly. Due to the randomness, every time we create missing values in a data set we are likely to have a different set of missing values. Therefore, for each combination we create 10 data sets with missing values, i.e. we create altogether 320 data sets (32 combinations × 10 data sets per combination) with missing values for each natural data set.
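A rough sketch of how such artificial missing values can be injected is given below. It covers only the simple and medium patterns and the overall model; the complex, blended and UD variants, and the exact record-selection rules of [14,17], are omitted for brevity.

```python
import numpy as np

def make_missing(data, ratio=0.05, pattern="simple", rng=None):
    """Create artificial missing values in a complete numeric numpy array.

    'simple': at most one missing value per affected record;
    'medium': between 2 and ~50% of the attributes of an affected record.
    Illustrative sketch only, not the simulation code used in the paper.
    """
    rng = rng or np.random.default_rng(0)
    out = data.astype(float).copy()
    n, m = out.shape
    n_missing = int(ratio * n * m)          # x% of the total attribute values
    rows = rng.permutation(n)
    i = 0
    while n_missing > 0 and i < n:
        r = rows[i]
        if pattern == "simple":
            cols = rng.choice(m, size=1, replace=False)
        else:  # 'medium'
            k = rng.integers(2, max(3, m // 2 + 1))
            cols = rng.choice(m, size=min(k, m), replace=False)
        cols = cols[:n_missing]
        out[r, cols] = np.nan
        n_missing -= len(cols)
        i += 1
    return out
```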

3.3. Result analysis for all data sets

Due to the space limitation we present summarized results in Tables 3 and 4. For each data set, we aggregate the results based on missing ratios, missing models, and missing patterns. The aggregated result for 1% missing ratio means the average value over all combinations having 1% missing ratio, any missing pattern and any missing model. Bold values represent the best results and the values in italic font represent the second best results. It is clear from the tables that SiMI performs significantly better than the other three techniques. SiMI and DMI never lose to EMI and IBLLS even for a single case for R2, d2 and RMSE. They only lose twice for MAE.

Since space limitations prevent us from presenting the detailed non-aggregated results, we present Table 5, which displays the number of times each technique performs the best for an evaluation criterion. There are nine data sets and 32 combinations per data set, resulting in 288 combinations altogether. SiMI performs the best in 243 out of the 288 combinations for R², and DMI performs the best in the remaining 45 combinations.

In Fig. 6 we present line graphs of the average d² values and 95% confidence intervals for all four techniques and all combinations with the Simple pattern. There are only three combinations, encircled in the figure, where SiMI has either a lower average than DMI, or a higher average but overlapping confidence intervals with DMI. DMI clearly performs the second best. Higher average d² values and non-overlapping confidence intervals indicate a statistically significant superiority of SiMI and DMI over the existing techniques. Confidence interval analysis for all other patterns (not presented here) demonstrates a similar trend. We also perform a t-test analysis at p = 0.0005, which suggests a significantly better result by SiMI and DMI over EMI and IBLLS. For the comparison between SiMI and IBLLS, the t values are higher than the t-ref value for all evaluation criteria on all data sets. For the comparison between SiMI and EMI, the t values are also higher than the t-ref value for all evaluation criteria, except for some criteria on the CMC and CA data sets.
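The statistics referred to here can be reproduced with standard tools. The sketch below (SciPy, with randomly generated placeholder d² scores and an assumed paired test design, since the paper does not describe its exact test setup) shows one way to compute a 95% confidence interval around a mean and a t statistic for two techniques.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Placeholder d2 scores for one missing-value combination: 10 corrupted data
# sets, two techniques.  Real values would come from the experiments.
simi_d2 = rng.normal(0.90, 0.02, size=10)
dmi_d2 = rng.normal(0.86, 0.02, size=10)

def mean_ci95(x):
    """Mean and 95% confidence interval based on the t distribution."""
    x = np.asarray(x)
    half = stats.t.ppf(0.975, df=len(x) - 1) * stats.sem(x)
    return x.mean(), (x.mean() - half, x.mean() + half)

for name, scores in (("SiMI", simi_d2), ("DMI", dmi_d2)):
    m, ci = mean_ci95(scores)
    print(f"{name}: mean d2 = {m:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")

# A paired t-test (the pairing is our assumption) in the spirit of the
# paper's analysis at p = 0.0005.
t_stat, p_value = stats.ttest_rel(simi_d2, dmi_d2)
print(f"t = {t_stat:.2f}, p = {p_value:.2g}")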

Fig. 7. Performance comparison on nine data sets.


Fig. 8. Performance comparison for categorical imputation on nine data sets.


Fig. 7 presents the average performance indicators for the 320 data sets (i.e. 32 missing combinations × 10 data sets per combination) for all techniques. SiMI and DMI clearly perform better than the existing techniques.

3.4. Experimentation on categorical missing value imputation for all data sets

Another advantage of SiMI and DMI is that, unlike EMI and IBLLS, both of them can impute categorical missing values in addition to numerical missing values. Therefore, we also compare the imputation accuracy of SiMI and DMI based on RMSE and MAE for categorical missing values, as shown in Fig. 8. In the figure we present the overall average (for all 32 combinations) of the RMSE and MAE values for SiMI and DMI on all nine data sets.
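RMSE and MAE are defined on numerical differences, so scoring categorical imputations requires a convention. A common one, used below as an assumption rather than a statement of the paper's exact encoding, is to score each imputed value as 0 when the category is recovered correctly and 1 otherwise.

import numpy as np

def categorical_rmse_mae(actual, imputed):
    """RMSE and MAE over categorical imputations, scoring each value as 0 for
    a correct category and 1 for an incorrect one."""
    err = (np.asarray(actual, dtype=object) != np.asarray(imputed, dtype=object)).astype(float)
    mae = err.mean()                           # proportion of wrongly imputed categories
    rmse = float(np.sqrt((err ** 2).mean()))   # equals sqrt(MAE) for 0/1 errors
    return rmse, mae

actual = ["yes", "no", "no", "maybe", "yes"]
imputed = ["yes", "no", "yes", "maybe", "no"]
print(categorical_rmse_mae(actual, imputed))   # (0.632..., 0.4)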

3.5. Execution time complexity analysis

We now present in Table 6 the average execution time (in milliseconds) over the 320 data sets (32 combinations × 10 data sets per combination) for each real data set. We carry out the experiments using two different machines; however, for any given data set we use the same machine for all techniques. Machine 1 has 4 × 8-core Intel E7-8837 Xeon processors and 256 GB RAM. Machine 2 has an Intel Core i5 processor with 2.67 GHz clock speed and 4 GB RAM. Both SiMI and DMI take less time than IBLLS, whereas they take more time than EMI, which is the price paid for a significantly better quality of imputation.
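A minimal sketch of how such per-technique wall-clock times can be recorded (the imputation routine is a stand-in; the paper's own timing harness is not described) is:

import time

def time_ms(impute_fn, data):
    """Wall-clock execution time of one imputation run, in milliseconds."""
    start = time.perf_counter()
    impute_fn(data)
    return (time.perf_counter() - start) * 1000.0

# Trivial stand-in for an imputation routine, just to make the sketch runnable.
print(f"{time_ms(sorted, list(range(100_000))):.1f} ms")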

4. Conclusion

We argue that real-life data sets are likely to have horizontal segments where the records within a segment have higher similarity and attribute correlations than the similarity and correlations of the whole data set. We empirically justify that decision trees and forests can be used to explore such segments. Additionally, the use of the segments for imputing missing values can improve the imputation accuracy. However, if the size of a segment is very small then it can result in a low imputation accuracy. Therefore, we use another novel approach to merge a small segment with a suitable segment in such a way that the similarity among the records and the correlations among the attributes increase within the merged segment. Numerical and categorical missing values are then imputed using the records of a segment.

Table 6
Average execution time (in milliseconds) of different techniques on the nine data sets.

Data set          SiMI        DMI         EMI      IBLLS        Machine used
Adult             1,452,890   1,302,609   82,189   53,947,274   Machine 1
Chess             572,701     153,915     8667     15,537,849   Machine 1
Yeast             5270        3024        92       173,209      Machine 1
CMC               36,464      40,195      469      233,994      Machine 2
GermanCA          2798        11,225      62       58,044       Machine 1
Pima              2614        13,406      47       39,443       Machine 1
Credit Approval   8550        30,252      1175     390,463      Machine 2
Housing           99,109      88,322      3087     1,268,431    Machine 2
Autompg           523         2215        18       8861         Machine 1

Both of our proposed techniques have basic differences with the existing techniques used in this study. An existing technique called EMI imputes a missing value by using all records of a data set, whereas the proposed techniques (DMI and SiMI) impute a missing value by using only a group of records (instead of all records) that are similar to the record having the missing value. Another existing technique, IBLLS, also imputes a missing value by using similar records. However, it also uses only a subset of the attributes, namely those whose correlation with the attribute having the missing value is high; it completely ignores all other attributes having a correlation lower than a threshold value. On the other hand, instead of totally ignoring the attributes with low correlations, both DMI and SiMI consider all attributes according to their correlation strength: an attribute having a higher correlation has a higher influence on the imputation than an attribute having a lower correlation. Additionally, IBLLS uses a k-NN approach to find the similar records, whereas our techniques use decision trees and forests for this purpose. The correlations of the attributes and the similarities of the records are generally higher when trees and forests are used to find the similar records.
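The design difference can be made concrete with a toy numerical example (ours, not the actual IBLLS, DMI or SiMI computations): thresholding discards low-correlation attributes entirely, whereas correlation weighting lets every available attribute contribute in proportion to its correlation.

import numpy as np

# Toy numbers only: correlations between the attribute being imputed and the
# record's four available attributes, and the record's standardised values.
corr = np.array([0.90, 0.60, 0.20, 0.05])
values = np.array([1.2, 0.4, -0.8, 2.0])

# Threshold style: attributes below the cut-off are ignored completely.
mask = corr >= 0.5
threshold_estimate = np.average(values[mask], weights=corr[mask])

# Weighting style: every attribute contributes, scaled by its correlation,
# so weakly correlated attributes have a small but non-zero influence.
weighted_estimate = np.average(values, weights=corr)

print(f"threshold-based estimate: {threshold_estimate:.3f}")
print(f"correlation-weighted estimate: {weighted_estimate:.3f}")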

In addition to the two novel imputation techniques, we also present a complexity analysis of our techniques. Nine publicly available data sets are used for experimentation, which suggests a clear superiority of our techniques, as indicated in Table 5 and Fig. 7 along with the other tables and figures. However, both DMI and SiMI have a complexity of O(n²) for low-dimensional data sets and O(m³) for high-dimensional data sets. Therefore, our future research plans include a further improvement of our techniques in terms of computational complexity and memory usage. We also plan to explore the suitability of our algorithms for parallel processing.

References

[1] X. Zhu, S. Zhang, Z. Jin, Z. Zhang, Z. Xu, Missing value estimation for mixed-attribute data sets, IEEE Transactions on Knowledge and Data Engineering 23 (1) (2011) 110–121.

[2] A. Farhangfar, L. Kurgan, J. Dy, Impact of imputation of missing values on classification error for discrete data, Pattern Recognition 41 (12) (2008) 3692–3705.

[3] A. Farhangfar, L.A. Kurgan, W. Pedrycz, A novel framework for imputation of missing values in databases, IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans 37 (5) (2007) 692–709.

[4] G. Batista, M. Monard, An analysis of four missing data treatment methods for supervised learning, Applied Artificial Intelligence 17 (5-6) (2003) 519–533.

[5] C. Liu, D. Dai, H. Yan, The theoretic framework of local weighted approximation for microarray missing value estimation, Pattern Recognition 43 (8) (2010) 2993–3002.

[6] K. Cheng, N. Law, W. Siu, Iterative bicluster-based least square framework for estimation of missing values in microarray gene expression data, Pattern Recognition 45 (4) (2012) 1281–1289, http://dx.doi.org/10.1016/j.patcog.2011.10.012.

[7] H. Kim, G. Golub, H. Park, Missing value estimation for DNA microarray gene expression data: local least squares imputation, Bioinformatics 21 (2) (2005) 187–198.

[8] M. Abdella, T. Marwala, The use of genetic algorithms and neural networks to approximate missing data in database, in: IEEE 3rd International Conference on Computational Cybernetics, 2005, ICCC 2005, IEEE, 2005, pp. 207–212.

[9] I.B. Aydilek, A. Arslan, A novel hybrid approach to estimating missing values in databases using k-nearest neighbors and neural networks, International Journal of Innovative Computing, Information and Control 8 (7A) (2012) 4705–4717.

[10] E.-L. Silva-Ramírez, R. Pino-Mejías, M. López-Coello, M.-D. Cubiles-de-la Vega, Missing value imputation on missing completely at random data using multilayer perceptrons, Neural Networks 24 (1) (2011) 121–129.

[11] I.B. Aydilek, A. Arslan, A hybrid method for imputation of missing values using optimized fuzzy c-means with support vector regression and a genetic algorithm, Information Sciences 233 (2013) 25–35.

[12] Z. Liao, X. Lu, T. Yang, H. Wang, Missing data imputation: a fuzzy k-means clustering algorithm over sliding window, in: Sixth International Conference on Fuzzy Systems and Knowledge Discovery, 2009, FSKD’09, vol. 3, IEEE, 2009, pp. 133–137.

[13] T. Schneider, Analysis of incomplete climate data: estimation of mean values and covariance matrices and imputation of missing values, Journal of Climate 14 (5) (2001) 853–871.

[14] M.G. Rahman, M.Z. Islam, A decision tree-based missing value imputation technique for data pre-processing, in: Australasian Data Mining Conference (AusDM 11), vol. 121 of CRPIT, ACS, Ballarat, Australia, 2011, pp. 41–50. <http://crpit.com/confpapers/CRPITV121Rahman.pdf>.

[15] J.R. Quinlan, Improved use of continuous attributes in C4.5, Journal of Artificial Intelligence Research 4 (1996) 77–90.

[16] M.Z. Islam, H. Giggins, Knowledge discovery through SysFor – a systematically developed forest of multiple decision trees, in: Australasian Data Mining Conference (AusDM 11), vol. 121 of CRPIT, ACS, Ballarat, Australia, 2011, pp. 195–204.

[17] H. Junninen, H. Niska, K. Tuppurainen, J. Ruuskanen, M. Kolehmainen, Methods for imputation of missing values in air quality data sets, Atmospheric Environment 38 (18) (2004) 2895–2907.

[18] C. Willmott, Some comments on the evaluation of model performance, Bulletin of the American Meteorological Society 63 (1982) 1309–1369.

[19] M.Z. Islam, L. Brankovic, Privacy preserving data mining: a noise addition framework using a novel clustering technique, Knowledge-Based Systems 24 (8) (2011) 1214–1223.

[20] M.Z. Islam, L. Brankovic, Detective: a decision tree based categorical value clustering and perturbation technique in privacy preserving data mining, in: The 3rd International IEEE Conference on Industrial Informatics (INDIN 2005), IEEE, 2006, pp. 1081–1086.

[21] A. Frank, A. Asuncion, UCI machine learning repository, 2010. <http://archive.ics.uci.edu/ml> (accessed 07.06.12).

[22] R. Jornsten, H.-Y. Wang, W.J. Welsh, M. Ouyang, DNA microarray data imputation and significance analysis of differential expression, Bioinformatics 21 (22) (2005) 4155–4161.

[23] H.P. Giggins, L. Brankovic, Vicus – a noise addition technique for categorical data, in: Australasian Data Mining Conference (AusDM 12), vol. 134 of CRPIT, ACS, Sydney, Australia, 2012, pp. 139–148.

[24] C. Meyer, Matrix Analysis and Applied Linear Algebra Book and Solutions Manual, No. 71, Society for Industrial Mathematics, 2000.

[25] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, California, USA, 1993.

[26] J. Su, H. Zhang, A fast decision tree learning algorithm, in: Proceedings of the National Conference on Artificial Intelligence, vol. 21, AAAI Press/MIT Press, Menlo Park, CA, Cambridge, MA, London, 1999, 2006, p. 500.

[27] S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara, S. Ishii, A Bayesian missing value estimation method for gene expression profile data, Bioinformatics 19 (16) (2003) 2088–2096.

[28] Z. Cai, M. Heydari, G. Lin, Iterated local least squares microarray missing value imputation, Journal of Bioinformatics and Computational Biology 4 (5) (2006) 935–958.

[29] M.G. Rahman, M.Z. Islam, T. Bossomaier, J. Gao, CAIRAD: a co-appearance based analysis for incorrect records and attribute-values detection, in: The 2012 International Joint Conference on Neural Networks (IJCNN), IEEE, Brisbane, Australia, 2012, pp. 1–10, doi: 10.1109/IJCNN.2012.6252669.