Combining Univariate and Multivariate Bottom-up Discretization
Yu Sang and Keqiu Li
Most inductive learning methods require that the training data set contain only discrete attributes, which makes it necessary to discretize continuous numeric attributes. Current efforts mainly focus on discretizing individual attributes, without taking into account either the correlations among attributes or the number of inconsistent records produced by discretization. In addition, existing methods address only the one-dimensional problem, without extensively considering the effect of interval size and class number on discretization schemes. In this paper, we present a method that combines univariate and multivariate bottom-up discretization and employs novel merging and stopping criteria. First, we present a new merging criterion based on both univariate and multivariate measurements, which jointly evaluates the variance among adjacent interval pairs to find the best merge and effectively captures the correlations among continuous attributes. This is achieved by using the Minimum Description Length Principle and by developing a measure of the significance of an interval pair across attributes. The advantages of the proposed merging criterion are further analyzed. Second, we present a new stopping criterion that aims to control the degree of misclassification while maximizing merging accuracy. Moreover, we develop an algorithm that finds the best discretization scheme based on the new merging and stopping criteria. Detailed analysis shows that the proposed method brings higher accuracy to the discretization process. Finally, empirical experiments on 18 real data sets show that our method generates better discretization schemes that significantly improve classification accuracy compared with existing methods, as evaluated with popular learning systems such as the C4.5 decision tree.
Keywords: Discretization, Univariate and multivariate, Merging criterion, Stopping criterion, Significance of interval pairs, Minimum Description Length Principle (MDLP)