Identifying Mislabeled Training Examples for Effective Learning

Class Noise Identification from Large Noisy Datasets

To clean mislabeled examples from a training dataset for efficient and effective induction, most existing approaches adopt a major-set-based scheme: the training dataset is separated into two parts (a major set and a minor set), and classifiers learned from the major set are used to identify noise in the minor set. Such a scheme has two obvious drawbacks: (1) for large datasets, loading the major set into memory for inductive learning is either physically impossible or too time-consuming; and (2) for distributed datasets, downloading data from other sites can be either technically infeasible or forbidden by policy (for security or privacy reasons). These approaches are therefore severely limited when dealing with large, distributed datasets.

In this project, we present a new approach for identifying and eliminating mislabeled data items in large or distributed datasets. We first partition a dataset into subsets, each of which is small enough to be processed by an induction algorithm at one time. We construct good rules from each subset, and use the good rules to evaluate the whole dataset. For a given instance I_k, two error count variables record the number of times it has been identified as noise across the data subsets. An instance with higher error counts has a higher probability of being a mislabeled example. Two threshold schemes, majority and non-objection, are used to identify and eliminate the noisy examples. Experimental results and comparative studies on both real-world and synthetic datasets are reported to evaluate the effectiveness and efficiency of the proposed approach.
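The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the project's implementation: a nearest-centroid classifier on a single numeric feature stands in for the rule learner, abstaining when two centroids are nearly tied stands in for "good rules" that only classify data they are confident about, and the two counters (`votes`, `errors`) are a simplified reading of the two error count variables. The function names, the `margin` parameter, and the exact threshold definitions are all hypothetical.

```python
from collections import defaultdict


def train_centroids(subset):
    """Stand-in learner (assumption): per-class mean of one numeric feature."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in subset:
        sums[y] += x
        counts[y] += 1
    return {y: sums[y] / counts[y] for y in sums}


def predict(centroids, x, margin):
    """Return the nearest class, or None to abstain when the call is too close."""
    ranked = sorted(centroids, key=lambda y: abs(x - centroids[y]))
    if len(ranked) > 1:
        gap = abs(x - centroids[ranked[1]]) - abs(x - centroids[ranked[0]])
        if gap < margin:
            return None  # not confident enough to vote on this instance
    return ranked[0]


def partition_filter(data, n_subsets, scheme="majority", margin=0.5):
    """Return indices of instances flagged as mislabeled."""
    # 1. Partition the dataset into small subsets.
    subsets = [data[i::n_subsets] for i in range(n_subsets)]
    # 2. Learn one model per subset (each acts as a committee member).
    models = [train_centroids(s) for s in subsets]
    noisy = []
    # 3. Evaluate the whole dataset with every subset's model.
    for idx, (x, y) in enumerate(data):
        votes = errors = 0
        for model in models:
            pred = predict(model, x, margin)
            if pred is None:
                continue
            votes += 1
            if pred != y:
                errors += 1
        # 4. Apply a threshold scheme (definitions assumed for this sketch).
        if scheme == "majority":
            flagged = votes > 0 and errors > votes / 2
        else:  # "non_objection": every voting member calls it noise
            flagged = votes > 0 and errors == votes
        if flagged:
            noisy.append(idx)
    return noisy


# Toy data: class 'a' near 0, class 'b' near 10; index 8 is mislabeled.
data = [
    (0.0, "a"), (0.4, "a"), (0.8, "a"),
    (10.0, "b"), (9.6, "b"), (9.2, "b"),
    (1.0, "a"), (0.6, "a"), (0.2, "b"),
    (9.0, "b"), (9.4, "b"), (9.8, "b"),
]
print(partition_filter(data, 3, "majority"))       # [8]
print(partition_filter(data, 3, "non_objection"))  # [8]
```

Because each model is trained on one small subset only, no step needs the full dataset in memory for induction, which is the property the approach relies on for large or distributed data.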

Three novel features distinguish our proposed framework from existing research efforts. First, our strategy conducts noise elimination from a minor set, which makes it better suited to large or distributed datasets. Second, unlike other strategies, where classifiers induced from the major set are used directly to identify noise, we treat each subset as a committee member for noise identification and let it classify only the data about which it is confident. Third, while existing multi-round noise elimination mechanisms remove only the identified noise, we remove both identified noisy examples and good instances in each round. Experimental evaluations and comparative studies have shown that our proposed approach identifies noise effectively and robustly and improves classification accuracy.

Figure 1. Noise elimination from large datasets (Partitioning Filter)

Cost-guided Class Noise Handling for Effective Cost-sensitive Learning

Recent research in machine learning, data mining and related areas has produced a wide variety of algorithms for cost-sensitive (CS) classification, where instead of maximizing the classification accuracy, minimizing the misclassification cost becomes the objective. However, these methods assume that training sets do not contain significant noise, which is rarely the case in real-world environments.

For normal (non-CS) learning algorithms, noise harms trained classifiers in several ways: it decreases classification accuracy and increases training time and tree size. Accordingly, learning in noisy environments with non-CS algorithms has received much attention in machine learning, and most inductive learning algorithms have a mechanism for handling noise. For example, pruning in decision trees is designed to reduce the chance that the trees overfit to noise. As suggested by Gamberger et al., handling noise in the data before hypothesis formation has the advantage that noise does not influence hypothesis construction. Accordingly, when learning from noisy datasets, a logical way to enhance the learners is to cleanse the noise in some way. Nevertheless, all these conclusions were drawn for normal inductive learning, where maximizing classification accuracy is the goal. The objective of CS learning, however, is to minimize the misclassification cost, and accuracy becomes less important, especially when misclassifying some classes is much more expensive than misclassifying others, i.e., C(i, j) >> C(j, i), i ≠ j, or vice versa. A simple analysis may suggest that noise has less impact on CS learning, because decreasing the accuracy does not necessarily increase the misclassification cost of a CS classifier (given that a CS learner usually sacrifices accuracy for minimal costs). Unfortunately, little research has systematically addressed the behavior of CS learners in noisy environments, which leaves noise handling for effective CS learning an open problem.
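The point that lower accuracy need not mean higher misclassification cost can be checked with a small worked example. The sketch below is illustrative only: it assumes C(i, j) denotes the cost of classifying a class-i instance as class j, and the confusion matrices and cost values are made-up numbers, not results from this work.

```python
def accuracy(confusion):
    """Fraction of instances on the diagonal of the confusion matrix."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def total_cost(confusion, cost):
    """Sum of count * cost over all off-diagonal (misclassified) cells."""
    n = len(confusion)
    return sum(
        confusion[i][j] * cost[i][j]
        for i in range(n)
        for j in range(n)
        if i != j
    )


# Assumed convention: confusion[i][j] = number of class-i instances predicted
# as class j, and cost[i][j] = C(i, j). Class 1 is the expensive class:
# C(1, 0) = 10 while C(0, 1) = 1.
cost = [[0, 1],
        [10, 0]]

# Hypothetical accuracy-oriented classifier: misses most of class 1.
clf_a = [[90, 0],
         [8, 2]]

# Hypothetical cost-sensitive classifier: more errors, but cheap ones.
clf_b = [[75, 15],
         [1, 9]]

print(accuracy(clf_a), total_cost(clf_a, cost))  # 0.92 80
print(accuracy(clf_b), total_cost(clf_b, cost))  # 0.84 25
```

The second classifier is less accurate yet incurs less than a third of the cost, which is why accuracy-based conclusions about noise do not transfer automatically to CS learning.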

In this paper, we systematically study the impact of class noise on CS learning, and propose a cost-guided class noise handling algorithm to identify noise for effective CS learning (Figure 3). We call it the Cost-guided Iterative Classification Filter (CICF), because it seamlessly integrates costs with an existing Classification Filter (Figure 2) for noise identification. Instead of weighting noise in all classes equally, as existing efforts do, CICF puts more emphasis on expensive classes, which makes it especially successful on datasets with a large cost ratio. Experimental results and comparative studies on real-world datasets indicate that noise may seriously corrupt the performance of CS classifiers, and that adopting the proposed CICF algorithm can significantly reduce the misclassification cost of a CS classifier in noisy environments.

Figure 2. Classification Filter

Figure 3. Cost-guided Iterative Classification Filter