Identifying
and Localization Errors (Noise) Which Are Introduced to the Attributes
|
Error Detection and Impact-sensitive Instance Ranking in Noisy Datasets
Given a noisy dataset, how to locate erroneous instances and attributes, and how to rank suspicious instances by their impact on system performance, is an interesting and important research issue. In this paper we provide an Error Detection and Impact-sensitive instance Ranking (EDIR) mechanism to address this problem. Given a noisy dataset D, we first train a benchmark classifier T from D. The instances that cannot be effectively classified by T are treated as suspicious instances and forwarded to a subset S. For each attribute Ai, we swap the roles of Ai and the class label C to train a classifier APi that predicts Ai. Given an instance Ik in S, we use APi and the benchmark classifier T to locate the erroneous value of each attribute Ai. To rank the instances in S quantitatively, we define an impact measure based on the Information-gain Ratio (IR). We calculate IRi between attribute Ai and C, and use IRi as the impact-sensitive weight of Ai. The sum of the impact-sensitive weights of all located erroneous attributes of Ik gives its total impact value. The experimental results demonstrate the effectiveness of our strategies.
Our EDIR system consists of two major steps: Error Detection and Impact-sensitive Ranking. The system flowchart is depicted in Fig. 1, and the procedures are given in Figs. 2 and 3.
The novel features that distinguish our work from existing approaches are threefold: (1) we provided an error detection algorithm for both instances and attributes; (2) we explored a new research topic, impact-sensitive instance ranking, which can guide the data manager in improving data quality at minimal expense; and (3) by combining error detection with impact-sensitive ranking, we have constructed an effective data recommendation system. This system is more effective than the manual approach and more reliable than automatic correction algorithms.
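The impact-sensitive ranking step can be illustrated with a minimal Python sketch. It assumes discrete attributes and takes the set of erroneous attributes per suspicious instance as given (the error detection step itself is not reproduced here). The function names `entropy`, `gain_ratio`, and `rank_by_impact`, and the `suspected` mapping, are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(column, labels):
    """Information-gain ratio between one discrete attribute and the class."""
    total = len(labels)
    # Group the class labels by attribute value to get H(C | A).
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy(column)
    if split_info == 0:  # a constant attribute carries no information
        return 0.0
    return (entropy(labels) - conditional) / split_info


def rank_by_impact(rows, labels, suspected):
    """Rank suspicious instances by the summed weights of their flagged attributes.

    `suspected` (an assumed input) maps an instance index to the indices of the
    attributes that error detection located as erroneous for that instance.
    """
    n_attrs = len(rows[0])
    weights = [gain_ratio([r[i] for r in rows], labels) for i in range(n_attrs)]
    impact = {k: sum(weights[i] for i in attrs) for k, attrs in suspected.items()}
    ranking = sorted(impact.items(), key=lambda kv: kv[1], reverse=True)
    return ranking, weights


# Toy data: attribute 0 determines the class, attribute 1 is unrelated to it.
rows = [("a", "x"), ("a", "y"), ("b", "x"), ("b", "y")]
labels = [0, 0, 1, 1]
suspected = {0: [1], 2: [0], 3: [0, 1]}
ranking, weights = rank_by_impact(rows, labels, suspected)
```

In this toy dataset an error in attribute 0 outweighs an error in attribute 1, so instances 2 and 3 rank above instance 0. Under this reading, a data manager would inspect the top of `ranking` first.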
|
|
Dealing with Predictive-but-Unpredictable Attributes in Noisy Data Sources
Attribute noise can affect classification learning. Previous work in handling attribute noise has focused on those predictable attributes that can be predicted by the class and other attributes. However, attributes can often be predictive but unpredictable. Being predictive, they are essential to
The knowledge acquired by our study, although preliminary, is informative. We expect it to contribute to completing the picture of attribute noise handling. However, a single study seldom settles an issue once and for all, and more effort is needed to advance this research field. We name three topics here. First, whether an attribute is predictable or unpredictable is a matter of degree. In our current research, we deem an attribute unpredictable when its prediction accuracy is significantly lower than that of the class. Further research to work out more sophisticated thresholds or heuristics would be interesting. Second, although we take polishing as a straw man, sifting does not claim to outperform polishing for predictable attributes. Instead, sifting and polishing are parallel, each having its own niche. Hence it is sensible to combine them. A framework might be: (1) use feature selection to discard unpredictive attributes; (2) decide whether a predictive attribute is predictable; (3) if it is predictable, use polishing to handle its noise; and if it is unpredictable, use sifting to handle its noise. Lastly, it would be interesting to extend our research beyond classification learning, for example to association learning, where patterns exist but attributes are seldom predictable.
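The three-step framework proposed above can be sketched as a simple routing function in Python. This is only an illustration under stated assumptions: the function name `plan_noise_handling`, the relevance score, and the thresholds `min_relevance` and `margin` are all hypothetical, and the fixed margin merely stands in for the "significantly lower" test, which the text leaves open.

```python
def plan_noise_handling(attributes, class_accuracy, min_relevance=0.05, margin=0.10):
    """Assign each attribute to 'discard', 'polish', or 'sift'.

    `attributes` maps an attribute name to a pair
    (relevance score from feature selection, accuracy when predicting the attribute).
    Both thresholds are assumed values, not taken from the study.
    """
    plan = {}
    for name, (relevance, accuracy) in attributes.items():
        if relevance < min_relevance:
            # Step 1: unpredictive attributes are dropped by feature selection.
            plan[name] = "discard"
        elif accuracy < class_accuracy - margin:
            # Steps 2-3: predictive but unpredictable, so handle noise by sifting.
            plan[name] = "sift"
        else:
            # Steps 2-3: predictive and predictable, so handle noise by polishing.
            plan[name] = "polish"
    return plan


# Hypothetical scores for three attributes and a class accuracy of 0.88.
attributes = {"age": (0.30, 0.85), "zip": (0.01, 0.90), "income": (0.25, 0.40)}
plan = plan_noise_handling(attributes, class_accuracy=0.88)
```

Because predictability is a matter of degree, the hard cutoff used here is the part most likely to need replacing with a statistical test or a learned heuristic.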
|