Data Quality and Integrity Assessment and Noise Impact Analysis

Class Noise vs Attribute Noise: A Quantitative Study of Their Impacts

Real-world data is never perfect and can often suffer from corruptions
(noise) that may impact interpretations of the data, models created from
the data and decisions made based on the data. Noise can reduce system
performance in terms of classification accuracy, time in building a classifier
and the size of the classifier. Accordingly, most existing learning algorithms
have integrated various approaches to enhance their learning abilities
from noisy environments, but the existence of noise can still introduce
serious negative impacts. A more reasonable solution might be to employ
some preprocessing mechanisms to handle noisy instances before a learner
is formed. Such enhancement could be achieved by adopting some
data cleansing procedures, such as eliminating noisy instances, predicting
unknown (or missing) attribute values, or correcting noisy values. These
methods are effective in their own scenarios, but some important issues
are still open, especially when we try to view noise in a systematic way
and attempt to design generic noise handling approaches. In fact, existing
mechanisms seem to have been developed without a thorough understanding of noise.
To design a good data quality enhancement tool, we believe the following
questions should be answered in advance to avoid developing a "blind"
approach, which cannot guarantee its performance all the time.

Unfortunately, little research has been conducted to systematically explore the impact of noise, especially from the noise handling point of view. This has limited the significance of various noise processing techniques, particularly when dealing with noise introduced in attributes. In this paper, we present a systematic evaluation of the effect of noise in machine learning. Instead of adopting a unified theory of noise to evaluate its impact, we differentiate noise into two categories, class noise and attribute noise, and analyze their impacts on system performance separately. Because class noise has been widely addressed in existing research efforts, we concentrate on attribute noise. We investigate the relationship between attribute noise and classification accuracy, the impact of noise at different attributes, and possible solutions for handling attribute noise. Our conclusions can guide interested readers in enhancing data quality by designing various noise handling mechanisms.

Our investigation of the noise impact has been conducted through the following aspects:

1. Impact of Class Noise
2. Effects of Attribute Noise with Classification Accuracy
3. Experimental Evaluations from Partially Cleaned Noisy Datasets
4. Impact of Attribute Noise from Different Attributes
5. Attribute Noise vs Class Noise: Which Is More Harmful?
6. Discussion on Attribute Noise Handling
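
The distinction between class noise and attribute noise can be illustrated with a minimal sketch. It is not the procedure used in the paper: the function names `add_class_noise` and `add_attribute_noise`, the `level` parameter, and the uniform random replacement scheme are all assumptions made for illustration. Class noise corrupts the label of an instance, while attribute noise corrupts the value of one chosen attribute.

```python
import random


def add_class_noise(labels, classes, level, rng):
    """Return a copy of `labels` where each label is replaced by a
    different class with probability `level` (hypothetical scheme)."""
    noisy = []
    for y in labels:
        if rng.random() < level:
            # Pick uniformly among the other classes.
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy


def add_attribute_noise(rows, attr_index, values, level, rng):
    """Return a copy of `rows` where the attribute at `attr_index` is
    replaced by a different value from `values` with probability `level`
    (hypothetical scheme). The input rows are left unmodified."""
    noisy = []
    for row in rows:
        row = list(row)  # copy so the original instance is untouched
        if rng.random() < level:
            current = row[attr_index]
            row[attr_index] = rng.choice([v for v in values if v != current])
        noisy.append(row)
    return noisy


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so runs are repeatable
    rows = [[0, 1], [1, 1], [0, 0], [1, 0]]
    labels = ["pos", "pos", "neg", "neg"]
    print(add_class_noise(labels, ["pos", "neg"], 0.25, rng))
    print(add_attribute_noise(rows, 0, [0, 1], 0.25, rng))
```

Training the same learner on the output of each function at increasing values of `level`, and comparing accuracy on a clean test set, is one way to study the two kinds of noise separately.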

Figure 1. A systematic analysis in handling attribute noise

The conclusions from our experiments can be summarized as follows:

Data Quality and Integrity Assessment

This is an ongoing research project. Given a real-world dataset D, my intention in conducting this part of the research is to answer the following questions:

1. Is the dataset D free of noise (class noise and attribute noise)? If not, what is the noise level in D?
2. Given a real-world dataset, which instances are responsible for the poor data quality (or integrity) of D? In addition, how can we produce a list in which the instances are ranked by their impact on the quality of the database, so that the database manager can pay more attention to the higher-ranked instances?
3. Given a real-world dataset D and a data mining approach, say method M1, we can always produce some results by applying M1 to D. The problem is that, from the results themselves, we do not actually know whether they are good enough. In other words, it is not clear whether we could adopt other data mining methods to get better results, or apply some data preprocessing approaches to D so that M1 produces better results.

Motivated by the above concerns, I have implemented several solutions and obtained some preliminary results (pending conference or journal submission). If you have any interest in or comments on this work, please contact me; I will be more than happy to share more results.