Data Quality and Integrity Assessment and Noise Impact Analysis

 

Class Noise vs Attribute Noise: A Quantitative Study of Their Impacts

Real-world data is never perfect and often suffers from corruption (noise) that can affect interpretations of the data, models built from it, and decisions based on it. Noise can degrade system performance in terms of classification accuracy, the time needed to build a classifier, and the size of the classifier. Accordingly, most existing learning algorithms integrate some mechanism for learning in noisy environments, but noise can still have a serious negative impact. A more reasonable solution might be to apply preprocessing that handles noisy instances before a learner is built, using data cleansing procedures such as eliminating noisy instances, predicting unknown (or missing) attribute values, or correcting noisy values. These methods are effective in their own scenarios, but important issues remain open, especially when we try to view noise systematically and design generic noise handling approaches. Existing mechanisms seem to have been developed without a thorough understanding of noise. To design a good data quality enhancement tool, we believe the following questions should be answered first, to avoid a "blind" approach whose performance cannot be guaranteed.
1. What is noise in machine learning? What is the inherent relationship between noise and data quality?
2. What are the features of noise, and what is their impact on system performance?
3. What is a general solution for handling noise (especially attribute noise)? Why does it work?

Unfortunately, little research has systematically explored the impact of noise, especially from the noise handling point of view. This gap weakens the case for various noise processing techniques, particularly those dealing with noise introduced in attributes. In this paper, we present a systematic evaluation of the effect of noise in machine learning. Instead of adopting a unified theory of noise, we divide noise into two categories, class noise and attribute noise, and analyze their impact on system performance separately. Because class noise has been widely addressed in existing research, we concentrate on attribute noise. We investigate the relationship between attribute noise and classification accuracy, the impact of noise in different attributes, and possible solutions for handling attribute noise. Our conclusions can guide interested readers in enhancing data quality by designing their own noise handling mechanisms.
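
To make the distinction between the two categories concrete, both kinds of noise can be simulated on a small categorical dataset. The following is a minimal sketch and not necessarily the corruption scheme used in the experiments: the function names, the uniform replacement rule, and the toy data are all assumptions made for illustration.

```python
import random

def inject_class_noise(labels, rate, rng):
    """Class noise: with probability `rate`, replace a label by a different class."""
    classes = sorted(set(labels))
    noisy = []
    for y in labels:
        if len(classes) > 1 and rng.random() < rate:
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy

def inject_attribute_noise(rows, attr, rate, rng):
    """Attribute noise: with probability `rate`, replace the value in column
    `attr` by a different value observed in that column."""
    domain = sorted({r[attr] for r in rows})
    noisy = []
    for r in rows:
        r = list(r)
        if len(domain) > 1 and rng.random() < rate:
            r[attr] = rng.choice([v for v in domain if v != r[attr]])
        noisy.append(tuple(r))
    return noisy

# Hypothetical toy data: two categorical attributes and a binary class.
rows = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")] * 25
labels = ["no", "yes", "yes", "no"] * 25

rng = random.Random(0)  # fixed seed so the corruption is repeatable
noisy_labels = inject_class_noise(labels, 0.2, rng)
noisy_rows = inject_attribute_noise(rows, 0, 0.2, rng)

print("labels changed:", sum(a != b for a, b in zip(labels, noisy_labels)))
print("rows changed:  ", sum(a != b for a, b in zip(rows, noisy_rows)))
```

Corrupting one attribute at a time, as the `attr` argument allows, is what makes it possible to compare the impact of noise in different attributes.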

Our investigation of the impact of noise covers the following aspects:

1. Impact of Class Noise

2. Effects of Attribute Noise with Classification Accuracy

3. Experimental Evaluations from Partially Cleaned Noisy Datasets

4. Impact of Attribute Noise from Different Attributes

5. Attribute Noise vs Class Noise: Which Is More Harmful?

6. Discussion on Attribute Noise Handling

Figure 1. A systematic analysis in handling attribute noise

The conclusions from our experiments can be summarized as follows:
1. Eliminating instances containing class noise will likely enhance the classification accuracy.
2. Compared with class noise, attribute noise is usually less harmful, but it can still cause severe problems for learning algorithms.
3. When handling attribute noise, noise correction will likely enhance the accuracy of learned classifiers.
4. Compared with handling noise in the training set, cleaning noise from the test set usually brings more benefit (in terms of classification accuracy), even if the classifier is learned from a noise-corrupted training set without any noise handling.
5. When noise handling on the test set is not allowed, cleaning attribute noise from the training set will still likely enhance classification accuracy, whether or not the test set contains noise.
6. In most situations, noise in different attributes affects system performance differently. The higher the correlation between an attribute and the class, the more negative the impact of noise in that attribute may be. Accordingly, a noise handling mechanism need not treat every attribute, and handling noise on noise-sensitive attributes is more important.
7. To identify and correct attribute noise, we can use learning algorithms to learn a noise filter. However, analyzing correlations among attributes in advance is necessary in this case: it indicates whether a specific attribute is predictable from the other attributes and the class, because an attribute with low correlation with the others simply cannot be predicted by any learning algorithm.
8. More experiments should be conducted on identifying and correcting those attributes that have low correlations with others.
With these conclusions, instead of adopting some "blind" noise handling mechanisms, interested readers can design their own noise handling approaches to enhance data quality from their own perspectives.
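
Conclusion 6 suggests ranking attributes by how strongly they relate to the class before deciding where to spend noise handling effort. The sketch below illustrates one way to do this for categorical data. It uses mutual information as a stand-in association measure, which is an assumption and not necessarily the correlation measure used in the experiments; the function names and toy data are likewise hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long categorical sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum of p(x, y) * log2(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

def rank_attributes(rows, labels):
    """Return (score, attribute index) pairs, most class-related attribute first."""
    scores = [(mutual_information([r[j] for r in rows], labels), j)
              for j in range(len(rows[0]))]
    return sorted(scores, reverse=True)

# Hypothetical toy data: attribute 1 determines the class, attribute 0 does not.
rows = [("sunny", "hot"), ("rainy", "cool"), ("sunny", "cool"), ("rainy", "hot")]
labels = ["no", "yes", "yes", "no"]

ranking = rank_attributes(rows, labels)
for score, j in ranking:
    print(f"attribute {j}: {score:.3f} bits")
```

On this toy data, attribute 1 ranks first, so it would be the candidate noise-sensitive attribute. The same `mutual_information` function can be applied to pairs of attributes to check, in the spirit of conclusion 7, whether an attribute is predictable from the others at all.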

Data Quality and Integrity Assessment

This is an ongoing research project. Given a real-world dataset D, my goal in this part of the research is to answer the following questions:

1. Is the dataset D noise free (with respect to both class noise and attribute noise)? If not, what is the noise level in D?

2. Which instances are responsible for the poor data quality (or integrity) of D? In addition, how can we produce a list of instances ranked by their impact on the quality of the database, so that the database manager can pay more attention to the higher-ranked instances?

3. Given a real-world dataset D and a data mining approach, say method M1, we can always produce some results by applying M1 to D. The problem is that from the results themselves, we do not actually know whether they are good enough. In other words, it is not clear whether other data mining methods could yield better results, or whether some data preprocessing on D would allow M1 to produce better results.
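
As a purely illustrative sketch of what an answer to questions 1 and 2 might look like, the code below ranks instances by how often their nearest neighbours disagree with their class label and derives a crude noise-level estimate from that ranking. This is not the approach implemented in this project: the nearest-neighbour heuristic, the Hamming distance, the 0.5 threshold, and all names are assumptions chosen for brevity.

```python
def hamming(a, b):
    """Number of attribute positions in which two instances differ."""
    return sum(x != y for x, y in zip(a, b))

def rank_suspicious(rows, labels, k=3):
    """Return (score, index) pairs, most suspicious first. The score is the
    fraction of the k nearest other instances whose label disagrees.
    Assumes at least two instances."""
    ranked = []
    for i, (row, y) in enumerate(zip(rows, labels)):
        others = sorted((hamming(row, rows[j]), j)
                        for j in range(len(rows)) if j != i)
        neighbours = [j for _, j in others[:k]]
        score = sum(labels[j] != y for j in neighbours) / len(neighbours)
        ranked.append((score, i))
    return sorted(ranked, key=lambda t: (-t[0], t[1]))

# Hypothetical toy data: instance 3 sits in the "p" cluster but is labelled "q".
rows = [("a", "x")] * 4 + [("b", "y")] * 4
labels = ["p", "p", "p", "q", "q", "q", "q", "q"]

ranking = rank_suspicious(rows, labels)
# Crude estimate (assumption): treat instances with score above 0.5 as noisy.
noise_level = sum(score > 0.5 for score, _ in ranking) / len(rows)

print("most suspicious instance:", ranking[0][1])
print("estimated noise level:", noise_level)
```

A heuristic of this kind cannot separate genuinely mislabelled instances from hard but correct ones, and it only targets class noise, so its ranking is best read as a list of candidates for manual review.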

Motivated by the above concerns, I have implemented several solutions and obtained some preliminary results (pending conference or journal submission). If you have any interest in or comments on this work, please contact me; I will be more than happy to share more results.