Cost-constrained Data Acquisition for Data Quality Enhancement

 

Cost-Constrained Data Acquisition for Intelligent Data Preparation

Data mining techniques have been widely employed in various applications. A general mining procedure usually involves the following four steps: (1) collecting data, (2) transforming/cleansing the data, (3) model building, and (4) model deployment/monitoring. Most practitioners have agreed that data preparation, in steps (1) and (2), consumes the majority of the data mining cycle time and budget. As a result, intelligent data preparation has attracted a significant attention, in which data integrity and quality control issues play more and more important roles for effective mining. A common real-world problem is that many data mining applications are characterized by the collection of incomplete data in which some of the attribute values are missing (due to reasons such as privacy, data ownership or unavailability) but can be acquired at a cost. For example, a patient's record in hospital A may have the diagnoses of one particular disease missing, but this information is likely available in the patient's record in hospital B, and therefore can be acquired at a cost. Similarly, the credit card companies have data on customer transactions with their cards, but usually do not have data on the same customer's transactions with other cards. An important issue here for intelligent data preparation is how to acquire incomplete data items for better data quality, integrity, and consequently better results.

To build accurate predictive models, data acquisition is usually adopted to prepare the data and complete missing values in a dataset. However, due to the significant cost of doing so and the inherent correlations in the dataset, acquiring correct information for all instances is prohibitive and unnecessary. An interesting and important problem that arises here is to select what kinds of instances to complete so the model built from the processed data can receive the "maximum" performance improvement. This problem is complicated by the reality that the costs associated with the attributes are different, and fixing the missing values of some attributes is inherently more expensive than others. Therefore, the problem becomes that given a fixed budget, what kinds of instances should be selected for preparation, so that the learner built from the processed dataset can maximize its performance.

Intuitively, we believe that three important issues should be taken into consideration when designing cost-constrained data acquisition algorithms:
(1) Missing values with different attributes are inherently different, in terms of costs and their relevance to the target concept. We should pay more attention to the economical attributes, which are important for classification but cost relatively less.
(2) Some incomplete instances may already contain enough information for model construction, and putting efforts to them is unlikely to receive much improvement.
(3) Some missing values in an instance might be predictable by other instances, so we may put less effort to them when acquiring data for completion.

In this paper, we propose a solution for this problem, and the essential idea is to combine attribute costs and the relevance of each attribute to the target concept, so that the data acquisition can pay more attention to those attributes that are cheap in price but informative for classification. To this end, we will first introduce a unique Economical Factor (EF) that seamlessly integrates the cost and the importance (in terms of classification) of each attribute. Then we will propose a cost-constrained data acquisition model, where active learning, missing value prediction and impact-sensitive instance ranking are combined for effective data acquisition. Experimental results and comparative studies from real-world datasets demonstrate the effectiveness of our method.

Cost-constrained Data Acquisition with Active EF-Sensitive Instance Ranking

The pseudocode of AERA is shown in Fig. 1. It consists of three major steps: (1) Active instance selection; (2) Missing value prediction; and (3) Instance ranking for data acquisition. In the first step, we borrow the essential idea of active learning to select incomplete instances that cannot be effectively classified by a benchmark classifier, because those instances likely confuse the learner and deserve more attention. In the second step, for each selected incomplete instance Ik from step 1, we try to predict its missing values by using the Attribute Prediction mechanism (Section 4.2.2). If a missing value in Ik is predictable by others, we need not take this attribute into consideration when calculating the total Economical Factor value of Ik, EF(Ik). After that, we conduct the third step (instance ranking) by considering that for each incomplete instance, Ik, how many attributes contain missing values, and among all these missing values, how many of them are actually predictable by others. We use all this information to calculate the EF value of Ik, and rank Ik based on its EF value. The data acquisition is conducted by selecting instances from the ranked list.

The essential ideas that motivate the design of our algorithm, and also the novel features that distinguish our work from existing approaches, include three components: (1) we seamlessly combine the cost and the discrimination efficiency of each attribute into a unique Economical Factor to explore the economical attributes (informative but cheap) for effective data acquisition; (2) use the basic idea of active learning to select incomplete instances which likely confuse the learning theory to enhance the system performance; and (3) adopt a missing attribute value prediction to avoid introducing redundancy from information completion. The experimental results from 12 real-world datasets have demonstrated the effectiveness of the proposed approach. Our experimental results concluded that in the context that fixing attributes comes with different prices, the existing data acquisition mechanisms become less efficient and can totally fail, especially when the attribute cost-ratio (the cost ratio between the most and the least expensive attributes) becomes relatively large. With the proposed AERA algorithm, which integrates the attribute cost and discrimination efficiency, we can significantly improve the accuracy of the models built from the processed dataset, compared with several benchmark approaches.