ICHECK: Identifying Deception Data with Impact-Sensitive Instance Ranking


Sponsor: Vermont EPSCoR, under Grant NSF EPS-0236976.
Duration: January 1, 2005 - December 31, 2005.

Research Staff:

Principal Investigator: Dr. Xingquan Zhu

Co-Principal Investigator: Dr. Xindong Wu

Ph.D. Student: Yan Zhang

Project Summary:

Recent advances in high-performance computing and communication technologies have made it possible to acquire, provide, and share large volumes of computer-mediated information. These advances also bring the challenging task of identifying intentionally planted false information, commonly known as deception, in the information we receive, where the general purpose of planting deception is to create an incorrect impression or conclusion. There are many ways to deceive, such as lies, fabrications, concealments, and misdirections, and they are common, everyday occurrences. Unfortunately, although we all suffer directly or indirectly from the consequences of deception, false information is often difficult to detect. This problem has become especially significant for public safety, law enforcement, and national security agencies, given the huge volume of data in their databases and the urgency and importance of identifying suspicious data items.
The challenge of detecting deceptive information has become more significant after 9/11. Tremendous effort is needed to analyze messages, compare them with previous records, and identify suspicious information. Evidence from existing empirical studies shows that humans are typically very poor at detecting deception and fallacious information, especially when the messages come from text-based, computer-mediated channels, where the accuracy is little better than chance. Tools that facilitate human deception detection are therefore valuable, and such tools would also benefit law enforcement in dealing with national security and criminal investigation.
Existing efforts to identify deception focus mainly on two types of mechanisms: using electronic equipment such as lie detectors to detect abnormal behavior, and exploring linguistic cues that distinguish deception from truth in text-based messages. In this project, we propose ICHECK (Identifying deception with impact-sensitive instance ranking), a generic system framework for identifying deception in real-world datasets, where deception is defined as suspicious instances that do not comply with the other, normal instances. In addition to identifying deception, ICHECK will rank the identified suspicious instances by their impact on the dataset performance. The system framework of ICHECK is shown in Figure 1; it consists of two major steps: Deception Detection and Impact-sensitive Ranking.
Given a real-world dataset D, we will first check whether the dataset comes with a class label for each of its instances. If class labels already exist for the instances in D, we will train a benchmark classifier T from D. The instances that cannot be effectively classified by T will be treated as suspicious instances and forwarded to a subset S. For each data attribute Ai, we will switch Ai and the class label C to train a classifier APi that predicts Ai. Given an instance Ik in S, we will use APi and the benchmark classifier T to locate the deceptive value of each attribute Ai. To quantitatively rank the instances in S, we will define an impact measure based on the Information-gain Ratio (IR): we will calculate IRi between attribute Ai and C, and use IRi as the impact-sensitive weight of Ai. The sum of the impact-sensitive weights of all located deceptive attribute values of Ik will indicate its total impact value.
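The labeled-data procedure can be sketched in Python as follows. This is only an illustration under stated assumptions, not the project's implementation: the summary does not name a learning algorithm, so a small categorical naive Bayes stands in for both T and APi; T is evaluated on the same data it was trained on for brevity; and the rule used to "locate" a deceptive value (APi disagrees with the recorded value, and substituting APi's prediction lets T recover the recorded class) is one plausible reading of the text. The names NaiveBayes, gain_ratio, and icheck_rank are hypothetical.

```python
import math
from collections import Counter, defaultdict

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(rows, attr, target):
    """Information-gain ratio IR between column `attr` and column `target`."""
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r[target])
    conditional = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    gain = entropy([r[target] for r in rows]) - conditional
    split = entropy([r[attr] for r in rows])
    return gain / split if split > 0 else 0.0

class NaiveBayes:
    """Categorical naive Bayes with Laplace smoothing (assumed stand-in learner)."""
    def __init__(self, rows, target):
        self.features = [i for i in range(len(rows[0])) if i != target]
        self.n = len(rows)
        self.prior = Counter(r[target] for r in rows)
        self.counts = Counter((i, r[i], r[target]) for r in rows for i in self.features)
        self.n_values = {i: len({r[i] for r in rows}) for i in self.features}

    def predict(self, row):
        def score(label):
            s = math.log(self.prior[label] / self.n)
            for i in self.features:
                s += math.log((self.counts[(i, row[i], label)] + 1)
                              / (self.prior[label] + self.n_values[i]))
            return s
        return max(sorted(self.prior), key=score)

def icheck_rank(rows, target):
    """Return (impact, instance index, deceptive attribute indices), highest impact first.

    `rows` is a list of equal-length tuples of categorical values.
    """
    attrs = [i for i in range(len(rows[0])) if i != target]
    T = NaiveBayes(rows, target)                      # benchmark classifier
    AP = {i: NaiveBayes(rows, i) for i in attrs}      # A_i swapped with C
    weight = {i: gain_ratio(rows, i, target) for i in attrs}
    ranked = []
    for k, row in enumerate(rows):
        if T.predict(row) == row[target]:
            continue                                  # not suspicious
        flagged = []
        for i in attrs:
            guess = AP[i].predict(row)
            if guess != row[i]:
                repaired = row[:i] + (guess,) + row[i + 1:]
                # Assumed rule: flag A_i if the repair lets T recover the class.
                if T.predict(repaired) == row[target]:
                    flagged.append(i)
        ranked.append((sum(weight[i] for i in flagged), k, flagged))
    return sorted(ranked, key=lambda item: (-item[0], item[1]))
```

Calling icheck_rank(rows, target) with the index of the class column returns one entry per suspicious instance; an instance whose located deceptive attributes carry large information-gain ratios appears first.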
If dataset D does not have a class label for its records (which is likely common for databases constructed for non-classification purposes), we will train a classifier APi for each attribute Ai and take the APi with the highest classification accuracy as the benchmark classifier T to identify suspicious instances.
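For the unlabeled case, the attribute selection step might look like the sketch below. It is a sketch under assumptions: the summary does not say how classification accuracy is estimated, so leave-one-out accuracy is used here, and a small categorical naive Bayes again stands in for the unspecified learner. The function names nb_predict, loo_accuracy, and pick_benchmark_attribute are hypothetical.

```python
import math
from collections import Counter

def nb_predict(train, target, row):
    """Predict column `target` of `row` with a small categorical naive Bayes."""
    features = [i for i in range(len(row)) if i != target]
    prior = Counter(r[target] for r in train)
    counts = Counter((i, r[i], r[target]) for r in train for i in features)
    n_values = {i: len({r[i] for r in train}) for i in features}

    def score(label):
        s = math.log(prior[label] / len(train))
        for i in features:
            s += math.log((counts[(i, row[i], label)] + 1)
                          / (prior[label] + n_values[i]))
        return s
    return max(sorted(prior), key=score)

def loo_accuracy(rows, target):
    """Leave-one-out accuracy of predicting column `target` (assumed estimator)."""
    hits = 0
    for k, row in enumerate(rows):
        train = rows[:k] + rows[k + 1:]
        hits += nb_predict(train, target, row) == row[target]
    return hits / len(rows)

def pick_benchmark_attribute(rows):
    """Return the column whose predictor AP_i is most accurate, plus all accuracies."""
    acc = [loo_accuracy(rows, i) for i in range(len(rows[0]))]
    best = max(range(len(acc)), key=acc.__getitem__)  # first column wins ties
    return best, acc
```

The selected column then plays the role of the class label C, and the remaining steps proceed as in the labeled case.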
The essential goal of ICHECK is to identify and rank deception data in general datasets. The merits of ICHECK in comparison with existing outlier detection methods come from two aspects: (1) ICHECK identifies deception data at the attribute level and will explore the suspicious attribute values that cause an instance to be deceptive; and (2) in addition to identifying deception, ICHECK also ranks deception data based on their impact on the system.
A significant advantage of ICHECK is that it can be applied to an existing dataset, and no extra effort is required to select deception cues (attributes that likely determine whether an instance contains deception). When adopting this data mining approach for deception identification, it is useful if a class label (category information) has been marked for each instance; if a dataset does not have such label information, the proposed ICHECK will automatically select the most informative attribute as the class label and proceed with deception detection. Results from this research can be of significance to the following two topics:
1. A generic deception detection approach with which deceptive attribute values and class labels can be identified without the necessity of reconstructing or reorganizing the database.
2. An effective impact-sensitive instance ranking approach for ranking suspicious instances based on their negative impacts on the system, so that, given a limited budget (e.g., processing time), the data manager can give priority to instances with higher impacts (because they might be more dangerous than others).
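As a small illustration of the second topic, the sketch below shows one way a data manager might spend a fixed budget on ranked instances. The greedy policy, the per-instance cost field, and the function name prioritize are assumptions made for this example; the summary only states that higher-impact instances should be handled first.

```python
def prioritize(instances, budget):
    """Pick instances to inspect, highest impact first, within `budget`.

    `instances` is a list of (instance_id, impact, cost) tuples; the cost
    unit (e.g., processing time) is whatever the budget is expressed in.
    """
    chosen, spent = [], 0.0
    for inst_id, impact, cost in sorted(instances, key=lambda t: -t[1]):
        if spent + cost <= budget:   # skip anything that no longer fits
            chosen.append(inst_id)
            spent += cost
    return chosen, spent
```

With impacts taken from the ranking step, this processes the most damaging suspicious instances first and stops spending once the budget is exhausted.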

       

Publications:

  1. Xingquan Zhu and Xindong Wu, "Cost-Constrained Data Acquisition for Intelligent Data Preparation", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 11, pp. 1542-1556, 2005.

  2. Yan Zhang, Xingquan Zhu, Xindong Wu, and Jeffrey P. Bond, "ACE: An Aggressive Classifier Ensemble with Error Detection, Correction and Cleansing", in Proceedings of the 17th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2005), Hong Kong, November 14-16, 2005.

  3. Xingquan Zhu and Xindong Wu, "Data Acquisition with Active Impact-Sensitive Instance Selection", in Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2004), Boca Raton, FL, November 15-17, 2004.

  4. Xingquan Zhu, Xindong Wu, and Ying Yang, "Error Detection and Impact-sensitive Instance Ranking in Noisy Datasets", in Proceedings of the 19th National Conference on Artificial Intelligence (AAAI-04), San Jose, California, July 25-29, 2004.