ICHECK:
Identifying Deception Data with Impact-Sensitive Instance Ranking
Sponsor: Vermont EPSCoR, under Grant NSF EPS-0236976.
Duration: January 1, 2005 - December 31, 2005.
Research Staff:
Principal Investigator: Dr. Xingquan Zhu
Co-Principal Investigator: Dr. Xindong Wu
Ph.D. Student: Yan Zhang
Recent advances in high-performance computing and communication
technologies have made it possible to acquire, provide, and share large volumes
of computer-mediated information. They also bring the challenging task of
identifying intentionally planted false information, commonly known as deception,
in the information we receive, where the general purpose of planting deception
is to create an incorrect impression or conclusion. There are many possible ways
to deceive, such as lies, fabrications, concealments, and misdirections, and they
are common, everyday occurrences. Unfortunately, although we
all directly or indirectly suffer from the consequences of deception, it is
often difficult to detect false information. This problem has become especially
significant for public safety, law enforcement, and national security agencies,
given the huge volume of data in their databases and the urgency and importance
of identifying suspicious data items.
The challenge of detecting deceptive information has become
more significant since 9/11. Tremendous effort is needed to analyze messages,
compare them with previous records, and single out suspicious information.
Evidence from existing empirical studies shows that humans are typically very
poor at detecting deception and fallacious information, especially when the
messages come from text-based, computer-mediated channels, where
accuracy is little better than chance. Tools that facilitate human deception
detection are therefore valuable, and such tools would also benefit law enforcement
in dealing with national security and criminal investigation.
To identify deception, existing efforts mainly focus on two
types of mechanisms: using electronic equipment such as lie detectors to detect
abnormal behaviors, and exploring linguistic cues to distinguish deception
from truth in text-based messages. In this project, we propose ICHECK (Identifying
deception with impact-sensitive instance ranking), a generic
system framework for identifying deception in real-world datasets, where deception
is defined as the suspicious instances that do not comply with the other, normal
instances. In addition to identifying deception, ICHECK will also
rank the identified suspicious instances based on their impact on the dataset
performance. The system framework of ICHECK is shown in Figure 1; it
consists of two major steps: Deception Detection and Impact-sensitive Ranking.
Given a real-world dataset D, we will first check whether the dataset
comes with a class label for each of its instances. If class labels
already exist for the instances in D, we will train a benchmark classifier T
from D. The instances that cannot be effectively classified by T will be
treated as suspicious instances and forwarded to a subset S. For each data
attribute Ai, we will swap the roles of Ai and the class label C to train a
classifier APi that predicts Ai. Given an instance Ik in S, we will use APi and
the benchmark classifier T to locate the deceptive value of each attribute Ai.
To quantitatively rank instances in S, we will define an impact measure based
on the Information-gain Ratio (IR). We will then calculate IRi between attribute
Ai and C, and use IRi as the impact-sensitive weight of Ai. The sum of the
impact-sensitive weights of all located deceptive attribute values of Ik will
indicate its total impact value.
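The ranking step can be illustrated with a minimal sketch. It assumes categorical attributes and the usual definition of the gain ratio (information gain divided by the split information of the attribute); the project description names the measure but gives no formula. The toy data, the function names, and the `flagged` mapping (which attributes were located as deceptive for each suspicious instance) are hypothetical; how APi and T locate those values is not shown here.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(attr_values, labels):
    """Information-gain ratio IR_i between one attribute A_i and the class C."""
    n = len(labels)
    # Group the class labels by attribute value to get H(C | A_i).
    groups = {}
    for a, c in zip(attr_values, labels):
        groups.setdefault(a, []).append(c)
    cond_entropy = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - cond_entropy
    split_info = entropy(attr_values)
    return info_gain / split_info if split_info > 0 else 0.0

def impact_scores(weights, flagged):
    """Total impact of each instance: sum of weights of its flagged attributes."""
    return {k: sum(weights[a] for a in attrs) for k, attrs in flagged.items()}

# Hypothetical toy dataset: A1 determines the class, A2 is uninformative.
rows = [
    {"A1": "x", "A2": "p"},
    {"A1": "x", "A2": "q"},
    {"A1": "y", "A2": "p"},
    {"A1": "y", "A2": "q"},
]
labels = ["pos", "pos", "neg", "neg"]

# Impact-sensitive weight of each attribute.
weights = {a: gain_ratio([r[a] for r in rows], labels) for a in ("A1", "A2")}

# Assumed output of the detection step: instance index -> deceptive attributes.
flagged = {0: ["A1"], 1: ["A1", "A2"], 2: ["A2"]}

scores = impact_scores(weights, flagged)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy case a suspicious value in A1 contributes far more to an instance's impact than one in A2, because A1 carries all of the information about the class.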
In the case that dataset D does not have class labels for
its records (which is likely common for databases constructed for
non-classification purposes), we will train a classifier APi for each attribute
Ai, and take the APi with the highest classification accuracy as the benchmark
classifier T to identify suspicious instances.
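A minimal sketch of this selection for the unlabeled case follows. The learner is an assumption: the description does not say which classifier is used for APi, so a leave-one-out 1-nearest-neighbour predictor over Hamming distance stands in for it. The attribute names and toy data are hypothetical.

```python
def loo_accuracy(rows, target):
    """Leave-one-out accuracy when `target` plays the role of the class label.

    A 1-nearest-neighbour predictor is a stand-in for AP_i here.
    """
    others = [a for a in rows[0] if a != target]
    correct = 0
    for i, r in enumerate(rows):
        # Closest other row by Hamming distance on the remaining attributes
        # (ties go to the earliest row).
        nearest = min(
            (s for j, s in enumerate(rows) if j != i),
            key=lambda s: sum(s[a] != r[a] for a in others),
        )
        correct += nearest[target] == r[target]
    return correct / len(rows)

def pick_benchmark_attribute(rows):
    """Return the attribute whose predictor is most accurate, plus all accuracies."""
    acc = {a: loo_accuracy(rows, a) for a in rows[0]}
    return max(acc, key=acc.get), acc

# Hypothetical toy data: A3 = A1 AND A2, each (A1, A2) combination appears twice.
rows = [
    {"A1": a1, "A2": a2, "A3": a1 & a2}
    for a1 in (0, 1) for a2 in (0, 1) for _ in range(2)
]

best, acc = pick_benchmark_attribute(rows)
```

Here A3 is fully determined by the other two attributes, so its predictor is the most accurate and A3 would serve as the class label for the benchmark classifier T.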
The essential goal of ICHECK is to identify and rank deception
data in general datasets. The merits of ICHECK in comparison with existing
outlier detection methods come from the following two aspects: (1) ICHECK
identifies deception data at the attribute level and will explore the suspicious
attribute values that cause an instance to be deceptive; and (2) in addition
to identifying deception, ICHECK also ranks deception data based on their
impact on the system.
A significant advantage of ICHECK is that it can be applied
to an existing dataset, and no extra effort is required to select deception
cues (attributes that likely determine whether an instance contains deception).
When adopting this data mining approach for deception identification, it is
useful if a class label (category information) has been marked for each
instance. If a dataset does not have such label information, the proposed
ICHECK will automatically select the most informative attribute as the class
label and proceed with deception detection. Results from this research can
be of significance to the following two topics:
1. A generic deception detection approach with which deceptive
attribute values and class labels can be identified without the need
to reconstruct or reorganize the database.
2. An effective impact-sensitive instance ranking approach for
ranking suspicious instances based on their negative impact on the system,
so that, given a limited budget (e.g., processing time), the data manager can
give priority to instances with higher impact (because they might be more
dangerous than others).
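As a rough illustration of the second point, the sketch below picks instances in descending order of impact until a processing budget is used up. The per-instance costs, the greedy rule, and all names and numbers are assumptions made for the example; the project description does not prescribe a selection procedure.

```python
def prioritise(impacts, costs, budget):
    """Greedily select instances by descending impact within a cost budget."""
    chosen, spent = [], 0.0
    for k in sorted(impacts, key=impacts.get, reverse=True):
        # Skip an instance if reviewing it would exceed the remaining budget.
        if spent + costs[k] <= budget:
            chosen.append(k)
            spent += costs[k]
    return chosen

# Hypothetical impact values and review costs (e.g., minutes of processing time).
impacts = {"I1": 0.9, "I2": 0.4, "I3": 0.7}
costs = {"I1": 2.0, "I2": 1.0, "I3": 2.0}

selected = prioritise(impacts, costs, budget=3.0)
```

With these numbers the highest-impact instance I1 is reviewed first, I3 no longer fits in the budget, and the cheaper I2 is taken instead.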
