Stream Data Mining Repository
This repository contains several
data stream files I collected from different sources. They are mainly used for
the research development and the algorithm assessment purposes. If you are interested
in donating data stream files or making comments to this webpage, please feel
free to drop me a note.
Reference to the repository: X. Zhu, Stream Data Mining Repository, http://www.cse.fau.edu/~xqzhu/stream.html,
2010.
[Sensor
Stream, 2,219,803 instances, 5 attributes, and 54 classes]
Sensor stream contains information
(temperature, humidity, light, and sensor voltage) collected from 54 sensors
deployed in Intel Berkeley
Research Lab. The whole stream contains consecutive information recorded
over a 2 months period (1 reading per 1-3 minutes). I used the sensor ID as the
class label, so the learning task of the stream is to correctly identify the
sensor ID (1 out of 54 sensors) purely based on the sensor data and the
corresponding recording time. While the data stream flow over time, so does the
concepts underlying the stream. For example, the lighting during the working
hours is generally stronger than the night, and the temperature of specific
sensors (conference room) may regularly rise during the meetings.
This picture was copied from the MIT Computer Science and Artificial
Intelligence Lab data repository
[Powersupply Stream,
29,928 instances, 2 attributes, and 24 classes]
Powersupply stream contains hourly power supply of an
This data stream was transferred
from Prof. Eamonn Keogh’s UCR Rime Series
Classification/Clustering Page. I simply transferred a time serious data as
a data stream for prediction.
[Kddcup99, 494,021
instances, 41 attributes, and 23 classes]
Kddcup99 stream was collected from the KDD CUP challenge in 1999,
and the task is to build predictive models capable of distinguishing between
intrusions and normal connections. Notice that this is NOT a real-world data
stream. I simply collected and transferred the data into ARFF format and treat
it as a data stream. Clearly, the instances in the stream do not flow in similar
way as the genuine stream data (I also randomized the file to sure that classes
are uniformly distributed).
[Hyper Plane Stream, 100,000
instances, 10 attributes, and 5 classes]
HyperP stream is a synthetic data stream
containing gradually evolving (drifting) concepts defined by Eq. (1), where the
value aj, j=1, 2,.., d, controls the shape of the decision surfaces, and the value f(x)
determines the class label of each instance x.
The concept drifting of the data streams is simulated and controlled through
the following parameters: (1) t,
controlling the magnitude of the concept drifting; (2) p, controlling the number of attributes whose weights are involved
in the change; and (3) h and gÎ{-1, 1}, controlling the weight
adjustment direction for attributes involved in the change. After the
generation of each instance x, ai is adjusted continuously
by g×t / M (as long as ai is involved in the concept drifting). Meanwhile,
after the generation of M instances,
there is an h percentage of chances
that weight change will inverse its direction, i.e., g = -g for all attributes ai involved in the change.
The Hyper Plane stream has five classes and 100,000 instances, each of which
contains d=10 dimensions. The concept
drifting involves p=5 attributes, and
attribute weights change with a magnitude of t=0.1 in every M=2000
instances and weight adjustment inverses the direction with h=20% of chance.
(1)