Stream Data Mining Repository

Stream Data Mining Repository

This repository contains several data stream files I collected from different sources. They are mainly used for the research development and the algorithm assessment purposes. If you are interested in donating data stream files or making comments to this webpage, please feel free to drop me a note.

Reference to the repository: X. Zhu, Stream Data Mining Repository, http://www.cse.fau.edu/~xqzhu/stream.html, 2010.

[Sensor Stream, 2,219,803 instances, 5 attributes, and 54 classes]

Sensor stream contains information (temperature, humidity, light, and sensor voltage) collected from 54 sensors deployed in Intel Berkeley Research Lab. The whole stream contains consecutive information recorded over a 2 months period (1 reading per 1-3 minutes). I used the sensor ID as the class label, so the learning task of the stream is to correctly identify the sensor ID (1 out of 54 sensors) purely based on the sensor data and the corresponding recording time. While the data stream flow over time, so does the concepts underlying the stream. For example, the lighting during the working hours is generally stronger than the night, and the temperature of specific sensors (conference room) may regularly rise during the meetings.

This picture was copied from the MIT Computer Science and Artificial Intelligence Lab data repository

[Powersupply Stream, 29,928 instances, 2 attributes, and 24 classes]

Powersupply stream contains hourly power supply of an Italy electricity company which records the power from two sources: power supply from main grid and power transformed from other grids. This stream contains three year power supply records from 1995 to 1998, and our learning task is to predict which hour (1 out of 24 hours) the current power supply belongs to. The concept drifting in this stream is mainly driven by the issues such as the season, weather, hours of a day (e.g., morning and evening), and the differences between working days and weekend.

This data stream was transferred from Prof. Eamonn Keogh’s UCR Rime Series Classification/Clustering Page. I simply transferred a time serious data as a data stream for prediction.

[Kddcup99, 494,021 instances, 41 attributes, and 23 classes]

Kddcup99 stream was collected from the KDD CUP challenge in 1999, and the task is to build predictive models capable of distinguishing between intrusions and normal connections. Notice that this is NOT a real-world data stream. I simply collected and transferred the data into ARFF format and treat it as a data stream. Clearly, the instances in the stream do not flow in similar way as the genuine stream data (I also randomized the file to sure that classes are uniformly distributed).

[Hyper Plane Stream, 100,000 instances, 10 attributes, and 5 classes]

HyperP stream is a synthetic data stream containing gradually evolving (drifting) concepts defined by Eq. (1), where the value a_j, j=1, 2,.., d, controls the shape of the decision surfaces, and the value f(x) determines the class label of each instance x. The concept drifting of the data streams is simulated and controlled through the following parameters: (1) t, controlling the magnitude of the concept drifting; (2) p, controlling the number of attributes whose weights are involved in the change; and (3) h and gÎ{-1, 1}, controlling the weight adjustment direction for attributes involved in the change. After the generation of each instance x, a_i is adjusted continuously by g×t / M (as long as a_i is involved in the concept drifting). Meanwhile, after the generation of M instances, there is an h percentage of chances that weight change will inverse its direction, i.e., g = -g for all attributes a_i involved in the change. The Hyper Plane stream has five classes and 100,000 instances, each of which contains d=10 dimensions. The concept drifting involves p=5 attributes, and attribute weights change with a magnitude of t=0.1 in every M=2000 instances and weight adjustment inverses the direction with h=20% of chance.

(1)