This page provides students in CAP 6673: Data Mining and Machine Learning with all the necessary information for the course.

E-mail

Preston Billion Polak

E-mail is the preferred way of reaching me. Before sending me any questions, please read the textbook and the relevant references carefully. When sending me an e-mail, please include "CAP 6673" in the subject line so that I can give it the necessary time and attention.

Text Book

The prescribed textbook for this course is Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank.

Data Sets

The data sets are from a project labeled CCCS.
We will use two datasets:
- Fit data set: used to fit the models.
- Test data set: used to evaluate the performance of the selected model on fresh (unseen) data.
You may download the two datasets by clicking on the links below:

Modeling Tool

All the experiments for the assignments of this course will be performed
using WEKA. WEKA is open-source software issued under the GNU General Public License, and can be downloaded for free from
http://www.cs.waikato.ac.nz/ml/weka/index.html. If you do not already have the Java Runtime Environment (JRE), or have an old version, you should download the version of Weka that comes bundled with the JRE. You can refer to Chapter 9 of the textbook if you need assistance with the tool. You may use both the Command Line Interface (CLI) and the Graphical User Interface (GUI).
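For those new to the CLI, here is a minimal sketch of how Weka is typically launched, assuming weka.jar has been downloaded to the current directory (the data file name fit.arff is a placeholder for your own file):

    # Start the Weka GUI Chooser
    java -jar weka.jar

    # Run a classifier directly from the CLI; with only -t (the training
    # file) given, Weka reports 10-fold cross-validation results by default
    java -cp weka.jar weka.classifiers.trees.J48 -t fit.arff

The command lines sketched in the assignments below assume weka.jar is already on your classpath.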

References

A number of research papers related to the coursework can be found on the References Page. Students are encouraged to read and understand these papers in order to follow the material covered, or to be covered later, in this course. Also available under the References section are the slides for the topics covered in this course.

Due dates

Due dates are not flexible. Please submit your reports on the due date. Organize yourself and start at once; some experiments are very time consuming
if you have a slow computer. Do not start two days before the due date. All students should submit one PDF file of their homework, named yourname_hwk*.pdf, to CANVAS by 11:59 PM on the due date. (NOTE: the * stands for the current homework number. For example, if you are handing in Homework 2, then "*" will be "2".)

Reports

(1) The reports should include the detailed results of your
experiments. Make sure you present your results in a synthesized, digestible form, not just a raw printout of the Weka output.
(2) Experimental work presented without any analysis is useless. Please analyze your results and draw meaningful conclusions.
(3) There is no exact template for the reports, but you must organize them in a way that makes sense. For example, in the report for the preparation of the datasets, you do not need to print the whole datasets (that is several pages of data); just print the header and a few instances.
(4) Do not forget to describe the methodology you used for your experiments.
(5) Look at the references to get an idea of the way researchers present, summarize, and analyze their results.

Grading

The grading for the project will be based solely on your reports, so
make sure you not only work hard, but also document your work well. The following are a few important areas that will be considered while grading your report:
- Well-performed experiments following a good methodology
- Good presentation of the experiments (use tables)
- Synthesized analysis and comparison of the results of different models
- Conclusions
A printout of the results without any explanation, comparison, or conclusion is not considered sufficient.

I-A) Engineering the input: Preparing the datasets

Due date: Feb. 10, '24

The assignment is to prepare the datasets in a format required to be
used with Weka. You need to convert the files into the ARFF format described in Section 2.4 of the textbook. Prepare two sets of fit and test files. One set will be used to build and evaluate prediction models, and the other to build and evaluate classification models:

SET 1: [For Prediction]
Please make sure you label the data correctly and comment the ARFF file
(instances, attributes, date, author, ...). The original data file has 9 columns. Following is the description of what each column represents (in the same order):

A sample of an ARFF file for classification is available here.
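In case the sample link is unavailable, the following is a minimal sketch of the ARFF structure for the classification set; the relation name, attribute names, and data values are illustrative placeholders, not the actual CCCS columns:

    % CCCS fit data set -- prepared by <your name>, <date>
    @relation cccs-fit

    % Placeholder attributes; use the real column names from the data
    @attribute metric1 numeric
    @attribute metric2 numeric
    @attribute class {fp,nfp}

    @data
    12,31,nfp
    25,47,fp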

I-B) Modeling assignment: Prediction

Due date: Feb. 10, '24

This part of the project will build models to predict the number of faults based on the other attributes of the instances. Each model is to be first built and evaluated using 10-fold cross-validation on the fit data set, and then validated using the test data set. Use the data sets prepared for prediction in the previous assignment of the project. Build the following prediction models:

For linear regression, compare the model selection methods: greedy, M5,
no selection. Compare the models: how many and which independent variables were selected? Use the statistical indicators provided by Weka to perform the comparisons. Your report should include all the results based on 10-fold cross-validation and on the test data set. You should also compare the results of all the methods.
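As a sketch, assuming the prediction files are named fit.arff and test.arff, the three attribute-selection variants of Weka's LinearRegression can be compared from the CLI via the -S flag (0 = M5 method, 1 = no selection, 2 = greedy):

    # M5 attribute selection (the default)
    java weka.classifiers.functions.LinearRegression -S 0 -t fit.arff -T test.arff

    # No attribute selection
    java weka.classifiers.functions.LinearRegression -S 1 -t fit.arff -T test.arff

    # Greedy attribute selection
    java weka.classifiers.functions.LinearRegression -S 2 -t fit.arff -T test.arff

Omitting -T makes Weka report 10-fold cross-validation results on the fit data instead.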

II) Modeling assignment: Classification using decision trees

Due date: Feb. 24, '24

Part 1: Initial tree

This part of the project will allow you to predict a class (fp, nfp) using J48 (C4.5), a decision-tree-based classification algorithm. Build a classification model with J48 (C4.5) using the fit data set and 10-fold cross-validation. Determine the misclassification error rates (%) for both types of misclassification from the confusion matrix. Record the number of leaves and nodes in the selected tree, and represent the tree in the same way as in the textbook. Repeat the previous tasks using the test data set to evaluate the model.
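A minimal CLI sketch for this part, with file names assumed as above:

    # Build J48 with default settings; with only -t given, Weka reports
    # 10-fold cross-validation results, including the confusion matrix
    java weka.classifiers.trees.J48 -t fit.arff

    # Evaluate the model on the test data set
    java weka.classifiers.trees.J48 -t fit.arff -T test.arff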

Part 2: Unpruned tree

Now in the J48 options, set the unpruned option to true. Rebuild the model in the same way as above and repeat all the steps. Now that you have represented the unpruned tree, compare it with the tree generated above and determine the part that was pruned.
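A sketch of the corresponding CLI run (-U disables pruning; file name assumed as before):

    # Unpruned J48 tree, 10-fold cross-validation on the fit data
    java weka.classifiers.trees.J48 -U -t fit.arff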

Part 3: Confidence Factor

Now in the J48 options, set the confidence factor (C) to 0.01. Rebuild the model in the same way as for the initial tree (Part 1) and repeat all the steps of Part 1. How does the size of the new tree compare to the one built in Part 1? Explain why. What part was pruned?
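A sketch of the corresponding CLI run (the -C option sets J48's pruning confidence factor; file name assumed as before):

    # J48 with the pruning confidence factor lowered to 0.01
    java weka.classifiers.trees.J48 -C 0.01 -t fit.arff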

Part 4: Cost sensitivity

Until now, we have made no distinction between a Type I and a Type II error. However, in software quality classification, a Type II error is more serious than a Type I error. Here, our objective is to obtain balanced misclassification rates with the Type II rate as low as possible. Use the cost-sensitive classifier combined with J48, and determine the optimal cost ratio (set the cost of a Type I error to 1 and vary the cost of the Type II error), using 10-fold cross-validation on the fit data set. Observe the trends in the misclassification rates. What happens when the cost of a Type II error decreases/increases? Evaluate all the models on the test data set.
*** For tips on performing cost-sensitive classification, click here. ***
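As a sketch of one such run, Weka's CostSensitiveClassifier can wrap J48 and accept a cost matrix in single-line format; the Type II cost of 2 below is only an example, and which matrix entry corresponds to the Type II error depends on the order of your class labels, so check the cost matrix Weka echoes back:

    # Cost-sensitive J48: one error cost fixed at 1, the other varied (here 2);
    # options after "--" are passed to the base classifier (J48)
    java weka.classifiers.meta.CostSensitiveClassifier \
        -cost-matrix "[0.0 2.0; 1.0 0.0]" \
        -t fit.arff -W weka.classifiers.trees.J48 -- -C 0.25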

III) Modeling assignment: Using meta learning schemes with a strong and a weak learner for classification

Due date: March 16, '24

This last section of the project will allow you to evaluate the benefits of using meta learners on this data set. Using the same methods as above (10-fold cross-validation on the fit data), determine the preferred (optimal cost ratio) model, and evaluate the models using the test data set. Use the default settings for the meta learners (bagging, boosting, ...) and the learners (J48, decision stump), but vary the cost ratio in the same way as in Part 4 of Assignment II. Provide the command lines you used. How do the results of each classifier compare to the cost-sensitive tree obtained in Part 4 of Assignment II? Comment. Now, set the number of iterations of each meta learner to 25 and repeat the experiments. Don't forget to analyze the results.
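A sketch of the kind of command lines this assignment asks for, with file names assumed as before (-I sets the number of iterations for the 25-iteration experiments; omit it to keep the defaults):

    # Bagging with J48 as the strong base learner
    java weka.classifiers.meta.Bagging -I 25 -W weka.classifiers.trees.J48 -t fit.arff -T test.arff

    # AdaBoostM1 boosting with decision stumps as the weak base learner
    java weka.classifiers.meta.AdaBoostM1 -I 25 -W weka.classifiers.trees.DecisionStump -t fit.arff -T test.arff

To vary the cost ratio, wrap each meta learner in CostSensitiveClassifier as in Part 4 of Assignment II.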