CAP 6673: Data Mining and Machine Learning

Spring 2023

This page was last updated on 01/17/2023

This page provides students in the CAP 6673: Data Mining and Machine Learning class with all the necessary information for the course.


Preston Billion Polak

E-mail is the preferred and the best way of reaching me. Before sending me any questions, please read the textbook and relevant references carefully. When sending me an E-mail, please include "CAP 6673" in the subject so that I can give it the necessary time and attention. 

Textbook

The prescribed textbook for this course is Data Mining and Machine Learning by Frank and Witten.



Data Sets

The data sets are from a project labeled as CCCS. We will use two datasets:

  1. Fit data set, to fit the models.

  2. Test data set, to evaluate the performance of the selected model on fresh (unseen) data.

You may download the two datasets by clicking on the links below:

Fit data
Test data

Modeling Tool

All the experiments for the assignments of this course will be performed using WEKA. 

  • WEKA is open-source software issued under the GNU General Public License, and can be downloaded for free from

  • If you do not already have the Java Runtime Environment (JRE), or have an old version, you should download the latest version of Weka that comes with the JRE.

  • You can refer to Chapter 9 of the textbook if you need assistance with the tool.

  • You may use both the Command Line Interface (CLI) and the Graphical User Interface (GUI).


A number of research papers related to the coursework can be found on the References Page. Students are encouraged to read and understand these papers in order to follow the material covered now and later in this course. The slides for the topics covered in this course are also available in the reference section.

Due dates

Due dates are not flexible. Please submit your reports by the due date. Make sure you organize yourself and start at once - some experiments are very time consuming if you have a slow computer. Do not wait until two days before the due date to start.

Each student should submit one .pdf file of their homework named yourname_hwk*.pdf to CANVAS by 11:59 PM on the due date. (NOTE: the * stands for the current homework number. For example, if you are handing in Homework 2, then "*" will be "2".)


(1) The reports should include the detailed results of your experiments. Make sure you present your results in a synthesized form, such as summary tables (not just a printout of the results provided by Weka).

(2) Experimental work presented without any analysis is useless. Please analyze your results, and draw meaningful conclusions.

(3) There is no exact template for the reports, but you have to organize them in a way that makes sense. For example, in the report for the preparation of the datasets, you do not need to print the whole datasets (that is several pages of data); just print the header and a few instances.

(4) Do not forget to provide the methodology you used for your experiments.

(5) One can look at the references to get an idea of the way researchers present, summarize, and analyze the results.


The grading for the project will be based solely on your reports, so make sure you not only work hard, but also document your work well. The following are a few important areas that will be considered while grading your report:

A printout of the results without any explanation, comparison or conclusion is not considered sufficient.

I-A) Engineering the input: Preparing the datasets

Due date: Feb. 09, '23

The assignment is to prepare the datasets in the format required by Weka. You need to convert the files into the ARFF format described in Section 2.4 of the textbook. Prepare two sets of fit and test files. One set will be used to build and evaluate prediction models, and the other to build and evaluate classification models.

Please make sure you label the data correctly and comment the ARFF file (instances, attributes, date, author, etc.).

The original data file has 9 columns. Following is the description of what each column represents (in the same order):

  1. Number of unique operators (NUMUORS)
  2. Number of unique operands (NUMUANDS)
  3. Total number of operators (TOTOTORS)
  4. Total number of operands (TOTOPANDS)
  5. McCabe's cyclomatic complexity (VG)
  6. Number of logical operators (NLOGIC)
  7. Lines of code (LOC)
  8. Executable lines of code (ELOC)
  9. Number of faults (FAULTS)

A sample of an ARFF file for classification is available here.
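
As an illustration of the conversion, here is a minimal Python sketch that writes the prediction variant of the ARFF file, with all nine columns declared numeric in the order listed above. The function names, the relation name, and the assumption that the raw file is whitespace- or comma-separated are hypothetical; check them against the actual data files before relying on this.

```python
from datetime import date

# Attribute names in the column order given in the assignment.
ATTRIBUTES = ["NUMUORS", "NUMUANDS", "TOTOTORS", "TOTOPANDS",
              "VG", "NLOGIC", "LOC", "ELOC", "FAULTS"]


def _num(token):
    """Convert a token to int when possible, otherwise to float."""
    try:
        return int(token)
    except ValueError:
        return float(token)


def parse_raw(text):
    """Parse raw lines; ASSUMES whitespace- or comma-separated numbers."""
    rows = []
    for line in text.splitlines():
        fields = line.replace(",", " ").split()
        if fields:  # skip blank lines
            rows.append([_num(f) for f in fields])
    return rows


def to_arff(rows, relation, author):
    """Return ARFF text (prediction variant: every attribute numeric)."""
    lines = [
        f"% Relation: {relation}",
        f"% Author: {author}",
        f"% Date: {date.today().isoformat()}",
        f"% Instances: {len(rows)}, attributes: {len(ATTRIBUTES)}",
        f"@relation {relation}",
        "",
    ]
    lines += [f"@attribute {name} numeric" for name in ATTRIBUTES]
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(ATTRIBUTES):
            raise ValueError(f"expected {len(ATTRIBUTES)} values, got {len(row)}")
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"
```

For the classification set, the last attribute would be a nominal class (fp, nfp) instead of the numeric FAULTS column. The rule for deriving the class from the fault count is not stated on this page, so it is deliberately left out of the sketch.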

I-B) Modeling assignment: Prediction

Due date: Feb. 09, '23

This part of the project will build models to predict the number of faults based on the other attributes of the instances. Each model is to be first built and evaluated using 10-fold cross validation on the fit data set, and then validated using the test data set. Use the data sets prepared for prediction in the previous assignment of the project.

Build the following prediction models:

  1. Linear Regression
  2. Decision Stump

For linear regression, compare the model selection methods: greedy, M5, and no selection. Compare the resulting models: how many and which independent variables were selected? Use the statistical indicators provided by Weka to perform the comparisons.

Your report should include all the results based on 10-fold cross-validation and on the test data set. You should also compare the results of all the methods.
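
To make the comparison concrete, the sketch below computes two of the error indicators that Weka reports for numeric prediction, the mean absolute error and the root mean squared error. It is only meant to show what the numbers measure; the function names are hypothetical and the values in your report should come from Weka's own output.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all instances."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_squared_error(actual, predicted):
    """Square root of the average squared difference."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))
```

Lower values are better for both indicators. The RMSE penalizes large individual errors more heavily than the MAE, which is worth keeping in mind when two models rank differently on the two measures.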

II) Modeling assignment: Classification using decision trees

Due date: Feb. 23, '23

Part 1: Initial tree

This part of the project will allow you to predict a class (fp, nfp) using J48 (C4.5), a decision tree based classification algorithm.

Build a classification model with J48 (C4.5) on the fit data set, using 10-fold cross validation. Determine the misclassification error rates (%) for both types of misclassifications from the confusion matrix.

  1. Type I: an nfp module is classified as fp
  2. Type II: an fp module is classified as nfp
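
The two rates can be read off the confusion matrix as sketched below. This assumes each rate is expressed relative to the number of modules in the actual class (Type I over all nfp modules, Type II over all fp modules); the function and parameter names are hypothetical.

```python
def misclassification_rates(nfp_as_nfp, nfp_as_fp, fp_as_nfp, fp_as_fp):
    """Return (Type I %, Type II %) from the four confusion-matrix counts."""
    total_nfp = nfp_as_nfp + nfp_as_fp  # all modules that are actually nfp
    total_fp = fp_as_nfp + fp_as_fp     # all modules that are actually fp
    if total_nfp == 0 or total_fp == 0:
        raise ValueError("each class needs at least one instance")
    type_1 = 100.0 * nfp_as_fp / total_nfp  # nfp classified as fp
    type_2 = 100.0 * fp_as_nfp / total_fp   # fp classified as nfp
    return type_1, type_2
```

When reading Weka's confusion matrix, check which row corresponds to which class, since the order follows the class declaration in the ARFF file.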

Record the number of leaves and nodes in the selected tree, and represent the tree in the same way as in the textbook.

Repeat the previous tasks using the test data set to evaluate the model.

Part 2: Unpruned tree

Now in the J48 options, set the unpruned option to true. Rebuild the model in the same way as above, and repeat all steps.

Now that you have represented the unpruned tree, compare with the tree generated above, and determine the part that was pruned.

Part 3: Confidence Factor

Now in the J48 options, set the confidence factor (C) to 0.01. Rebuild the model in the same way as for the initial tree (Part 1), and repeat all the steps of Part 1.

How does the size of the new tree compare to the one built in Part 1? Explain why. What part was pruned?

Part 4: Cost sensitivity

Until now, we have not made any distinction between a Type I and a Type II error. However, in software quality classification, a Type II error is more serious than a Type I error. Here, our objective is to obtain balanced misclassification rates with the Type II rate as low as possible.

Use the cost sensitive classifier combined with J48, and determine the optimal cost ratio (set the cost of a Type I error to 1 and vary the cost of the Type II error), using 10-fold cross validation on the fit data set. Observe the trends in the misclassification rates. What happens when the cost of a Type II error decreases/increases?
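
One possible way to organize the search for the cost ratio is sketched below. It builds the 2x2 cost matrix for a given Type II cost and picks, from a list of recorded runs, the one whose two error rates are closest, preferring the lower Type II rate on ties. Both the selection rule and the class order (nfp first, fp second, rows as actual class) are assumptions made for illustration, not the prescribed method; the names are hypothetical.

```python
def cost_matrix(type_2_cost, type_1_cost=1.0):
    """2x2 cost matrix; rows = actual class, columns = predicted class.

    ASSUMES the class order is (nfp, fp), so cell [0][1] is the cost of a
    Type I error and cell [1][0] is the cost of a Type II error.
    """
    return [[0.0, type_1_cost],
            [type_2_cost, 0.0]]


def preferred_run(results):
    """Pick the most balanced run from (cost_ratio, type_1_pct, type_2_pct) tuples.

    Balance is measured as |Type I - Type II|; ties go to the lower Type II rate.
    """
    return min(results, key=lambda r: (abs(r[1] - r[2]), r[2]))
```

For example, with made-up results such as (1, 5.0, 40.0), (2, 12.0, 25.0), (3, 18.0, 19.0) and (5, 30.0, 10.0), this rule selects the run with cost ratio 3. Whatever rule you use, state it explicitly in the methodology section of your report.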

Evaluate all the models on the test data set.

*** For tips on performing cost sensitive classification, click here. ***

III) Modeling assignment: Using meta learning schemes with a strong and a weak learner for classification.

Due date: March 16, '23

This last section of the project will allow you to evaluate the benefits of using meta learners on this data set:

Using the same methods as above (10-fold cross-validation on the fit data), determine the preferred (optimal cost ratio) model, and evaluate the models using the test dataset for:

  1. Cost sensitive classifier combined with bagging and J48 
  2. Cost sensitive classifier combined with bagging and Decision Stump 
  3. Cost sensitive classifier combined with boosting (AdaBoostM1) and J48 
  4. Cost sensitive classifier combined with boosting (AdaBoostM1) and Decision Stump

Use the default settings for the meta learners (bagging, boosting) and the learners (J48, Decision Stump), but vary the cost ratio in the same way as in Part 4 of Assignment II. Provide the command lines you used.

How do the results of each classifier compare to the cost-sensitive tree obtained in Part 4 of Assignment II? Comment.

Now, set the number of iterations of each meta learner to 25 and repeat the experiments. Don't forget to analyze the results.
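
Since the command lines have to be reported, it can help to generate them systematically. The sketch below assembles a WEKA command line for a cost sensitive classifier wrapped around a meta learner and a base learner. The file names, the weka.jar path, and the class order behind the cost matrix (nfp first, fp second) are assumptions, and the option names (-cost-matrix, -W, -I, -t, -T, -x) are given as I recall them from WEKA's documentation; verify them against the version you have installed.

```python
def weka_command(meta, base, type_2_cost, iterations=None,
                 fit="fit.arff", test=None, jar="weka.jar"):
    """Build a command line for CostSensitiveClassifier(meta(base)).

    meta: "Bagging" or "AdaBoostM1"; base: "J48" or "DecisionStump".
    If test is None, 10-fold cross validation is run on the fit file;
    otherwise the model is evaluated on the given test file.
    """
    # ASSUMED class order (nfp, fp): Type I cost 1, Type II cost varies.
    matrix = f"[0.0 1.0; {float(type_2_cost)} 0.0]"
    parts = ["java", "-cp", jar,
             "weka.classifiers.meta.CostSensitiveClassifier",
             "-cost-matrix", f'"{matrix}"', "-t", fit]
    parts += ["-x", "10"] if test is None else ["-T", test]
    # Options after "--" are passed on to the wrapped meta learner.
    parts += ["-W", f"weka.classifiers.meta.{meta}", "--"]
    if iterations is not None:
        parts += ["-I", str(iterations)]
    parts += ["-W", f"weka.classifiers.trees.{base}"]
    return " ".join(parts)
```

Calling weka_command("Bagging", "J48", 5) gives the cross-validation command with default iterations, and passing iterations=25 and test="test.arff" gives the corresponding 25-iteration run on the test data.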

IV) Module Order Modeling

Due date: March 23, '23