This page provides students in the **CAP 6673: Data Mining and Machine Learning** class with all the necessary information for the course.

**David Wilson**

**E-mail:**

**E-mail is the preferred and best way of reaching me. Before sending me any questions, please read the textbook and relevant references carefully.**
**When sending me an e-mail, please include "CAP 6673" in the subject line so that I can give it the necessary time and attention.**

The prescribed textbook for this course is **Data Mining and Machine Learning** by Frank and Witten.

Data Sets

The data sets are from a project labeled as CCCS. We will use two datasets:

- **Fit data set** to fit the models
- **Test data set** to evaluate the performance of the selected model on fresh (unseen) data

You may download the two datasets by clicking on the links below:

All the experiments for the assignments of this course will be performed using WEKA.

WEKA is open source software issued under the GNU General Public License and can be downloaded for free from http://www.cs.waikato.ac.nz/ml/weka/index.html.

If you do not already have the Java Runtime Environment (JRE), or have an old version, you should download the latest version of Weka that comes bundled with the JRE.

You can refer to Chapter 9 of the textbook if you need assistance with the tool.

You may use both the Command Line Interface (CLI) and the Graphical User Interface (GUI).

**A number of research papers related to the coursework can be found on the References Page.**
Students are encouraged to read and understand these papers in order to follow the material covered now and later in this course. The slides for the topics covered in this course are also available in the References section.

Due dates are not flexible. Please submit your reports by the due date. Make sure you organize yourself and start at once: **some experiments are very time consuming** if you have a slow computer. Do not wait until two days before the due date to start.

Each student should submit one .pdf file of their homework, named yourname_hwk*.pdf, to CANVAS by 11:59 PM on the due date. (NOTE: the * stands for the current homework number. For example, if you are handing in Homework 2, then "*" will be "2".)

(1) The reports should include the detailed results of your experiments. Make sure you present your results in a synthesized, summarized form (not just a printout of the output provided by Weka).

(2) Experimental work presented without any analysis is useless. Please
**analyze** your results, and **draw meaningful conclusions.**

(3) There is no exact template for the reports, but you have to organize them in a way that makes sense. For example, in the report for the preparation of the data sets, you do not need to print the whole data sets (that is several pages of data); just print the header and a few instances.

(4) Do not forget to provide the methodology you used for your experiments.

(5) Look at the references to get an idea of how researchers present, summarize, and analyze their results.

The grading for the project will be based solely on your reports, so make sure you not only work hard, but also document your work well. The following are a few important areas that will be considered when grading your report:

- Well-performed experiments following a good methodology
- Good presentation of the experiments (use tables)
- Synthesized analysis and comparison of the results of different models
- Conclusions

A printout of the results without any explanation, comparison, or conclusion **is not** considered sufficient.

**Due date: Feb. 09, '21**

The assignment is to prepare the data sets in the format required by Weka. You need to convert the files into the ARFF format described in Section 2.4 of the textbook. Prepare two sets of fit and test files. One set will be used to build and evaluate prediction models, and the other to build and evaluate classification models:

**SET 1: [For Prediction]**

**fit.arff** file to build the prediction models. You only need to reformat the original fit file to the ARFF format without any changes and add the required labels.

**test.arff** file to evaluate the prediction models. Same as above.

**SET 2: [For Classification]**

**fit.arff** file to build classification models. You need to add a column describing the class of each module: fault-prone (fp) or not fault-prone (nfp). Fault proneness is based on a threshold on the number of faults. In this assignment, modules with fewer than 2 faults are considered nfp, and modules with 2 or more faults are considered fp. Make sure you do not use the number of faults column as an independent variable while doing classification.

**test.arff** file to evaluate the classification models. Same as above.

Please make sure you label the data correctly and comment the ARFF file (instances, attributes, date, author, etc.).

The original data file has 9 columns. The following list describes what each column represents (in the same order):

- Number of unique operators (NUMUORS)
- Number of unique operands (NUMUANDS)
- Total number of operators (TOTOTORS)
- Total number of operands (TOTOPANDS)
- McCabe's cyclomatic complexity (VG)
- Number of logical operators (NLOGIC)
- Lines of code (LOC)
- Executable lines of code (ELOC)
- Number of faults (FAULTS)

A sample of an ARFF file for classification is available here.
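As a rough illustration of the preparation step, the sketch below builds ARFF text from rows holding the nine values listed above. It is only one possible approach, not a required one. The relation name, the author string, the `CLASS` attribute name, and the choice to drop FAULTS from the classification file (rather than keep it and exclude it inside Weka) are all assumptions made for the example.

```python
from datetime import date

# Column names in the order given in the description above.
COLUMNS = ["NUMUORS", "NUMUANDS", "TOTOTORS", "TOTOPANDS",
           "VG", "NLOGIC", "LOC", "ELOC", "FAULTS"]
FP_THRESHOLD = 2  # modules with 2 or more faults are fault-prone (fp)


def to_arff(rows, relation, classification=False, author="yourname"):
    """Return ARFF text for rows of 9 numeric values each.

    With classification=True, FAULTS is replaced by a nominal CLASS
    attribute (an assumed name) so that the fault count cannot be
    used as an independent variable.
    """
    lines = [
        f"% Relation: {relation}",
        f"% Author: {author}",
        f"% Date: {date.today().isoformat()}",
        f"% Instances: {len(rows)}",
        f"% Attributes: {len(COLUMNS)}",
        "",
        f"@relation {relation}",
        "",
    ]
    for name in COLUMNS[:-1]:
        lines.append(f"@attribute {name} numeric")
    if classification:
        lines.append("@attribute CLASS {fp,nfp}")
    else:
        lines.append("@attribute FAULTS numeric")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} values, got {len(row)}")
        *metrics, faults = row
        target = faults
        if classification:
            target = "fp" if faults >= FP_THRESHOLD else "nfp"
        lines.append(",".join(str(v) for v in [*metrics, target]))
    return "\n".join(lines) + "\n"
```

For example, `to_arff([[3, 5, 10, 12, 2, 1, 40, 30, 0]], "cccs_fit", classification=True)` yields a data line ending in `nfp` (the row and the relation name are made up). Reading the original file into `rows` depends on its delimiter and is left out here.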

**Due date: Feb. 09, '21**

This part of the project will build models to predict the number of faults based on the other attributes of the instances. Each model is to be first built and evaluated using 10-fold cross validation on the fit data set, and then validated using the test data set. Use the data sets prepared for prediction in the previous assignment of the project.

Build the following prediction models:

- Linear Regression
- Decision Stump

For linear regression, compare the attribute selection methods: greedy, M5, and no selection. Compare the resulting models: how many and which independent variables were selected? Use the statistical indicators provided by Weka to perform the comparisons.

Your report should include all the results based on 10-fold cross-validation and on the test data set. You should also compare the results of all the methods.
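To make the comparison concrete, the sketch below recomputes three of the indicators Weka reports for numeric prediction (correlation coefficient, mean absolute error, root mean squared error) from lists of actual and predicted fault counts. The function name and the dictionary keys are hypothetical; Weka prints these values itself, so this is only a way to check your understanding of what each indicator measures.

```python
import math


def regression_summary(actual, predicted):
    """Return correlation, MAE and RMSE for paired actual/predicted values."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Pearson correlation between actual and predicted values.
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    corr = cov / math.sqrt(var_a * var_p)

    return {"correlation": corr, "mae": mae, "rmse": rmse}
```

A model with a higher correlation and lower MAE and RMSE on the test data set is generally the better predictor. Large gaps between the cross-validation and test values are worth discussing in the report.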

**Due date: Feb. 23, '21**

Part 1: Initial tree

This part of the project will allow you to predict a class (fp, nfp) using J48 (C4.5), a decision tree based classification algorithm.

Build a classification model with J48 (C4.5) using the fit data set and 10-fold cross validation. Determine the misclassification error rates (%) for both types of misclassification from the confusion matrix.

- Type I: an nfp module is classified as fp
- Type II: an fp module is classified as nfp
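The two rates can be read off the confusion matrix as sketched below. In Weka's output each row is an actual class and each column a predicted class. The sketch assumes each rate is taken relative to the number of modules in the actual class (Type I over all nfp modules, Type II over all fp modules); the function and argument names are hypothetical.

```python
def misclassification_rates(nfp_as_nfp, nfp_as_fp, fp_as_nfp, fp_as_fp):
    """Return (type_i, type_ii, overall) error rates in percent.

    Arguments are the four cells of a 2x2 confusion matrix,
    named as <actual class>_as_<predicted class>.
    """
    total_nfp = nfp_as_nfp + nfp_as_fp
    total_fp = fp_as_nfp + fp_as_fp
    type_i = 100.0 * nfp_as_fp / total_nfp    # nfp modules classified as fp
    type_ii = 100.0 * fp_as_nfp / total_fp    # fp modules classified as nfp
    overall = 100.0 * (nfp_as_fp + fp_as_nfp) / (total_nfp + total_fp)
    return type_i, type_ii, overall
```

With made-up counts, `misclassification_rates(80, 20, 10, 40)` gives a Type I rate of 20%, a Type II rate of 20%, and an overall rate of 20%.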

Record the number of leaves and nodes in the selected tree, and represent the tree in the same way as in the textbook.

Repeat the previous tasks using the test data set to evaluate the model.

Part 2: Unpruned tree

Now in the J48 options, set the unpruned option to true. Rebuild the model in the same way as above and repeat all the steps.

Now that you have represented the unpruned tree, compare with the tree generated above, and determine the part that was pruned.

Part 3: Confidence Factor

Now in the J48 options, set the confidence factor (C) to 0.01. Rebuild the model in the same way as for the initial tree (Part 1) and repeat all the steps of Part 1.

How does the size of the new tree compare to the one built in Part 1? Explain why. What part was pruned?

Part 4: Cost sensitivity

Until now, we did not make any distinction between a Type I and a Type II error. However, in Software Quality Classification, a Type II error is more serious than a Type I error. Here, our objective is to obtain balanced misclassification rates with the Type II rate as low as possible.

Use the cost sensitive classifier combined with J48, and determine the optimal cost ratio (set the cost of a Type I error to 1 and vary the cost of a Type II error), using 10-fold cross validation on the fit data set. Observe the trends in the misclassification rates. What happens when the cost of a Type II error decreases or increases?
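Once the rates for several cost ratios have been recorded, the selection can be sketched as follows. This is one possible reading of "balanced with Type II as low as possible", offered as an assumption rather than the required rule: pick the ratio whose Type I and Type II rates are closest, breaking ties in favor of the lower Type II rate. The function name is hypothetical.

```python
def pick_cost_ratio(results):
    """Pick the preferred entry from (cost_ratio, type_i_pct, type_ii_pct) tuples.

    Assumed rule: smallest gap between the two error rates,
    then the lowest Type II rate among equally balanced candidates.
    """
    return min(results, key=lambda r: (abs(r[1] - r[2]), r[2]))
```

For instance, with the made-up results `[(1, 10.0, 40.0), (2, 18.0, 27.0), (3, 24.0, 22.0), (5, 35.0, 12.0)]`, the gaps are 30, 9, 2, and 23 points, so the entry for cost ratio 3 is selected. Whatever rule you use, state it in your methodology.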

Evaluate all the models on the test data set.

For tips on performing cost sensitive classification, click here.

**Due date: March 09, '21**

This last section of the project will allow you to evaluate the benefits of using meta learners on this data set:

Using the same methods
as above (10-fold cross-validation on the fit data), determine the preferred
(optimal cost ratio) model, and evaluate the models using the test dataset
for:

- Cost sensitive classifier combined with bagging and J48
- Cost sensitive classifier combined with bagging and Decision Stump
- Cost sensitive classifier combined with boosting (AdaBoostM1) and J48
- Cost sensitive classifier combined with boosting (AdaBoostM1) and Decision Stump

Use the default settings for the meta learners (bagging, boosting) and the learners (J48, Decision Stump), but vary the cost ratio in the same way as in Part 4 of Assignment II. Provide the command lines you used.

How do the results of each classifier compare to the cost-sensitive tree obtained in Part 4 of Assignment II? Comment.

Now, set the number of iterations of each meta learner to 25 and repeat the experiments. Don't forget to analyze the results.

**Due date: March 16, '21**

- This assignment involves ordering of software modules, based on the number of faults predicted by a software quality prediction model. Use the predictions obtained with Linear Regression Models we built in Homework I-B. You will use only two of the three models you obtained in Homework I-B. The models to be used for this assignment are as follows:

  - Linear Regression Model with __M5 Method of Attribute Selection__:

    FAULTS = - 0.0516 * NUMUORS + 0.0341 * NUMUANDS - 0.0027 * TOTOTORS - 0.0372 * VG + 0.2119 * NLOGIC + 0.0018 * LOC + 0.005 * ELOC - 0.3091

  - Linear Regression Model with __Greedy Method of Attribute Selection__:

    FAULTS = - 0.0482 * NUMUORS + 0.0336 * NUMUANDS - 0.0021 * TOTOTORS - 0.0337 * VG + 0.2088 * NLOGIC + 0.0019 * LOC - 0.3255

- Obtain the predictions for both the fit data set and the test data set using the above two models. Perform Module Order Modeling (MOM) for both fit and test data sets using both regression models.
- Compare the performances of MOM for both linear regression models. Use an Alberg diagram and a performance curve for each model, using the fit and test data sets.
- Use tables to summarize the results of MOM. Also provide an analysis of your summary.
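The calculation behind these steps can be sketched as below, using the greedy model's coefficients from the equation above. This is an illustration under stated assumptions, not a substitute for the definitions in References #5 and #9. It assumes the Alberg curve is the cumulative share of actual faults when modules are taken in decreasing order of predicted faults, and that the performance curve is the ratio of that share to the share captured by a perfect ranking (modules ordered by actual faults). All function names are hypothetical.

```python
# Coefficients of the greedy-selection model given above.
GREEDY = {"NUMUORS": -0.0482, "NUMUANDS": 0.0336, "TOTOTORS": -0.0021,
          "VG": -0.0337, "NLOGIC": 0.2088, "LOC": 0.0019}
GREEDY_INTERCEPT = -0.3255


def predict_faults(module, coefficients=GREEDY, intercept=GREEDY_INTERCEPT):
    """Apply a linear regression equation to one module (dict of metrics)."""
    return intercept + sum(w * module[name] for name, w in coefficients.items())


def cumulative_faults(actual, ranking_key):
    """Cumulative share of actual faults, taking modules in
    decreasing order of ranking_key."""
    total = sum(actual)
    if total == 0:
        raise ValueError("the data set contains no faults")
    order = sorted(range(len(actual)), key=lambda i: ranking_key[i], reverse=True)
    shares, running = [], 0
    for i in order:
        running += actual[i]
        shares.append(running / total)
    return shares


def module_order_curves(actual, predicted):
    """Return (model curve, perfect curve, performance curve)."""
    model = cumulative_faults(actual, predicted)   # ranking by predicted faults
    perfect = cumulative_faults(actual, actual)    # ranking by actual faults
    performance = [m / p for m, p in zip(model, perfect)]  # assumed definition
    return model, perfect, performance
```

Entry k of each curve corresponds to selecting the top k+1 modules, so dividing k+1 by the number of modules gives the percentage-of-modules axis. The same columns (predicted faults, rank, cumulative actual faults, ratio) can be built in a spreadsheet.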

NOTE: You should NOT start working on this assignment until you have read References #5 and #9.

- You DO NOT need to use Weka or any other software tool to do this homework. Any spreadsheet program like Microsoft Excel will suffice.