Small Project 3:
Using the General Classification Rule to classify modules as fault-prone
or not-fault-prone.
DUE DATE: June 20, '23
- Convert the FIT and TEST datasets.
  - For this assignment, you will be building a classification model rather than a numeric prediction model. Therefore, it is necessary to convert the last column (FAULTS) of the FIT and TEST data sets from a numeric value to a class value.
  - To do this, replace the FAULTS column with a new column called CLASS. Every value in this column should be either “fp” or “nfp”, depending on the number of FAULTS for that instance: “fp” for any instance with 2 or more FAULTS, and “nfp” for any instance with fewer than 2 FAULTS.
  - When you are done, your data set should still have 9 columns. It will be identical to the original dataset, except that the FAULTS column will be gone and a CLASS column will be in its place.
  - You will also have to change the header in the arff file. Where it used to say “@attribute FAULTS numeric”, it should now say “@attribute CLASS {fp,nfp}”.
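The conversion in step 1 can be done by hand or in a spreadsheet, but it can also be scripted. The following is a minimal Python sketch, not a required method. It assumes that FAULTS is the last attribute, that data rows are plain comma-separated values with no quoted commas, and that there are no missing values; the function name `convert_arff` is hypothetical.

```python
def convert_arff(text, threshold=2):
    """Replace the numeric FAULTS attribute with a nominal CLASS attribute.

    Assumes FAULTS is the LAST column of every data row (an assumption
    taken from the assignment text) and that rows contain no quoted commas.
    """
    out = []
    in_data = False
    for line in text.splitlines():
        stripped = line.strip()
        parts = stripped.split()
        if not in_data:
            is_faults_attr = (
                len(parts) >= 2
                and parts[0].lower() == "@attribute"
                and parts[1].upper() == "FAULTS"
            )
            if is_faults_attr:
                # Header change required by the assignment.
                out.append("@attribute CLASS {fp,nfp}")
            else:
                out.append(line)
                if stripped.lower().startswith("@data"):
                    in_data = True
        elif not stripped or stripped.startswith("%"):
            # Keep blank lines and ARFF comments unchanged.
            out.append(line)
        else:
            *metrics, faults = stripped.split(",")
            # 2 or more faults -> "fp"; fewer than 2 -> "nfp".
            label = "fp" if float(faults) >= threshold else "nfp"
            out.append(",".join(metrics + [label]))
    return "\n".join(out) + "\n"
```

Reading the FIT or TEST file into a string, passing it to `convert_arff`, and writing the result to a new file keeps the original data untouched. The column count stays the same because the last value of each row is replaced, not appended.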
- Build a Logistic Regression model using the FIT data and the Weka data mining tool.
  - Open Weka and use the new FIT data set (made in step 1).
  - Build your model like in the first assignment, but choose “Logistic” instead of “Linear Regression”.
  - The Weka tool will provide details about the model created and give statistical results. Weka gives the parameters β0 (intercept) and β1, β2, …, βk (the coefficients) of the Logistic Regression equation:
  At this point, you are DONE with Weka. You will not use Weka for the following steps!
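Once the intercept and coefficients have been copied out of Weka, p can be computed outside Weka for each instance. The sketch below assumes the standard logistic form, p = 1 / (1 + exp(-(b0 + b1*x1 + … + bk*xk))), and assumes that p is the estimated probability of the “fp” class; check both assumptions against Reference 02 and your Weka output. The function name `logistic_probability` is hypothetical, and the same calculation works as a spreadsheet formula.

```python
import math


def logistic_probability(intercept, coefficients, features):
    """Return p for one instance under the ASSUMED standard logistic form.

    intercept    -- b0 from the model output
    coefficients -- [b1, ..., bk] in the same order as the features
    features     -- [x1, ..., xk], the metric values of one instance
    """
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    # Linear part: z = b0 + b1*x1 + ... + bk*xk
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    # Two algebraically equal branches avoid overflow in exp() for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

With 8 metric columns, `coefficients` and `features` would each hold 8 values per call.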
- Use the General Classification Rule to classify instances.
  - If you haven’t already done so, read Reference 02. Sections 2.1 through 2.4 in particular will be very valuable in performing this experiment.
  - Use the above Logistic Regression equation to calculate p for each instance. (Using a spreadsheet program like Excel will be very helpful in this step.)
  - Use the General Classification Rule (as described in class) to classify each instance as either “fp” or “nfp” for different values of c. Use the following values for c: 0.1, 0.5, 1, 2, 3, 4, 5, 6, 10, 15, 20, 30, 40, 50.
  - Select the value of c that generates the most appropriate model. Your goal is to find a value of c where the Type 1 and Type 2 (false-positive and false-negative) error rates are balanced.
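The search over c in step 3 can be sketched as follows. This handout does not spell out the General Classification Rule, so the code ASSUMES the form “classify as fp if p / (1 - p) >= c, otherwise nfp”, with p taken as the probability of “fp”; replace `classify` with the rule given in class if it differs. It also assumes that “balanced” means the smallest absolute difference between the two error rates, that the Type 1 rate is the fraction of actual nfp instances labeled fp, and that the Type 2 rate is the fraction of actual fp instances labeled nfp. All function names are hypothetical.

```python
# Candidate values of c listed in the assignment.
C_VALUES = [0.1, 0.5, 1, 2, 3, 4, 5, 6, 10, 15, 20, 30, 40, 50]


def classify(p, c):
    """ASSUMED rule: 'fp' if p / (1 - p) >= c, else 'nfp'."""
    # Written as p >= c * (1 - p) so that p == 1 does not divide by zero.
    return "fp" if p >= c * (1.0 - p) else "nfp"


def error_rates(probs, actual, c):
    """Return (type1_rate, type2_rate) for one value of c."""
    predicted = [classify(p, c) for p in probs]
    pairs = list(zip(actual, predicted))
    n_nfp = sum(1 for a in actual if a == "nfp")
    n_fp = sum(1 for a in actual if a == "fp")
    # Type 1: actual nfp predicted as fp (false positive).
    false_pos = sum(1 for a, y in pairs if a == "nfp" and y == "fp")
    # Type 2: actual fp predicted as nfp (false negative).
    false_neg = sum(1 for a, y in pairs if a == "fp" and y == "nfp")
    type1 = false_pos / n_nfp if n_nfp else 0.0
    type2 = false_neg / n_fp if n_fp else 0.0
    return type1, type2


def most_balanced_c(probs, actual, c_values=C_VALUES):
    """Pick the c with the smallest |type1 - type2| (ASSUMED criterion)."""
    def imbalance(c):
        type1, type2 = error_rates(probs, actual, c)
        return abs(type1 - type2)
    return min(c_values, key=imbalance)
```

Calling `error_rates` once per value in `C_VALUES` on the FIT data gives the rows of a results table; the chosen c can then be applied unchanged to the TEST data to report its error rates.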
- Report results for both the FIT and TEST data sets.
  - Don’t forget to include the following in your report:
    i. Introductory information (what are we doing, and how are we doing it?)
    ii. Tables and graphs summarizing your results.
    iii. The model you selected and the justification for that selection.
    iv. In-depth discussion of the results, and meaningful conclusions.
NOTES:
- The datasets you create in step 1 should NOT have both a FAULTS and a CLASS column. You are replacing the FAULTS column, not adding an additional column.
- Remember, you are ONLY using Weka for step 2. Once you have the coefficients and intercept of the Logistic Regression model, you are done with Weka.
- The use of a spreadsheet program (for example, Excel) will help you greatly in completing step 3.