CEN 6405: Computer Performance Modeling

Summer 2023

Course Description

Use of software packages such as WEKA and R for data validation, description and analysis of statistical models used in computer science and software engineering.

Assistant:

Robert Kennedy: Email: rkennedy@fau.edu
(Please include "CEN 6405" in the subject line when you send me an email.)

Syllabus


Announcements

It is your responsibility to check this web page regularly for announcements. Assignments, in the form of small projects and/or exams, will be posted on this page during the semester.

Mid-Term Exam date announced.

The Mid-Term Exam, which is going to be a Take-Home Exam, will be posted on CANVAS on June 2, 2023 by 1:00 pm.
You will get 24 hours to take the exam. The exams are to be submitted to CANVAS on June 3, 2023 by 1:00 pm.
All the students (Live and DisL) will take the exam on the same day. No Exceptions!!!

Projects

Small Project 1: Linear Regression Models DUE DATE: June 9, '23

In this project, you will use the WEKA tool.

Please see the References and Resources Section for guidelines on how to get this tool.

This assignment involves building and evaluating fault prediction models using Linear Regression, implemented in WEKA. Your task is to build models to predict the number of faults based on the other attributes of programs in the dataset. Each model is to be built and evaluated using 10-fold cross validation on the fit data set, and then validated using the test data set.
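
For readers who want to see what 10-fold cross validation does to the fit data set, here is a minimal Python sketch of the splitting step only. It is an illustration, not WEKA's own procedure: the function name k_fold_indices, the seed, and the way instances are dealt into folds are all assumptions made for this example.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation.

    Hypothetical helper: WEKA's shuffling and fold assignment may differ.
    """
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at position i,
    # so fold sizes differ by at most one instance.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Example with 25 instances (an arbitrary size chosen for illustration).
splits = list(k_fold_indices(25, k=10))
```

Each instance appears in exactly one test fold, so every module in the fit data set is predicted once by a model that was not trained on it.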

The datasets have already been preprocessed for use in Weka.
You can download the datasets from the link under References & Resources.

Use the fit dataset to build the models with 10-fold cross validation. At the end of each run, WEKA reports several statistical indicators that measure the quality of fit (for the fit data) and the predictive quality (for the test data), as listed below:

Correlation coefficient
Mean absolute error (also called AAE, which stands for Average Absolute Error)
Root mean squared error
Relative absolute error
Root relative squared error
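
As a rough guide to what these five indicators measure, the sketch below computes them in Python from paired actual and predicted fault counts. It is not WEKA's code: the function name regression_metrics is hypothetical, and the two relative errors use the mean of the supplied actual values as the baseline predictor, which is an assumption and may differ from the baseline WEKA uses.

```python
import math


def regression_metrics(actual, predicted):
    """Return the five indicators for paired actual/predicted values.

    Assumes both lists have the same length and that neither is constant
    (otherwise a variance term below is zero and the division fails).
    """
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    abs_err = [abs(p - a) for a, p in zip(actual, predicted)]
    sq_err = [(p - a) ** 2 for a, p in zip(actual, predicted)]

    # Pearson correlation between actual and predicted values.
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    ss_a = sum((a - mean_a) ** 2 for a in actual)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)

    return {
        "correlation": cov / math.sqrt(ss_a * ss_p),
        "mean_absolute_error": sum(abs_err) / n,
        "root_mean_squared_error": math.sqrt(sum(sq_err) / n),
        # Relative errors are percentages of the error made by always
        # predicting the mean of the actual values (assumed baseline).
        "relative_absolute_error": 100 * sum(abs_err)
        / sum(abs(a - mean_a) for a in actual),
        "root_relative_squared_error": 100 * math.sqrt(sum(sq_err) / ss_a),
    }


# Made-up values for illustration only; not taken from the CCCS data.
metrics = regression_metrics([1, 2, 3, 4], [2, 3, 4, 5])
```

A relative error below 100% means the model does better than that simple mean baseline, which is why these two indicators are useful alongside the absolute ones.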

Linear regression models can be built in WEKA with three different options for attribute selection:

No Attribute Selection
M5 method
Greedy method

You must use each attribute selection method to build a model, which gives you three different models. Compare the models: how many independent variables were selected in each, and which ones? After building the models, evaluate their performance by supplying the test data set. Compare the quality of fit and the predictive quality of each model, and then compare both qualities across the three models. Your comparisons should not be based on just one parameter; use all the statistical indicators listed above that Weka provides.
Don't forget to include all the results based on the 10-fold cross validation and the test data set for each model.

Small Project 2: Module Order Models DUE DATE: June 14, '23

This assignment involves ordering software modules based on the number of faults predicted by a software quality prediction model. Use the predictions obtained with the Linear Regression Models you built in Small Project #1. You will use only two of the three models you obtained in Small Project #1. The models to be used for this assignment are as follows:

Linear Regression Model with M5 Method of Attribute Selection:

FAULTS = - 0.0516 * NUMUORS + 0.0341 * NUMUANDS - 0.0027 * TOTOTORS - 0.0372 * VG + 0.2119 * NLOGIC + 0.0018 * LOC + 0.005 * ELOC - 0.3091

Linear Regression Model with Greedy Method of Attribute Selection:

FAULTS = - 0.0482 * NUMUORS + 0.0336 * NUMUANDS - 0.0021 * TOTOTORS - 0.0337 * VG + 0.2088 * NLOGIC + 0.0019 * LOC - 0.3255
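
The two equations above can be applied to a module's metrics as in the following Python sketch. The coefficients are copied from the equations; the names M5_MODEL, GREEDY_MODEL and predict_faults, and the metric values of example_module, are hypothetical and chosen only for illustration. A spreadsheet formula does the same job.

```python
# Coefficients copied from the two regression equations above.
M5_MODEL = {
    "intercept": -0.3091,
    "coefficients": {
        "NUMUORS": -0.0516, "NUMUANDS": 0.0341, "TOTOTORS": -0.0027,
        "VG": -0.0372, "NLOGIC": 0.2119, "LOC": 0.0018, "ELOC": 0.005,
    },
}
GREEDY_MODEL = {
    "intercept": -0.3255,
    "coefficients": {
        "NUMUORS": -0.0482, "NUMUANDS": 0.0336, "TOTOTORS": -0.0021,
        "VG": -0.0337, "NLOGIC": 0.2088, "LOC": 0.0019,
    },
}


def predict_faults(model, module):
    """Predicted FAULTS = intercept + sum(coefficient * metric value)."""
    return model["intercept"] + sum(
        coef * module[name] for name, coef in model["coefficients"].items()
    )


# Hypothetical module; these metric values are NOT from the CCCS data.
example_module = {
    "NUMUORS": 10, "NUMUANDS": 20, "TOTOTORS": 100, "TOTOPANDS": 80,
    "VG": 5, "NLOGIC": 2, "LOC": 200, "ELOC": 150,
}
m5_prediction = predict_faults(M5_MODEL, example_module)
greedy_prediction = predict_faults(GREEDY_MODEL, example_module)
```

Note that a linear model can produce fractional or even negative predictions; for module ordering only the relative order of the predictions matters.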

Obtain the predictions for both the fit data set and the test data set using the above two models. Perform Module Order Modeling (MOM) for both the fit and test data sets using both regression models.
Compare the performance of MOM for the two linear regression models. Plot an Alberg diagram and a performance curve for each model, using the fit and test data sets.
Use tables to summarize the results of MOM. Also provide an analysis of your summary.
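
The following Python sketch shows one way the MOM quantities could be tabulated. It assumes the usual definitions: at a cutoff c, the Alberg diagram compares the share of actual faults captured by the top c fraction of modules under a perfect ordering with the share captured under the model's ordering, and the performance is the ratio of the two. Check these definitions against the assigned references before relying on them. The function name module_order_model and the sample values are hypothetical.

```python
def module_order_model(actual, predicted, cutoffs):
    """Tabulate Alberg-diagram points and performance for each cutoff.

    actual    -- actual fault counts per module
    predicted -- predicted fault counts for the same modules
    cutoffs   -- fractions of modules to select, e.g. [0.05, 0.10, ...]
    """
    n = len(actual)
    total_faults = sum(actual)
    # Perfect ordering: modules ranked by their actual number of faults.
    perfect = sorted(actual, reverse=True)
    # Model ordering: actual faults listed in order of decreasing prediction.
    pairs = sorted(zip(predicted, actual), key=lambda t: t[0], reverse=True)
    by_model = [a for _, a in pairs]

    rows = []
    for c in cutoffs:
        m = max(1, round(c * n))  # number of modules above the cutoff
        g = sum(perfect[:m])       # faults captured by the perfect ordering
        g_hat = sum(by_model[:m])  # faults captured by the model ordering
        rows.append({
            "cutoff": c,
            "modules": m,
            "perfect_share": g / total_faults,
            "model_share": g_hat / total_faults,
            "performance": g_hat / g if g else float("nan"),
        })
    return rows


# Made-up values for four modules, for illustration only.
table = module_order_model([5, 0, 3, 1], [1.0, 0.5, 4.0, 0.2], [0.25, 0.5, 1.0])
```

Plotting perfect_share and model_share against the cutoff gives the two curves of an Alberg diagram, and plotting performance against the cutoff gives the performance curve.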

NOTE:

You should NOT start working on this assignment until you have read References #4 and #7.

You DO NOT need to use Weka or any other software tool to do this homework. Any spreadsheet program like Microsoft Excel will suffice.

Small Project 3: Using the General Classification Rule to classify modules as fault-prone or not-fault-prone. DUE DATE: June 20, '23
1. Convert the FIT and TEST datasets.
2. Build a Logistic Regression model using the FIT data and the Weka data mining tool.
At this point, you are DONE with Weka. You will not use Weka for the following steps!
3. Use the General Classification Rule to classify instances.

Your report should include:
i. Introductory information (what are we doing, and how are we doing it?)
ii. Tables and graphs summarizing your results.
iii. The model you selected and the justification for that selection.
iv. In-depth discussion of the results, and meaningful conclusions.

NOTES:
1. The datasets you create in step 1 should NOT have both a FAULTS and a CLASS column. You are replacing the FAULTS column, not adding an additional column.
2. Remember, you are ONLY using Weka for step 2. Once you have the coefficients and intercept of the Logistic Regression model, you are done with Weka.
3. The use of a spreadsheet program (for example, Excel) will help you greatly in completing step 3.
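
Step 3 can be sketched in Python as follows, although a spreadsheet works equally well. The sketch assumes one common form of the rule: a module is classified as fault-prone (fp) when the odds p/(1-p) of being fault-prone are at least a chosen constant c, and not-fault-prone (nfp) otherwise. It also assumes the coefficients you take from Weka model the probability of the fp class, which you should verify in your own output. All function names, coefficients and values of c below are placeholders, not results.

```python
import math


def fault_prone_probability(intercept, coefficients, module):
    """Logistic model: p = 1 / (1 + exp(-(intercept + sum(coef * metric))))."""
    z = intercept + sum(coef * module[name] for name, coef in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))


def classify(p_fp, c):
    """Assumed rule: fp if the odds p/(1-p) are at least c, else nfp."""
    if p_fp >= 1.0:
        return "fp"
    return "fp" if p_fp / (1.0 - p_fp) >= c else "nfp"


def error_rates(actual_classes, predicted_classes):
    """Type I: nfp modules classified as fp. Type II: fp classified as nfp."""
    pairs = list(zip(actual_classes, predicted_classes))
    n_nfp = sum(1 for a, _ in pairs if a == "nfp")
    n_fp = sum(1 for a, _ in pairs if a == "fp")
    type_1 = sum(1 for a, p in pairs if a == "nfp" and p == "fp") / n_nfp
    type_2 = sum(1 for a, p in pairs if a == "fp" and p == "nfp") / n_fp
    return type_1, type_2


# Placeholder model and module; NOT coefficients obtained from Weka.
placeholder_intercept = -2.0
placeholder_coefficients = {"LOC": 0.01}
p = fault_prone_probability(placeholder_intercept, placeholder_coefficients, {"LOC": 300})
label = classify(p, c=1.0)
```

Repeating the classification for several values of c and recording the Type I and Type II error rates for each gives the table from which a preferred model can be selected and justified.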

Assignments

For these assignments, students are required to submit a 150-word (min.) to 300-word (max.) summary report of each student presentation to be made in the class.
Please refer to the section on guest speaker presentations on the References page.

The due date for each assignment will be the next class following each presentation.

All students can send a .pdf file to: khoshgof@fau.edu

Exams

Take-home Exam: DUE DATE: June 3, '23 - 1:00 pm or earlier
CLICK HERE FOR THE MID-TERM EXAM!!!
Exam # 2: DATE: June 16, '23

Grading

The grading will be based solely on your reports, so make sure you not only work hard but also document your work. I will be looking for the following things:

Well performed experiments following a good methodology
Good presentation of the experiments (Tabulate the results)
Synthetic analysis and comparison of the results of the different models
Conclusions

A printout of the results without any explanation, comparison or conclusion is not considered sufficient.

Due Dates

Due dates are not flexible. Please submit your reports by 11:59 pm on the due date or sooner. Some experiments can be very time consuming, especially if you have a slow computer. Therefore, you are advised not to put off your work until the very last day. Late submissions will not be accepted under any circumstances.

References & Resources

References:

A number of pertinent research papers related to the coursework are now available under the References page. Students are encouraged to access and study these papers in order to follow the material being covered or to be covered later in the course.

WEKA Tool:

WEKA, developed by researchers at the University of Waikato, New Zealand, stands for Waikato Environment for Knowledge Analysis.
Weka is open-source software issued under the GNU General Public License, and can be downloaded for free from http://www.cs.waikato.ac.nz/ml/weka/index.html.
For your projects, you should download the file weka-3-2-3jre.exe (11,305,621 bytes, including the Java Runtime Environment), which can be found under the "Note for Windows users" heading on the above-mentioned page.
You can also refer to the Weka Tutorial if you need assistance with the tool.

Datasets:

Weka requires datasets to be in ".arff" format. We have two datasets: the Fit dataset (FIT.arff), for building the models, and the Test dataset, for evaluating their performance on fresh data. The data sets are from a project labeled CCCS. You can download them using the following links:

The metrics of the CCCS data set are listed below:

Number of unique operators (NUMUORS)
Number of unique operands (NUMUANDS)
Total number of operators (TOTOTORS)
Total number of operands (TOTOPANDS)
McCabe's cyclomatic complexity (VG)
Number of logical operators (NLOGIC)
Lines of code (LOC)
Executable lines of code (ELOC)
Number of faults (FAULTS)
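
If you want to work with the data outside Weka (for example, to prepare it for a spreadsheet), an ".arff" file with only numeric attributes can be read with a few lines of Python. The sketch below is a minimal reader, not a full ARFF parser: it ignores quoting, nominal attributes and sparse rows. The function name parse_numeric_arff, the relation name, the attribute order and the two data rows in the sample are assumptions made for illustration and are not taken from FIT.arff.

```python
def parse_numeric_arff(text):
    """Parse a dense, numeric-only ARFF document into (attributes, rows).

    Each row is returned as a dict mapping attribute name to float.
    """
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        if not in_data:
            lower = line.lower()
            if lower.startswith("@attribute"):
                attributes.append(line.split()[1])
            elif lower.startswith("@data"):
                in_data = True
            continue
        values = [float(v) for v in line.split(",")]
        rows.append(dict(zip(attributes, values)))
    return attributes, rows


# Hypothetical sample in ARFF layout; the rows are invented values.
SAMPLE = """\
% illustrative sample only
@relation CCCS
@attribute NUMUORS numeric
@attribute NUMUANDS numeric
@attribute TOTOTORS numeric
@attribute TOTOPANDS numeric
@attribute VG numeric
@attribute NLOGIC numeric
@attribute LOC numeric
@attribute ELOC numeric
@attribute FAULTS numeric
@data
10,20,100,80,5,2,200,150,1
4,6,30,25,1,0,40,30,0
"""
attributes, rows = parse_numeric_arff(SAMPLE)
```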

This page is maintained by Dr. Taghi M. Khoshgoftaar
Updated: 05.29.2023