Late submissions will incur a penalty of 10 marks per day.
Weighting: 25% of the final course mark
Submission: When submitting the assessment, names and student IDs must be indicated
on the front page of the report.
COMP615 Assignment Two page 2 of 3
This assignment allows you to solve a real-world problem using the machine learning
workbench. The analysis and justifications of your answers carry a high proportion of the marks
awarded. Please make sure you read through the entire assignment before you start.
1 Introduction [5 marks]
You are expected to provide information on the dataset you are assigned to use for your
assignment. Provide a statement of the problem, outlining the problem that your dataset
2 Data Exploration [5 marks]
In this section of your report, you are required to discuss the dataset and any missing values or
outliers found in the dataset.
• How many features (attributes) and instances are there, and what are their data types?
• Provide summary statistics of the continuous numerical features.
• Illustrate the features of your dataset using meaningful visualisations (e.g., boxplots,
histograms, etc.).
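The checks listed above can be sketched in pandas as follows. This is only an illustration under stated assumptions: the DataFrame is synthetic, the column names (age, income, segment) are hypothetical stand-ins for your assigned dataset, and the 1.5 x IQR rule is one common outlier heuristic rather than a method the brief prescribes.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the assigned dataset (hypothetical columns).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(40, 10, 200).round(1),
    "income": rng.normal(50_000, 8_000, 200).round(0),
    "segment": rng.choice(["a", "b", "c"], 200),
})
df.loc[[3, 17], "income"] = np.nan   # inject two missing values
df.loc[5, "age"] = 150.0             # inject one obvious outlier

# Number of instances (rows), features (columns) and their data types.
n_instances, n_features = df.shape
print(n_instances, n_features)
print(df.dtypes)

# Missing values per feature.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Summary statistics of the continuous numerical features.
numeric = df.select_dtypes(include="number")
print(numeric.describe())

# Flag outliers with the 1.5 * IQR rule (an assumed heuristic).
q1 = numeric.quantile(0.25)
q3 = numeric.quantile(0.75)
iqr = q3 - q1
outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()
print(outlier_counts)
```

If matplotlib is installed, `numeric.plot(kind="box")` and `numeric.hist()` give quick boxplots and histograms of the same numeric columns.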
3 Decision Tree Classifier
You are required to build a model using the Decision Tree Classifier and answer the
following questions based on the model built. In building the model, use the 10-fold cross-validation option for testing. Your answers need to be supported by suitable evidence,
wherever appropriate. Some examples of suitable evidence are the Confusion Matrices,
Model Visualizations and Summary Statistics.
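As a rough sketch of how such evidence might be produced with 10-fold cross-validation in scikit-learn: the example below uses the library's bundled breast cancer dataset as a placeholder for your own data, and the stratified, shuffled folds with a fixed random_state are assumptions, not requirements of the brief.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; substitute the features and labels of your dataset.
X, y = load_breast_cancer(return_X_y=True)

# 10 folds; stratification and shuffling are assumed choices.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
tree = DecisionTreeClassifier(random_state=0)

# Each instance is predicted by a model that did not see it in training.
y_pred = cross_val_predict(tree, X, y, cv=cv)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y, y_pred)
print(cm)

# Per-class precision, recall and F1, plus overall accuracy.
print(classification_report(y, y_pred))
```

Because every instance falls in exactly one test fold, the confusion matrix entries sum to the number of instances in the dataset.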
a) Now build a model using the Decision Tree algorithm. Adjust two suitable
parameters (one at a time) to reduce the size of the tree and improve the accuracy of your
model. Report the accuracy score for each parameter using plots. Provide the final
optimized classification tree and describe its structure. [5 marks]
b) Describe the role in model building of the two parameters that you used in a) above.
Do you expect that using the same values obtained for this dataset will improve the accuracy
for other types of datasets? Justify your answer. [8 marks]
c) Generate and examine the Confusion Matrix carefully and explain your findings.
Provide the model summary report and discuss the metrics (accuracy, precision, recall,
[8 marks]
d) Find the feature importance based on the final classification model and explain your
findings. [4 marks]
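One possible way to approach parts a) and d) of the Decision Tree section is sketched below. The choice of max_depth and min_samples_leaf as the two parameters, the value grids, and the bundled breast cancer dataset are all assumptions made for illustration; the helper name sweep is hypothetical. Other size-limiting parameters may suit your dataset better.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder data; substitute your assigned dataset.
data = load_breast_cancer()
X, y = data.data, data.target
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)


def sweep(param_name, values, **fixed):
    """Mean 10-fold accuracy for each value of a single parameter."""
    scores = {}
    for v in values:
        model = DecisionTreeClassifier(random_state=0, **fixed, **{param_name: v})
        scores[v] = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    return scores


# Vary one parameter at a time: depth first, then leaf size at the best depth.
depth_scores = sweep("max_depth", range(1, 9))
best_depth = max(depth_scores, key=depth_scores.get)
leaf_scores = sweep("min_samples_leaf", [1, 2, 5, 10, 20], max_depth=best_depth)
best_leaf = max(leaf_scores, key=leaf_scores.get)

# Fit the final, smaller tree and report its size.
final_tree = DecisionTreeClassifier(
    max_depth=best_depth, min_samples_leaf=best_leaf, random_state=0
).fit(X, y)
print("leaves:", final_tree.get_n_leaves(), "depth:", final_tree.get_depth())

# Impurity-based feature importances, largest first.
importances = final_tree.feature_importances_
top = np.argsort(importances)[::-1][:5]
for i in top:
    print(data.feature_names[i], round(float(importances[i]), 3))
```

The dictionaries depth_scores and leaf_scores map each parameter value to its mean accuracy, so they can be passed directly to a plotting library to produce the accuracy plots requested in part a).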
4 Artificial Neural Network (ANN)
In this part, you are required to explore various architectures for building an Artificial
Neural Network (ANN). In building the model, use the 10-fold cross-validation option for testing.
a) Use an appropriate feature selection method to identify the top five most significant
features. State the method used and list the features produced. Compare this list with the one
produced in the previous section by the Decision Tree model. Identify similarities and
differences. Discuss any differences. [5 marks]
b) Use the sklearn.neural_network.MLPClassifier with a single hidden layer of k neurons
(k <= 25). Use default values for all parameters other than the number of iterations.
Determine the number of iterations that gives the highest accuracy. Use this classification
accuracy as a baseline for comparison in later parts of this question. [5 marks]
c) Enable the loss value to be shown on the training segment and track the loss as a function
of the iteration count. You will observe that the error value can increase between consecutive
iterations even when the loss value decreases. Conversely, the error value may decrease when
the loss increases between consecutive iterations. How do you explain this? [5 marks]
d) Experiment with two hidden layers and experimentally determine the split of the number of
neurons across the two layers that gives the highest classification accuracy. In part b), we
had all k neurons in a single layer; in this part we will transfer neurons from the first
hidden layer to the second iteratively, in steps of 1. Thus, for example, in the first
iteration the first hidden layer will have k-1 neurons whilst the second layer will have 1; in
the second iteration k-2 neurons will be in the first layer with 2 in the second, and so on.
Summarise your classification accuracy results in a 25 by 2 table with the first column
specifying the combination of neurons used (e.g., 12, 13) and the second column specifying the
classification accuracy. [8 marks]
e) From the table created in part d), you will observe that the accuracy varies with the split
of neurons across the two layers. Give some possible reasons for this variation. [4 marks]
5 Performance Evaluation
Compare the performance of the Decision Tree and MLP Classifiers on your dataset. Choose the
best-performing model for your dataset and explain why you have chosen it. Discuss the overall
findings from your experiments. [4 marks]
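The neuron-transfer experiment in part d) of the ANN section could be organised along the following lines. This is a minimal sketch under several assumptions: it uses the bundled iris dataset and a small k = 5 so that it runs quickly, max_iter = 300 is a placeholder for the iteration count found in part b), and the StandardScaler step is an added convenience that the brief does not ask for. The helper name neuron_splits is hypothetical.

```python
import warnings

from sklearn.datasets import load_iris
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Short runs may not converge; silence that warning for this sketch only.
warnings.filterwarnings("ignore", category=ConvergenceWarning)

# Placeholder data; substitute your assigned dataset.
X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

k = 5           # small k for speed; the brief allows any k <= 25
max_iter = 300  # placeholder for the best iteration count from part b)


def neuron_splits(k):
    """All two-layer splits (k-1, 1), (k-2, 2), ..., (1, k-1)."""
    return [(k - m, m) for m in range(1, k)]


results = []
for layers in neuron_splits(k):
    model = make_pipeline(
        StandardScaler(),  # assumed preprocessing, not required by the brief
        MLPClassifier(hidden_layer_sizes=layers, max_iter=max_iter, random_state=0),
    )
    acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    results.append((layers, acc))

# Two-column table: neuron combination, mean 10-fold accuracy.
for layers, acc in results:
    print(f"{layers[0]}, {layers[1]}\t{acc:.3f}")
```

With k = 25 the same enumeration yields 24 two-layer splits, starting at (24, 1) and ending at (1, 24). For part c), a fitted MLPClassifier exposes the per-iteration training loss in its loss_curve_ attribute, and passing verbose=True prints the loss during training.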