Assignment 13
Learning Objectives
- implement the k-NN algorithm
- predict a nominal feature
- evaluate classification accuracy
Data Files & Tools
Tasks
- (0 pts) Download the data set Glass Identification Database along with its explanation. Note that the data file does not contain column headers; you may wish to add them. The description of each column can be found in the data set explanation.
- (0 pts) Explore the data set in whatever way helps you get a sense of the data and become comfortable with it. What are the distributions? Is there skewness? Do you need to transform the data? Are there outliers?
- (2 pts) Create a histogram of the Na column and overlay a normal curve; visually determine whether the data is normally distributed. You may use the code from this tutorial. Does the k-NN algorithm require normally distributed data, or is it a non-parametric method? Comment on your findings.
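One way to build the overlay with base R graphics is sketched below; it assumes the data has been loaded into a data frame named `glass` with a column `Na` (adjust the names to match your own loading code):

```r
# Density-scaled histogram of Na with a fitted normal curve on top.
hist(glass$Na, breaks = 20, probability = TRUE,
     main = "Distribution of Na", xlab = "Na (weight %)")
curve(dnorm(x, mean = mean(glass$Na), sd = sd(glass$Na)),
      col = "red", lwd = 2, add = TRUE)
```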
- (3 pts) Identify outliers; describe your identification approach, state which columns contained outliers, how many cases you removed, and why.
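One common identification approach (not the only valid one) is to flag values more than 3 standard deviations from the column mean. A sketch, again assuming a data frame `glass`:

```r
# z-score every numeric column, then find rows with any |z| > 3.
z <- scale(glass[ , sapply(glass, is.numeric)])
outlier_rows <- which(rowSums(abs(z) > 3) > 0)
glass[outlier_rows, ]   # inspect these cases before deciding to remove them
```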
- (5 pts) After removing the ID column (column 1), normalize the first two columns in the data set using min-max normalization.
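Min-max normalization rescales each value to [0, 1] via (x − min) / (max − min). A sketch, assuming `glass` already has the ID column dropped so that columns 1 and 2 are RI and Na:

```r
# Rescale a numeric vector to the [0, 1] range.
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
glass[ , 1:2] <- lapply(glass[ , 1:2], minmax)
```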
- (5 pts) Normalize the remaining columns, except the last one, using z-score standardization. The last column is the glass type and so it is excluded.
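z-score standardization replaces each value with (x − mean) / sd. A sketch under the same assumed `glass` data frame, where the last column is the glass type:

```r
# Standardize columns 3 through (last - 1); the type column is untouched.
zscore <- function(x) (x - mean(x)) / sd(x)
n <- ncol(glass)
glass[ , 3:(n - 1)] <- lapply(glass[ , 3:(n - 1)], zscore)
```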
- (10 pts) The data set is sorted, so creating a validation data set requires random selection of elements. Create a stratified sample where you randomly select 50% of each of the cases for each glass type to be part of the validation data set. The remaining cases will form the training data set.
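A stratified 50/50 split can be done by sampling half of the row indices within each glass type. A sketch, assuming the class column is named `Type` (rename to match your headers); the `set.seed` call is optional but makes the split reproducible:

```r
set.seed(123)  # reproducible random selection
val_idx <- unlist(lapply(split(seq_len(nrow(glass)), glass$Type),
                         function(i) sample(i, length(i) %/% 2)))
validation <- glass[val_idx, ]   # 50% of each type
training   <- glass[-val_idx, ]  # the remaining cases
```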
- (30 pts) Implement the k-NN algorithm in R (do not use a k-NN implementation from a package) and use your algorithm with k = 10 to predict the glass type for the following two cases:
  | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
  |---|---|---|---|---|---|---|---|---|
  | 1.51621 | 12.53 | 3.48 | 1.39 | 73.39 | 0.60 | 8.55 | 0.00 | 0.05 |
  | 1.5098 | 12.77 | 1.85 | 1.81 | 72.69 | 0.59 | 10.01 | 0.00 | 0.01 |
- (10 pts) Apply the knn function from the class package with k=11 and redo the cases from Question (8).
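The core of a hand-rolled k-NN classifier is short: compute the distance from the new case to every training case, take the k nearest, and vote. A minimal sketch (Euclidean distance, majority vote; ties resolved arbitrarily), where `train` is a numeric feature matrix, `labels` holds its classes, and `case` is one new case as a numeric vector:

```r
knn_predict <- function(train, labels, case, k) {
  # Euclidean distance from `case` to every training row.
  d <- sqrt(rowSums(sweep(train, 2, case)^2))
  # Labels of the k nearest neighbors.
  nearest <- labels[order(d)[1:k]]
  # Majority vote: the most frequent label among the neighbors.
  names(which.max(table(nearest)))
}
```

Remember to normalize the two new cases with the same min-max and z-score parameters computed from the training data before predicting.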
- (12 pts) Determine the accuracy of the knn function with k=11 from the class package by applying it against each case in the validation data set. What is the percentage of correct classifications?
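A sketch of the accuracy computation, assuming `training` and `validation` data frames whose last column is the `Type` class (rename to match your own split):

```r
library(class)
# class::knn predicts all validation cases in one call.
pred <- knn(train = training[ , -ncol(training)],
            test  = validation[ , -ncol(validation)],
            cl    = training$Type, k = 11)
accuracy <- mean(pred == validation$Type) * 100  # percent correct
```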
- (10 pts) Determine an optimal k by trying all values from 5 through 11 for your own k-NN algorithm implementation against the cases in the validation data set. What is the optimal k, i.e., the k that results in the best accuracy? Plot k versus accuracy.
- (5 pts) Create a plot of k (x-axis) versus error rate (percentage of incorrect classifications).
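The two plots above can come from one loop. A sketch, assuming your own classifier is available as a function `knn_predict(train, labels, case, k)` and that `train_mat`, `train_labels`, and a `validation` data frame with a `Type` column exist (all of these names are placeholders for your own objects):

```r
ks <- 5:11
acc <- sapply(ks, function(k) {
  preds <- apply(as.matrix(validation[ , -ncol(validation)]), 1,
                 function(case) knn_predict(train_mat, train_labels, case, k))
  mean(preds == validation$Type) * 100
})
plot(ks, acc, type = "b", xlab = "k", ylab = "Accuracy (%)")
plot(ks, 100 - acc, type = "b", xlab = "k", ylab = "Error rate (%)")
```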
- (5 pts) Produce a cross-table confusion matrix showing the accuracy of the classification using a package of your choice and a k of your choice.
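One package option is `CrossTable` from gmodels; a sketch, assuming `pred` holds your predicted labels and `validation$Type` the true ones:

```r
library(gmodels)
# Rows: actual type; columns: predicted type.
CrossTable(x = validation$Type, y = pred, prop.chisq = FALSE)
```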
- (3 pts) Comment on the run-time complexity of the k-NN for classifying w new cases using a training data set of n cases having m features. Assume that m is "large". How does this algorithm behave as w, n, and m increase? Would this algorithm be "fast" if the training data set and the number of features are large?
Deliverables & Submission Instructions
Submit your .Rmd file plus the .nb.html file generated by R Notebooks, combined into a single zip file. Upload the zip file to Blackboard.
Scoring
Total Number of Earnable Points: 100
Approximate Time to Complete: 4-6 hours
Due Date: see Calendar or Blackboard