Assignment 13
Learning Objectives
- implement the k-NN algorithm
- predict a nominal feature
- evaluate classification accuracy
Data Files & Tools
Tasks
- (0 pts) Download the data set Glass Identification Database along with its explanation. Note that the data file does not contain column headers; you may wish to add them. The description of each column can be found in the data set explanation.
- (0 pts) Explore the data set in whatever way helps you get a sense of the data and become comfortable with it. What are the distributions? Is there skewness? Do you need to transform the data? Are there outliers?
- (2 pts) Create a histogram of the Na column and overlay a normal curve; visually determine whether the data is normally distributed. You may use the code from this tutorial. Does the k-NN algorithm require normally distributed data, or is it a non-parametric method? Comment on your findings.
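One way to build the overlay with base R graphics is sketched below; it assumes the data has been loaded into a data frame named `glass` with a column `Na` (adjust the names to match your own loading code):

```r
# Density-scaled histogram of Na with a fitted normal curve on top.
hist(glass$Na, breaks = 20, probability = TRUE,
     main = "Distribution of Na", xlab = "Na (weight %)")
curve(dnorm(x, mean = mean(glass$Na), sd = sd(glass$Na)),
      col = "red", lwd = 2, add = TRUE)
```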
- (3 pts) Identify outliers; describe your identification approach, state which columns contained outliers, how many cases you removed, and why.
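One common identification approach (not the only valid one) is to flag values more than 3 standard deviations from the column mean. A sketch, again assuming a data frame `glass`:

```r
# z-score every numeric column, then find rows with any |z| > 3.
z <- scale(glass[ , sapply(glass, is.numeric)])
outlier_rows <- which(rowSums(abs(z) > 3) > 0)
glass[outlier_rows, ]   # inspect these cases before deciding to remove them
```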
- (5 pts) After removing the ID column (column 1), normalize the first two columns in the data set using min-max normalization.
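Min-max normalization rescales each value to [0, 1] via (x − min) / (max − min). A sketch, assuming `glass` already has the ID column dropped so that columns 1 and 2 are RI and Na:

```r
# Rescale a numeric vector to the [0, 1] range.
minmax <- function(x) (x - min(x)) / (max(x) - min(x))
glass[ , 1:2] <- lapply(glass[ , 1:2], minmax)
```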
- (5 pts) Normalize the remaining columns, except the last one, using z-score standardization. The last column is the glass type and so it is excluded.
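z-score standardization replaces each value with (x − mean) / sd. A sketch under the same assumed `glass` data frame, where the last column is the glass type:

```r
# Standardize columns 3 through (last - 1); the type column is untouched.
zscore <- function(x) (x - mean(x)) / sd(x)
n <- ncol(glass)
glass[ , 3:(n - 1)] <- lapply(glass[ , 3:(n - 1)], zscore)
```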
- (10 pts) The data set is sorted, so creating a validation data set requires random selection of elements. Create a stratified sample where you randomly select 50% of each of the cases for each glass type to be part of the validation data set. The remaining cases will form the training data set.
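A stratified 50/50 split can be done by sampling half of the row indices within each glass type. A sketch, assuming the class column is named `Type` (rename to match your headers); the `set.seed` call is optional but makes the split reproducible:

```r
set.seed(123)  # reproducible random selection
val_idx <- unlist(lapply(split(seq_len(nrow(glass)), glass$Type),
                         function(i) sample(i, length(i) %/% 2)))
validation <- glass[val_idx, ]   # 50% of each type
training   <- glass[-val_idx, ]  # the remaining cases
```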
- (30 pts) Implement the k-NN algorithm in R (do not use a k-NN implementation from a package) and use your algorithm with k = 10 to predict the glass type for the following two cases:
  | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
  |---|---|---|---|---|---|---|---|---|
  | 1.51621 | 12.53 | 3.48 | 1.39 | 73.39 | 0.60 | 8.55 | 0.00 | 0.05 |
  | 1.5098 | 12.77 | 1.85 | 1.81 | 72.69 | 0.59 | 10.01 | 0.00 | 0.01 |
- (10 pts) Apply the knn function from the class package with k=11 and redo the cases from Question (8).
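The core of a hand-rolled k-NN classifier is short: compute the distance from the new case to every training case, take the k nearest, and vote. A minimal sketch (Euclidean distance, majority vote; ties resolved arbitrarily), where `train` is a numeric feature matrix, `labels` holds its classes, and `case` is one new case as a numeric vector:

```r
knn_predict <- function(train, labels, case, k) {
  # Euclidean distance from `case` to every training row.
  d <- sqrt(rowSums(sweep(train, 2, case)^2))
  # Labels of the k nearest neighbors.
  nearest <- labels[order(d)[1:k]]
  # Majority vote: the most frequent label among the neighbors.
  names(which.max(table(nearest)))
}
```

Remember to normalize the two new cases with the same min-max and z-score parameters computed from the training data before predicting.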
- (12 pts) Determine the accuracy of the knn function with k=11 from the class package by applying it against each case in the validation data set. What is the percentage of correct classifications?
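A sketch of the accuracy computation, assuming `training` and `validation` data frames whose last column is the `Type` class (rename to match your own split):

```r
library(class)
# class::knn predicts all validation cases in one call.
pred <- knn(train = training[ , -ncol(training)],
            test  = validation[ , -ncol(validation)],
            cl    = training$Type, k = 11)
accuracy <- mean(pred == validation$Type) * 100  # percent correct
```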
- (10 pts) Determine an optimal k by trying all values from 5 through 11 for your own k-NN algorithm implementation against the cases in the validation data set. What is the optimal k, i.e., the k that results in the best accuracy? Plot k versus accuracy.
- (5 pts) Create a plot of k (x-axis) versus error rate (percentage of incorrect classifications).
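The two plots above can come from one loop. A sketch, assuming your own classifier is available as a function `knn_predict(train, labels, case, k)` and that `train_mat`, `train_labels`, and a `validation` data frame with a `Type` column exist (all of these names are placeholders for your own objects):

```r
ks <- 5:11
acc <- sapply(ks, function(k) {
  preds <- apply(as.matrix(validation[ , -ncol(validation)]), 1,
                 function(case) knn_predict(train_mat, train_labels, case, k))
  mean(preds == validation$Type) * 100
})
plot(ks, acc, type = "b", xlab = "k", ylab = "Accuracy (%)")
plot(ks, 100 - acc, type = "b", xlab = "k", ylab = "Error rate (%)")
```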
- (5 pts) Produce a cross-table confusion matrix showing the accuracy of the classification using a package of your choice and a k of your choice.
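One package option is `CrossTable` from gmodels; a sketch, assuming `pred` holds your predicted labels and `validation$Type` the true ones:

```r
library(gmodels)
# Rows: actual type; columns: predicted type.
CrossTable(x = validation$Type, y = pred, prop.chisq = FALSE)
```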
- (3 pts) Comment on the run-time complexity of the k-NN for classifying w new cases using a training data set of n cases having m features. Assume that m is "large". How does this algorithm behave as w, n, and m increase? Would this algorithm be "fast" if the training data set and the number of features are large?
Deliverables & Submission Instructions
Submit your .Rmd file plus the .nb.html file generated by R Notebooks, combined into a single zip file. Upload the zip file to Blackboard.
Scoring
Total Number of Earnable Points: 100
Approximate Time to Complete: 4-6 hours
Due Date: see Calendar or Blackboard