## Assignment 13

## Learning Objectives

- implement the *k-NN* algorithm
- predict a nominal feature
- evaluate classification accuracy

## Data Files & Tools

## Tasks

- (0 pts) Download the data set Glass Identification Database along with its explanation. Note that the data file does not contain header names; you may wish to add those. The description of each column can be found in the data set explanation.
- (0 pts) Explore the data set as you see fit to get a sense of the data and become comfortable with it. What are the distributions? Is there skewness? Do you need to transform the data? Are there outliers?
- (2 pts) Create a histogram of the *Na* column and overlay a normal curve; visually determine whether the data is normally distributed. You may use the code from this tutorial. Does the *k-NN* algorithm require normally distributed data, or is it a non-parametric method? Comment on your findings.
- (3 pts) Identify outliers; describe your identification approach, state how many outliers were removed and from which columns, and explain why.
- (5 pts) After removing the ID column (column 1), normalize the first two columns in the data set using *min-max normalization*.
- (5 pts) Normalize the remaining columns, except the last one, using *z*-score standardization. The last column is the glass type, so it is excluded.
- (10 pts) The data set is sorted, so creating a validation data set requires random selection of elements. Create a stratified sample in which you randomly select 50% of the cases of each glass type to form the validation data set. The remaining cases form the training data set.
- (30 pts) Implement the *k-NN* algorithm in R (do not use an implementation of *k-NN* from a package) and use your algorithm with *k* = 10 to predict the glass type for the following two cases:

| RI      | Na    | Mg   | Al   | Si    | K    | Ca    | Ba   | Fe   |
|---------|-------|------|------|-------|------|-------|------|------|
| 1.51621 | 12.53 | 3.48 | 1.39 | 73.39 | 0.60 | 8.55  | 0.00 | 0.05 |
| 1.5098  | 12.77 | 1.85 | 1.81 | 72.69 | 0.59 | 10.01 | 0.00 | 0.01 |

- (10 pts) Apply the *knn* function from the **class** package with *k* = 11 and redo the cases from Question (8).
- (12 pts) Determine the accuracy of the *knn* function from the **class** package with *k* = 11 by applying it against each case in the validation data set. What is the percentage of correct classifications?
- (10 pts) Determine an optimal *k* by trying all values from 5 through 11 for your own *k-NN* algorithm implementation against the cases in the validation data set. What is the optimal *k*, *i.e.*, the *k* that results in the best accuracy? Plot *k* versus accuracy.
- (5 pts) Create a plot of *k* (x-axis) versus error rate (percentage of incorrect classifications).
- (5 pts) Produce a cross-table confusion matrix showing the accuracy of the classification, using a package of your choice and a *k* of your choice.
- (3 pts) Comment on the run-time complexity of *k-NN* for classifying *w* new cases using a training data set of *n* cases having *m* features. Assume that *m* is "large". How does this algorithm behave as *w*, *n*, and *m* increase? Would this algorithm be "fast" if the training data set and the number of features are large?

## Deliverables & Submission Instructions

Submit your *.Rmd* file plus the *.nb.html* file generated by R Notebooks, combined into a zip file. Upload the zip file to Blackboard.

## Scoring

*Total Number of Earnable Points*: 100

*Approximate Time to Complete*: 4-6 hours

*Due Date*: see Calendar or Blackboard