k-Nearest Neighbors Bean

A k-NN learner is a lazy classification algorithm: in the training phase, it simply stores all the input data. When a new example needs to be classified, the algorithm finds the k nearest neighbors of the new example and assigns it the majority class among those neighbors. The k nearest neighbors are determined by an appropriate distance/similarity function.

Training examples are represented as pairs (x,c), where x=(x1,…,xn) is an attribute vector and c is the class assigned to x. Attributes can be discrete (taking a finite set of numeric values), categorical (taking a finite set of nominal values) or continuous (taking double values). However, the initial values are "filtered" before the k-NN algorithm starts: for discrete and categorical attributes, the possible values are converted to consecutive numbers starting from 1; for continuous attributes, the values are scaled into the [0,1] interval. The class attribute is always discrete or categorical. This kind of data is read by the k-NN bean from its input buffer. For each new example processed, the output buffer contains the learned class followed by the actual class of that particular example. This output is further "filtered", converting the numeric values back to the original values of the class attribute.
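
The filtering step above can be sketched as follows. This is an illustrative sketch, not the bean's actual filtering code; the class and method names (FilterSketch, encodeCategorical, scaleContinuous) are assumptions, and min-max scaling is assumed as the method for mapping continuous values into [0,1].

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FilterSketch {
    // Map each distinct nominal value to 1, 2, 3, ... in order of first appearance,
    // as described for discrete and categorical attributes.
    static int[] encodeCategorical(String[] values) {
        Map<String, Integer> codes = new LinkedHashMap<>();
        int[] out = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            codes.putIfAbsent(values[i], codes.size() + 1);
            out[i] = codes.get(values[i]);
        }
        return out;
    }

    // Min-max scale continuous values into the [0,1] interval (assumed scaling scheme).
    static double[] scaleContinuous(double[] values) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        double range = max - min;
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = range == 0 ? 0.0 : (values[i] - min) / range;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] codes = encodeCategorical(new String[] {"red", "blue", "red", "green"});
        double[] scaled = scaleContinuous(new double[] {10.0, 20.0, 15.0});
        System.out.println(java.util.Arrays.toString(codes));   // [1, 2, 1, 3]
        System.out.println(java.util.Arrays.toString(scaled));  // [0.0, 1.0, 0.5]
    }
}
```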

Given two examples x=(x1,…,xn) and y=(y1,…,yn), the distance between them can be defined as follows: d(x,y) = sum over i of d(xi,yi), where d(xi,yi)=0 if xi=yi and 1 if xi<>yi in the case of discrete and categorical attributes, and d(xi,yi)=|xi-yi| for continuous attributes.
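
This distance can be sketched directly from the definition. The sketch below is illustrative (the names DistanceSketch, distance and isDiscrete are assumptions, not the bean's API): a 0/1 overlap term for discrete/categorical attributes and an absolute difference for continuous ones, summed over all attributes.

```java
public class DistanceSketch {
    // isDiscrete[i] tells whether attribute i is discrete/categorical.
    // Attribute values are assumed to be pre-filtered: discrete values are
    // consecutive numbers from 1, continuous values lie in [0,1].
    static double distance(double[] x, double[] y, boolean[] isDiscrete) {
        double d = 0.0;
        for (int i = 0; i < x.length; i++) {
            if (isDiscrete[i]) {
                d += (x[i] == y[i]) ? 0.0 : 1.0; // 0 if equal, 1 if different
            } else {
                d += Math.abs(x[i] - y[i]);      // absolute difference
            }
        }
        return d;
    }

    public static void main(String[] args) {
        double[] x = {1, 0.25};
        double[] y = {2, 0.75};
        boolean[] disc = {true, false};
        System.out.println(distance(x, y, disc)); // 1.5
    }
}
```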

Implementation

The implementation of the k-NN algorithm consists of three phases: a training phase, a test phase and a run phase. In the training phase, the algorithm stores all the input data in a vector called model. During the test or run phase, k-NN computes the distances between a new example to be classified and all the examples stored in the model; the k nearest neighbors are chosen and vote for the class of the new example.

In the test phase the correct class of each new example is known, as opposed to the run phase, where it is unknown and must be output. The correct classifications available in the test phase are used to assess the correctness of the algorithm. If this is not satisfactory, the value of k can be tuned until a reasonable level of correctness is achieved; the same value is then used in the run phase. This is the only tunable parameter of the algorithm. All other parameters are determined from the input file (e.g. the number of attributes, the number of possible classes, the number of examples/records) and cannot be changed unless the input file changes. The data files provided during the test or run phases must be consistent with the input file.
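
The train/classify cycle described above can be sketched as follows. This is a minimal sketch, not the bean's actual implementation: the names KnnSketch, Example, train and classify are assumptions, and for brevity the distance assumes all attributes are continuous (an absolute-difference sum over pre-scaled values).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class KnnSketch {
    static class Example {
        double[] attrs; int cls;
        Example(double[] a, int c) { attrs = a; cls = c; }
    }

    final List<Example> model = new ArrayList<>(); // training just stores examples
    final int k;
    KnnSketch(int k) { this.k = k; }

    void train(double[] attrs, int cls) { model.add(new Example(attrs, cls)); }

    // Find the k nearest stored examples and return their majority class.
    int classify(double[] x) {
        List<Example> sorted = new ArrayList<>(model);
        sorted.sort(Comparator.comparingDouble((Example e) -> dist(e.attrs, x)));
        Map<Integer, Integer> votes = new HashMap<>();
        int best = -1, bestCount = -1;
        for (int i = 0; i < Math.min(k, sorted.size()); i++) {
            int c = sorted.get(i).cls;
            int n = votes.merge(c, 1, Integer::sum);
            if (n > bestCount) { bestCount = n; best = c; }
        }
        return best;
    }

    // Sum of absolute differences (continuous attributes only, assumed in [0,1]).
    static double dist(double[] a, double[] b) {
        double d = 0;
        for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
        return d;
    }

    public static void main(String[] args) {
        KnnSketch knn = new KnnSketch(3);
        knn.train(new double[] {0.1, 0.1}, 1);
        knn.train(new double[] {0.2, 0.1}, 1);
        knn.train(new double[] {0.9, 0.8}, 2);
        knn.train(new double[] {0.8, 0.9}, 2);
        knn.train(new double[] {0.9, 0.9}, 2);
        System.out.println(knn.classify(new double[] {0.85, 0.85})); // 2
    }
}
```

In the test phase, the predicted class returned by classify would be compared against the known class to update the accuracy; in the run phase, it would simply be written to the output buffer.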

Properties for Inspection

The following inspectable properties are available:

k parameter
The current number of neighbors the algorithm is looking at.
numRecords
The number of records in the input file.
numAttributes
The number of attributes that each example in the input file has.
numClasses
The number of all possible classes for data in the input file.
knnMode
The k-NN mode - train, test, or run.
numTestExamples
The number of processed test/run examples.
numCorrectTestExamples
The number of correctly classified test/run examples.
currentTestExample
The array containing the current test example.
currentTrainExample
The array containing the training example that is compared to the current test/run example.
currentLearnedClass
The majority class assigned to the current test/run example.
currentActualClass
The actual class of the current test example.
accuracy
The current accuracy of the algorithm on the test data set: numCorrectTestExamples/numTestExamples.
error
The current error of the algorithm: 1-accuracy.