k Nearest Neighbors Bean
algorithm. In the training phase, it simply stores all the input data. When a new
example needs to be classified, the algorithm finds the k nearest neighbors of
the new example and assigns it the majority class among them. The k nearest
neighbors are determined by an appropriate distance/similarity function.
Training examples are represented as pairs (x, c), where x = (x1, …, xn) is an
attribute vector and c is the class assigned to x. Attributes can be
discrete (taking a finite set of numeric values), categorical (taking a finite set of
nominal values) or continuous (taking double values). However, the
initial values are "filtered" before the k-NN algorithm starts. Thus, for
discrete and categorical attributes, the possible values are converted to
consecutive numbers starting from 1. For continuous attributes, the values are
scaled into the [0,1] interval. The class attribute is always
discrete or categorical. This kind of data is read by the k-NN bean from its
input buffer. For each new example processed, the output buffer contains the
learned class followed by the actual class of that particular example. This
output is further "filtered" and the numeric values are converted back
to the original values of the class attribute.
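The filtering step described above can be sketched as follows. This is a minimal illustration, not the bean's actual filter code; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the input "filtering": nominal values are mapped to
// consecutive numbers starting from 1, continuous values are scaled to [0,1].
public class KnnFilter {

    /** Map each distinct nominal value to a consecutive number starting from 1. */
    public static Map<String, Integer> encodeCategorical(String[] values) {
        Map<String, Integer> codes = new LinkedHashMap<>();
        for (String v : values) {
            codes.putIfAbsent(v, codes.size() + 1);   // first unseen value gets next code
        }
        return codes;
    }

    /** Min-max scale a continuous attribute into the [0,1] interval. */
    public static double[] scaleToUnit(double[] values) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        double range = max - min;
        double[] scaled = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            // If all values are equal, map them to 0 to avoid division by zero.
            scaled[i] = (range == 0) ? 0.0 : (values[i] - min) / range;
        }
        return scaled;
    }
}
```

The inverse of the categorical mapping is what the output "filter" applies to convert learned class numbers back to the original nominal values.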
Given two examples x = (x1, …, xn) and y = (y1, …, yn), the distance between
them can be defined as d(x, y) = sum over i of d(xi, yi), where
d(xi, yi) = 0 if xi = yi and 1 if xi ≠ yi in the case of discrete and
categorical attributes, and d(xi, yi) = |xi − yi| for continuous attributes.
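The distance above can be sketched directly. This is an illustrative helper, not the bean's own code; the flag array marking which attributes are continuous is an assumption of this sketch.

```java
// Hypothetical sketch of the distance defined above: 0/1 mismatch for
// discrete/categorical attributes, absolute difference for continuous ones
// (which are assumed to be already scaled into [0,1]).
public class KnnDistance {

    /** continuous[i] is true when attribute i is continuous. */
    public static double distance(double[] x, double[] y, boolean[] continuous) {
        double d = 0.0;
        for (int i = 0; i < x.length; i++) {
            if (continuous[i]) {
                d += Math.abs(x[i] - y[i]);        // d(xi, yi) = |xi - yi|
            } else {
                d += (x[i] == y[i]) ? 0.0 : 1.0;   // 0 if equal, 1 otherwise
            }
        }
        return d;
    }
}
```

Because continuous attributes are scaled to [0,1], each attribute, whatever its type, contributes at most 1 to the total distance.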
Implementation
The implementation of the k-NN algorithm consists of three phases: a training
phase, a test phase and a run phase. In the training phase, the
algorithm stores all the input data in a vector called model. During the test
or run phase, k-NN computes the distances between a new example to be classified
and all the examples stored in the model. The k nearest neighbors are chosen and
asked to vote for the class of the new example. In the test phase, the correct
class of each new example is known, whereas in the run phase it is unknown and
must be output. The correct classifications given in the test phase
are used to assess the accuracy of the algorithm. If the accuracy is not satisfactory,
the k value can be tuned until a reasonable level is achieved.
The same value is then used in the run phase. This is the only parameter of the
algorithm that can be tuned. All the other parameters are determined based on
the input file (e.g. number of attributes, number of possible classes, number of
examples/records) and cannot be changed unless the input file changes. The data
files provided during test or run phases should be consistent with the input
file.
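The test/run step, ranking the stored model by distance and letting the k nearest neighbors vote, can be sketched as below. This is a self-contained illustration under the same assumptions as before (classes encoded as consecutive integers, a flag array for continuous attributes); the class and method names are hypothetical, not the bean's API.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of one test/run classification step.
public class KnnClassify {

    // Distance as defined earlier: 0/1 mismatch for discrete/categorical
    // attributes, |xi - yi| for continuous ones (scaled to [0,1]).
    static double distance(double[] x, double[] y, boolean[] continuous) {
        double d = 0.0;
        for (int i = 0; i < x.length; i++) {
            d += continuous[i] ? Math.abs(x[i] - y[i]) : (x[i] == y[i] ? 0.0 : 1.0);
        }
        return d;
    }

    /** Classify `example` by majority vote among its k nearest neighbors
     *  in the stored model; classes[i] is the class of model[i]. */
    public static int classify(double[][] model, int[] classes, double[] example,
                               boolean[] continuous, int k) {
        // Rank all stored examples by distance to the new example.
        Integer[] order = new Integer[model.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(
                i -> distance(model[i], example, continuous)));
        // Count the votes of the k nearest neighbors.
        Map<Integer, Integer> votes = new HashMap<>();
        for (int j = 0; j < Math.min(k, order.length); j++) {
            votes.merge(classes[order[j]], 1, Integer::sum);
        }
        // Return the majority class.
        int best = -1, bestVotes = -1;
        for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) { best = e.getKey(); bestVotes = e.getValue(); }
        }
        return best;
    }
}
```

In the test phase the returned class would be compared against the known class to update the accuracy; in the run phase it would simply be written to the output buffer.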
Properties for Inspection
The following inspectable properties are available:
- k parameter
- The current number of neighbors the algorithm is looking at.
- numRecords
- The number of records in the input file.
- numAttributes
- The number of attributes that each example in the input file has.
- numClasses
- The number of all possible classes for data in the input file.
- knnMode
- The k-NN mode - train, test, or run.
- numTestExamples
- The number of processed test/run examples.
- numCorrectTestExamples
- The number of correctly classified test/run examples.
- currentTestExample
- The array containing the current test example.
- currentTrainExample
- The array containing the training example that is compared to the
current test/run example.
- currentLearnedClass
- The majority class assigned to the current test/run example.
- currentActualClass
- The actual class of the current test example.
- accuracy
- The current accuracy of the algorithm on the test data set:
numCorrectTestExamples/numTestExamples.
- error
- The current error of the algorithm: 1-accuracy.