Decision Tree Bean

A decision-tree learning algorithm builds a hypothesis, or decision rule, in DNF (Disjunctive Normal Form), represented as a tree graph. Each node of the tree is a test on a feature value (e.g., is feature f = value v?). Each path from the root of the tree to a leaf node corresponds to a rule formed by the conjunction of all feature tests along the path. The leaf node specifies the class assigned to an example.
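The correspondence between root-to-leaf paths and conjunctive rules can be sketched as follows. This is an illustrative example only; the tree representation and all names here are invented, not the bean's internal structures.

```python
# Illustrative sketch (not the bean's internal representation): a tiny tree
# over two categorical features, with each root-to-leaf path read off as one
# conjunctive rule of the overall DNF hypothesis.

def paths_to_rules(node, conditions=()):
    """Collect (conjunction, class) pairs, one per root-to-leaf path."""
    if not isinstance(node, dict):          # leaf: node holds the class label
        return [(conditions, node)]
    rules = []
    for value, child in node["children"].items():
        test = f"{node['feature']} = {value}"
        rules.extend(paths_to_rules(child, conditions + (test,)))
    return rules

# Tree: test 'outlook' first; on the 'sunny' branch, test 'humidity' next.
tree = {"feature": "outlook",
        "children": {"sunny": {"feature": "humidity",
                               "children": {"high": "no", "normal": "yes"}},
                     "overcast": "yes"}}

for conjunction, label in paths_to_rules(tree):
    print(" AND ".join(conjunction), "->", label)
```

Each printed line is one rule of the DNF hypothesis, e.g. "outlook = sunny AND humidity = high -> no".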

Training examples are represented as pairs (x, c), where x = (x1, …, xn) is an attribute vector and c is the class assigned to x. Features can be discrete (taking a finite set of numeric values), categorical (taking a finite set of nominal values), or continuous (taking double values). However, the initial values are "filtered" before the algorithm starts: for discrete and categorical attributes, the possible values are converted to consecutive numbers starting from 1; for continuous attributes, the values are discretized. The class attribute is always discrete or categorical. This kind of data is read by the Decision Tree bean from its input buffer. For each new example processed, the output buffer contains the learned class followed by the actual class of that particular example. This output is further "filtered" so that the numeric values are converted back to the original values of the class attribute.

A decision tree is built by recursively partitioning the training set until the examples in each partition belong to the same class. At each node the algorithm selects the feature that best divides the training set into class-uniform regions, and is then applied recursively to each region (for details, see "C4.5: Programs for Machine Learning" by J. R. Quinlan, Morgan Kaufmann, 1993).
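The recursive partitioning can be sketched in an ID3-style form, using information gain as the metric. This is a simplified illustration, not the bean's implementation: C4.5 additionally uses gain ratio, handles missing values, and prunes the tree, all of which are omitted here.

```python
# Compact ID3-style sketch of recursive partitioning with an
# information-gain metric. All names are illustrative.
import math
from collections import Counter

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_feature(examples, labels, features):
    """Pick the feature whose split yields the largest information gain."""
    def gain(f):
        remainder = 0.0
        for v in set(x[f] for x in examples):
            subset = [lab for x, lab in zip(examples, labels) if x[f] == v]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(features, key=gain)

def build_tree(examples, labels, features):
    if len(set(labels)) == 1:                 # partition is class-uniform
        return labels[0]
    if not features:                          # no tests left: majority class
        return Counter(labels).most_common(1)[0][0]
    f = best_feature(examples, labels, features)
    node = {"feature": f, "children": {}}
    for v in set(x[f] for x in examples):
        idx = [i for i, x in enumerate(examples) if x[f] == v]
        node["children"][v] = build_tree([examples[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [g for g in features if g != f])
    return node

# Tiny demo: feature 0 alone determines the class, so it is chosen first.
tree = build_tree([(1, 1), (1, 2), (2, 1), (2, 2)], ["a", "a", "b", "b"], [0, 1])
```

The recursion stops exactly as described above: either the partition is class-uniform, or no features remain and the majority class is returned.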

Implementation

The implementation of the Decision Tree algorithm consists of three phases: a training phase, a test phase, and a run phase. In the training phase, the algorithm constructs the decision tree from the training data. During the test and run phases, the algorithm uses this decision tree to classify new, unseen examples. In the test phase the correct class of each new example is known, whereas in the run phase it is not known and must be output. The correct classifications supplied in the test phase are used to assess the accuracy of the algorithm.

The algorithm has two tunable parameters: the metric parameter, which selects the metric used to find the best attribute at each node, and the discretization parameter, which specifies the number of intervals used to discretize continuous features. All other parameters are determined by the input file (e.g., the number of attributes, the number of possible classes, and the number of examples/records) and cannot be changed unless the input file changes. The data files provided during the test and run phases should be consistent with the input file.
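The test-phase bookkeeping can be sketched as follows. The `classify` walk and the tree layout are invented for illustration; the accuracy and error formulas match the inspectable properties listed in the next section.

```python
# Illustrative test-phase sketch (names invented): classify each example by
# walking the tree, count correct classifications, and track
# accuracy = correct/total and error = 1 - accuracy.

def classify(tree, example, default=None):
    """Walk the tree; fall back to `default` for unseen feature values."""
    while isinstance(tree, dict):
        child = tree["children"].get(example[tree["feature"]])
        if child is None:
            return default
        tree = child
    return tree

tree = {"feature": 0, "children": {1: "a", 2: "b"}}
tests = [((1, 5), "a"), ((2, 7), "b"), ((2, 9), "a")]  # (example, actual class)

correct = sum(classify(tree, x) == c for x, c in tests)
accuracy = correct / len(tests)
error = 1 - accuracy
```

In run mode the actual class is unavailable, so only the learned class would be emitted and no accuracy could be computed.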

Properties for Inspection

The following inspectable properties are available:

metric parameter
The metric used to find the best attribute at each node.
discretization parameter
Represents the number of intervals used to discretize continuous features.
numRecords
The number of records in the input file.
numAttributes
The number of attributes that each example in the input file has.
numClasses
The number of all possible classes for data in the input file.
decisionTreeMode
The decision-tree mode: train, test, or run.
numTestExamples
The number of processed test/run examples.
numCorrectTestExamples
The number of correctly classified test/run examples.
currentTestExample
The array containing the current test example.
currentTrainExample
The array containing the training example that is compared to the current test/run example.
currentLearnedClass
The majority class assigned to the current test/run example.
currentActualClass
The actual class of the current test example.
accuracy
The current accuracy of the algorithm on the test data set: numCorrectTestExamples/numTestExamples.
error
The current error of the algorithm: 1-accuracy.