C5.0 Decision Trees
The C5.0 Decision Trees
tool is an implementation of Quinlan's C5.0 algorithm, based on the code available
at https://www.rulequest.com/ and on the
C5.0 R package, whose source code is available at
https://github.com/topepo/C5.0.
It is composed of two phases: the Training step
and the Classification step.
The Training step consists of building the
Decision Tree classifier from a set of samples contained in a data set (to select
the samples, see
Sample Selection Component).
The Classification step uses
the Decision Tree (or rule-based model) built in the Training step to classify
a data set.
Using the interface
The tool is accessed through:
GeoDMA > C5.0 Decision Trees
In the Training tab:
-
Vector Layer: select the layer
containing the samples that will be used to train the algorithm.
-
Label Column: select the column
containing the class labels of the samples that will be used to train the algorithm.
-
Boosting: defines whether adaptive boosting will
be used. When activated, the algorithm generates several classifiers (either decision
trees or rule sets) rather than just one. When a new case is to be classified, each
classifier votes for its predicted class and the votes are counted to determine the final
class. (For more detailed information, access
this page)
-
Trials: defines the number of boosting
iterations. The minimum value is 2, since 1 means that boosting is not used.
-
Global Pruning: defines whether the final
global pruning step will be applied to simplify the tree. (For more detailed information,
access this page )
-
Minimum Cases: defines the smallest number
of samples that must be put in at least two branches of a split.
-
Training Samples: a value between 0 and 0.99
that defines the proportion of samples, chosen at random, that is used to train the model.
Samples not used for training are used to evaluate the accuracy of the model.
-
Advanced Parameters: opens the window
used to set the advanced parameters of the algorithm. These parameters are described
below:
-
Winnow: defines whether a previous feature
selection phase will be performed before building the model. (For more detailed
information, access
this page)
-
Early Stopping: defines whether the
internal method for stopping boosting should be used.
-
Decompose Rules: defines whether the
model will be generated as a Rule Set based model instead of a Decision tree.
(For more detailed information, access
this page )
-
Fuzzy Thresholds: defines whether to
evaluate possible advanced splits of the data. See Quinlan (1993) for details
and examples.
-
Subset: defines whether the model
evaluates groups of discrete predictors for splits. Note: the C5.0 command
line version defaults this parameter to FALSE, meaning that no groupings
are evaluated during the tree growing stage. (For more detailed information,
access this page )
-
Confidence Factor: a number in (0, 1)
that indicates the confidence with which predictions are made. (Note: the boosting
option employs an artificial weighting of the training cases; if it is used,
the confidence may not reflect the true accuracy of the rule.)
-
In the Advanced Parameters window,
click Ok to save the parameters, or click Cancel
to return to the Training tab without saving them.
-
Input Features: select, in the
Available Features group, the features to be used
in the training process. After selecting the desired features, click
the button that moves the
selected features to the Used Features
group. To remove features from the Used Features
group, click the button that moves them back to the Available Features group.
To move all the features from the Available Features
group to the Used Features group, or all the features from the Used Features
group back to the Available Features group, click the corresponding
"move all" button.
In the end, the features in the Used Features group
will be used to train the model.
-
Click Run to perform the operation or Close
to close the interface.
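
Two of the parameters above, Training Samples and Boosting, can be pictured with a short Python sketch. This is only an illustration under stated assumptions, not the GeoDMA or C5.0 implementation: the function names `split_samples` and `boosted_vote` are hypothetical, the vote is a plain count as described above (the actual algorithm may weight the votes), and the behaviour for a Training Samples value of 0 is left out because this page does not describe it.

```python
import random
from collections import Counter


def split_samples(samples, training_fraction, seed=0):
    """Randomly split samples into a training part and an evaluation part.

    Hypothetical helper: it mimics the idea behind the Training Samples
    parameter, not the code used by the tool.
    """
    # The value 0 is rejected here only because its meaning is not
    # documented on this page (assumption of this sketch).
    if not 0 < training_fraction <= 0.99:
        raise ValueError("training_fraction must be in (0, 0.99]")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * training_fraction)
    # First part trains the model, the rest evaluates its accuracy.
    return shuffled[:cut], shuffled[cut:]


def boosted_vote(predictions):
    """Return the class that received the most votes.

    `predictions` holds one predicted class per boosted classifier.
    """
    return Counter(predictions).most_common(1)[0][0]


if __name__ == "__main__":
    train, held_out = split_samples(range(10), 0.7)
    print(len(train), "training samples,", len(held_out), "evaluation samples")
    print(boosted_vote(["forest", "water", "forest"]))
```

With Trials set to 3, for instance, three classifiers would each contribute one entry to the list passed to `boosted_vote`.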
In the Results Window:
-
Output Text: displays the output of the
C5.0 algorithm, containing the output tree or rule based model, the accuracies in the
training and test phases, and the confusion matrices of the training and test phases.
-
Tree: displays the contents of the
generated .tree file (the decision tree file).
-
Rules: displays the contents of the
generated .rules file (the rule set file).
-
Output Files: select where
to save the generated model. Click
the file selection button to choose the
output file. Three different files are generated:
-
<output file>.txt: contains
the output of the C5.0 algorithm.
-
<output file>.tree or .rules: contains
the decision tree or the rule set based model.
NOTE: Only one of these files is saved.
If the .tree file is generated, the .rules is not generated.
-
<output file>.names: contains
the features used to train the model.
-
Click Back to go back to the Training/Classification window or Close
to close the interface.
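
As a reading aid for the accuracies and confusion matrices shown in Output Text, the following Python sketch computes both from a list of reference classes and a list of predicted classes. It is a minimal illustration, not the code that produces the C5.0 output; the function names and the example class names are hypothetical.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count how often each actual class was assigned each predicted class.

    Rows follow the actual class, columns the predicted class, both in
    the sorted order returned as `classes`.
    """
    counts = Counter(zip(actual, predicted))
    classes = sorted(set(actual) | set(predicted))
    matrix = [[counts[(a, p)] for p in classes] for a in classes]
    return classes, matrix


def accuracy(actual, predicted):
    """Fraction of samples whose predicted class equals the actual class."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)


if __name__ == "__main__":
    actual = ["forest", "forest", "water", "water"]
    predicted = ["forest", "water", "water", "water"]
    classes, matrix = confusion_matrix(actual, predicted)
    print(classes)
    for row in matrix:
        print(row)
    print(accuracy(actual, predicted))
```

The diagonal of the matrix holds the correctly classified samples, so the accuracy is the sum of the diagonal divided by the total number of samples.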
In the Classification tab:
-
Input Tree File: select the file containing
the tree model. It can be a .tree or a .rules file. If you have just run the training process,
this file is already in memory and does not need to be reloaded, unless you want to use
a different model.
-
Vector Layer: select the layer
to be classified. The attributes used in the tree must exist in this layer
under the same column names.
-
Class Column: define the name of the
new column where the predicted classes will be saved.
-
Repository: define the repository where
the classification will be saved. Use the
corresponding button to save into an
ESRI shapefile or
into a database.
If you want to save into the input layer itself, you do not need to select a repository here;
just leave the output layer name unchanged.
-
Layer Name: define the name of the
layer where the classification data will be saved. To save into the
input layer itself, leave the name unchanged. Otherwise, enter the name of the
new layer to be created.
-
CSV File (Optional): define an output
CSV file if you also want to save the result in this format. Click
the file selection button to choose the output
CSV file.
-
Click Run to perform the operation or Close
to close the interface.
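
Because the attributes used by the model must appear in the layer under the same column names, it can help to compare the two lists before running the classification. The Python sketch below does this comparison on plain lists of names. It is a hypothetical helper, not part of GeoDMA: how the names are read from the .names file or from the layer is not shown, the example column names are invented, and an exact, case-sensitive match is assumed.

```python
def missing_columns(model_features, layer_columns):
    """Return the model features that have no column of the same name.

    Assumes an exact, case-sensitive comparison (assumption of this sketch).
    """
    available = set(layer_columns)
    return [name for name in model_features if name not in available]


if __name__ == "__main__":
    # Invented names, used only to show the call.
    features = ["mean_band1", "area", "perimeter"]
    columns = ["id", "area", "mean_band1"]
    print(missing_columns(features, columns))
```

An empty result means that every feature used by the model has a matching column in the layer.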