C5.0 Decision Trees


The C5.0 Decision Trees tool is an implementation of Quinlan's C5.0 algorithm, based on the code available at https://www.rulequest.com/ and on the C5.0 R package, whose source code is available at https://github.com/topepo/C5.0. It is composed of two phases: the Training step and the Classification step.
The Training step consists of building the decision tree classifier from a set of samples contained in a data set (to select the samples, see Sample Selection Component).
The Classification step consists of using the decision tree (or rule based model) built in the Training step to classify a data set.


Using the interface

The tool is accessed through:

GeoDMA > (...) C5.0 Decision Trees

In the Training Tab:

  1. Vector Layer:  select the layer containing the samples that will be used to train the algorithm.
  2. Label Column:  select the column containing the class labels of the samples that will be used to train the algorithm.
  3. Boosting: defines whether adaptive boosting will be used. When activated, the algorithm generates several classifiers (either decision trees or rule sets) rather than just one. When a new case is to be classified, each classifier votes for its predicted class and the votes are counted to determine the final class. (For more detailed information, access this page)
  4. Trials: defines the number of boosting iterations. The minimum value is 2, since 1 means boosting is not used.
  5. Global Pruning: defines whether the final global pruning step will be applied to simplify the tree. (For more detailed information, access this page)
  6. Minimum Cases: defines the smallest number of samples that must be put in at least two branches of a split.
  7. Training Samples: a value between 0 and 0.99 that defines the proportion of samples, chosen at random, that will be used to train the model. Samples not used for training will be used to evaluate the accuracy of the model.
  8. Advanced Parameters: opens the window used to set the advanced parameters of the algorithm. These parameters are described below:
    • Winnow: defines whether a previous feature selection phase will be performed before building the model. (For more detailed information, access this page)
    • Early Stopping: defines whether the internal method for stopping boosting should be used.
    • Decompose Rules: defines whether the model will be generated as a rule set based model instead of a decision tree. (For more detailed information, access this page)
    • Fuzzy Thresholds: defines whether to evaluate possible advanced splits of the data. See Quinlan (1993) for details and examples.
    • Subset: defines whether the model evaluates groups of discrete predictors for splits. Note: the C5.0 command line version defaults this parameter to FALSE, meaning no attempted groupings will be evaluated during the tree growing stage. (For more detailed information, access this page)
    • Confidence Factor: a number in (0, 1) that indicates the confidence with which predictions are made. (Note: the boosting option employs an artificial weighting of the training cases; if it is used, the confidence may not reflect the true accuracy of the rule.)
    • In the Advanced Parameters window, click Ok to save the parameters, or click Cancel to return to the Training tab without saving them.
  9. Input Features: in the Available Features group, select the features to be used in the training process. Then click (...) to move the selected features to the Used Features group. To remove features from the Used Features group, click (...). To move all the features from the Available Features group to the Used Features group, click (...). To move all the features from the Used Features group back to the Available Features group, click (...).
    Only the features in the Used Features group will be used to train the model.
  10. Click Run to perform the operation or Close to close the interface.
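
As a rough illustration of two of the options above, the following Python sketch shows how a Training Samples proportion could split the samples at random, and how the classifiers produced by Boosting could vote on a class. The function names `split_samples` and `boosted_vote` and the class names are hypothetical and are not part of GeoDMA or C5.0. Counting every vote equally and breaking ties by first occurrence are simplifying assumptions of this sketch, not a description of the tool's internals.

```python
import random
from collections import Counter


def split_samples(samples, proportion, seed=None):
    """Randomly split samples into (training, evaluation) lists.

    `proportion` plays the role of the Training Samples value (0 to 0.99).
    """
    if not 0 <= proportion <= 0.99:
        raise ValueError("proportion must be between 0 and 0.99")
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * proportion)
    # Samples before the cut train the model; the rest evaluate it.
    return shuffled[:cut], shuffled[cut:]


def boosted_vote(predictions):
    """Return the class predicted by most classifiers (one vote each)."""
    # Assumption: unweighted votes; ties go to the class seen first.
    return Counter(predictions).most_common(1)[0][0]


samples = list(range(100))
train, evaluation = split_samples(samples, 0.7, seed=42)
print(len(train), len(evaluation))  # 70 30

# Hypothetical predictions from 5 boosting trials for a single case.
print(boosted_vote(["forest", "water", "forest", "urban", "forest"]))  # forest
```

With Trials set to 5, for example, five classifiers would each contribute one prediction to a list like the one passed to `boosted_vote`.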

In the Results Window:

  1. Output Text: displays the output of the C5.0 algorithm, containing the output tree or rule based model, the accuracies in the training and test phases, and the confusion matrices of the training and test phases.
  2. Tree: displays the contents of the generated .tree file (the decision tree file).
  3. Rules: displays the contents of the generated .rules file (the rule set based file).
  4. Output Files: Select the output files to save the generated model. Click (...) to select the output file. It will generate three different files:
    • <output file>.txt: contains the output of the C5.0 algorithm.
    • <output file>.tree or .rules: contains the decision tree or the rule set based model. NOTE: only one of these files is saved. If the .tree file is generated, the .rules file is not.
    • <output file>.names: contains the features used to train the model.
  5. Click Back to go back to the Training/Classification window or Close to close the interface.
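
The naming of the three output files can be sketched in Python as follows. This is only an illustration: the helper `output_paths` and the example path are hypothetical, and the sketch assumes that each extension is appended to the chosen output file name, which may differ from what GeoDMA actually does.

```python
from pathlib import Path


def output_paths(output_file, rule_based=False):
    """Return the three file paths derived from the chosen output file."""
    base = Path(output_file)
    # Only one model file is written: .rules for a rule set, .tree otherwise.
    model_suffix = ".rules" if rule_based else ".tree"
    return [
        base.with_name(base.name + ".txt"),        # output of the C5.0 algorithm
        base.with_name(base.name + model_suffix),  # decision tree or rule set
        base.with_name(base.name + ".names"),      # features used in training
    ]


# Hypothetical output file chosen in the Results window.
for path in output_paths("results/landcover", rule_based=True):
    print(path)
```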

In the Classification Tab:

  1. Input Tree File:  select the file containing the tree model. It can be a .tree or a .rules file. If you have just run the training process, the model is already in memory and does not need to be reloaded, unless you want to use a different model.
  2. Vector Layer:  select the layer to be classified. This layer must contain columns with the same names as the attributes used in the tree.
  3. Class Column:  define the name of the new column where the predicted classes will be saved.
  4. Repository: define the repository to save the classification. Click (...) to save into an ESRI Shape File or (...) to save into a database. If you want to save in the same input layer, you don't need to select a repository here. Just don't change the output layer name.
  5. Layer Name:  define the name of the layer where the classification data will be saved. If you want to save in the same input layer, just don't change the name here. Otherwise, set here the name of the new layer to be created.
  6. CSV File (Optional):  Define an output CSV file if you want to save in this format. Click (...) to select the output CSV file.
  7. Click Run to perform the operation or Close to close the interface.
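
Because the layer to be classified must contain the same column names as the attributes used in the tree, it can help to compare the two lists before running the classification. The Python sketch below shows one way to do that check outside the tool. The helper `missing_columns` and all feature and column names are hypothetical, and the case-sensitive comparison is an assumption.

```python
def missing_columns(model_features, layer_columns):
    """List the model features that have no identically named layer column."""
    available = set(layer_columns)
    return [name for name in model_features if name not in available]


# Hypothetical feature and column names, for illustration only.
model_features = ["mean_band1", "mean_band2", "area"]
layer_columns = ["id", "mean_band1", "area", "perimeter"]

missing = missing_columns(model_features, layer_columns)
if missing:
    print("Cannot classify; missing columns:", ", ".join(missing))
else:
    print("All model features are present in the layer.")
```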