Introduction
The decision tree method is one of the most popular approaches in machine learning. Decision trees can easily be used to solve various classification and regression tasks. They are often appreciated for their universality and for the fact that the model obtained by learning a decision tree is easy to interpret even by a non-expert.
The universality of decision trees is a consequence of two main factors. First, the decision tree method is a non-parametric machine learning method: using it does not require knowing or assuming the probabilistic characteristics of the data it works with. Second, the decision tree method naturally handles mixtures of variables with different levels of measurement [1].
At the same time, a decision tree model is a white box: it is easy to see for which data a particular class (in a classification problem) or a particular value of the dependent variable (in a regression problem) will be predicted, and which features affect this prediction and how.
This article describes the decision tree algorithm and how Intel® Data Analytics Acceleration Library (Intel® DAAL) [2] helps optimize this algorithm when running it on systems equipped with Intel® Xeon® processors.
What is a Decision tree?
Decision trees partition the feature space into a set of axis-aligned regions (hyperrectangles) and then fit a simple model in each one. Such a simple model can be a prediction model that ignores all predictors and predicts the majority (most frequent) class (or the mean of the dependent variable for regression), also known as the 0-R or constant classifier.
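To make the idea of the constant (0-R) leaf model concrete, here is a minimal C++ sketch; the function name and types are illustrative only and are not part of any library:

    #include <algorithm>
    #include <map>
    #include <utility>
    #include <vector>

    /* 0-R (constant) classifier: ignores all features and always predicts the
       class that occurs most frequently among the training labels (labels must
       be non-empty). For regression, the analogous leaf model would return the
       mean of the dependent variable instead. */
    int majorityClass(const std::vector<int>& labels)
    {
        std::map<int, size_t> counts;
        for (int label : labels) ++counts[label];
        return std::max_element(counts.begin(), counts.end(),
                                [](const std::pair<const int, size_t>& a,
                                   const std::pair<const int, size_t>& b)
                                { return a.second < b.second; })->first;
    }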
Decision tree induction constructs a tree-like graph structure, as shown in the figure below, where each internal (non-leaf) node denotes a test on features, each branch descending from a node corresponds to an outcome of the test, and each external (leaf) node holds the simple model mentioned above.
The test is a rule, depending on feature values, that performs the partitioning of the feature space: each outcome of the test represents a region associated with both the test and one of the descending branches. If the test is a Boolean expression (e.g. f < c or f = c, where f is a feature and c is a constant fitted during decision tree induction), the induced decision tree is a binary tree, so each of its non-leaf nodes has exactly two branches ("true" and "false") according to the result of the Boolean expression. In this case, the left branch is often implicitly assumed to be associated with the "true" outcome, while the right branch is implicitly assumed to be associated with the "false" outcome.
Test selection is performed as a search through all reasonable tests to find the best one according to some criterion, called the split criterion. Widely used split criteria include the Gini index [3] and Information Gain [4] for classification, and Mean Squared Error (MSE) [3] for regression.
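As an illustration of a split criterion, the following sketch computes the Gini impurity of a node and the weighted impurity of a candidate binary split. It is a simplified example written for this article, not the library implementation, and the function names are made up:

    #include <cstddef>
    #include <vector>

    /* Gini impurity of a node, given the per-class counts of observations in it:
       G = 1 - sum_k p_k^2, where p_k is the fraction of class k in the node. */
    double giniImpurity(const std::vector<size_t>& classCounts)
    {
        size_t total = 0;
        for (size_t c : classCounts) total += c;
        if (total == 0) return 0.0;
        double sumSquares = 0.0;
        for (size_t c : classCounts)
        {
            const double p = static_cast<double>(c) / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    /* Quality of a candidate binary split: the weighted impurity of the two children.
       The test with the smallest weighted impurity (largest impurity decrease) wins. */
    double weightedGini(const std::vector<size_t>& leftCounts,
                        const std::vector<size_t>& rightCounts)
    {
        size_t nLeft = 0, nRight = 0;
        for (size_t c : leftCounts)  nLeft  += c;
        for (size_t c : rightCounts) nRight += c;
        const double n = static_cast<double>(nLeft + nRight);
        return (nLeft / n) * giniImpurity(leftCounts) + (nRight / n) * giniImpurity(rightCounts);
    }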
To improve prediction, a decision tree can be pruned [5]. Pruning techniques that are embedded in the training process are called pre-pruning, because they stop further growth of the decision tree. There are also post-pruning techniques, which replace an already fully trained decision tree with another one [5].
For instance, Reduced Error Pruning (REP), described in [5], assumes the existence of a separate pruning dataset, each observation of which is used to obtain a prediction from the original (unpruned) tree. For every non-leaf subtree, the change in mispredictions over the pruning dataset that would occur if this subtree were replaced by the best possible leaf is examined:

ΔE = E_leaf - E_subtree

where E_subtree and E_leaf are the numbers of errors (for classification) or the MSE (for regression) of the given subtree and of the best possible leaf that replaces it, respectively. If the new tree would give an equal or smaller number of mispredictions (ΔE ≤ 0) and the subtree contains no subtree with the same property, the subtree is replaced by the leaf. The process continues until any further replacement would increase mispredictions over the pruning dataset. The final tree is the most accurate subtree of the original tree with respect to the pruning dataset and is the smallest tree with that accuracy. The pruning dataset can be some fraction of the original training dataset (e.g. a randomly chosen 20% of observations), but in that case those observations must be excluded from the training dataset.
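A minimal sketch of the REP decision for a single subtree, assuming classification and assuming that the subtree's predictions on the pruning dataset have already been computed; all names here are illustrative:

    #include <cstddef>
    #include <vector>

    /* Reduced Error Pruning decision for one subtree (illustrative sketch).
       subtreePredictions[i] is the class the unpruned subtree predicts for the
       i-th pruning observation, labels[i] is the true class, and nClasses is
       the number of classes. Returns true if the subtree should be replaced by
       the best possible leaf, i.e. if Delta E = E_leaf - E_subtree <= 0. */
    bool shouldPrune(const std::vector<int>& subtreePredictions,
                     const std::vector<int>& labels,
                     int nClasses)
    {
        /* E_subtree: mispredictions of the unpruned subtree on the pruning set */
        size_t eSubtree = 0;
        for (size_t i = 0; i < labels.size(); ++i)
            if (subtreePredictions[i] != labels[i]) ++eSubtree;

        /* E_leaf: mispredictions of the best constant leaf, i.e. the class that
           is wrong least often on the pruning set */
        size_t eLeaf = labels.size();
        for (int c = 0; c < nClasses; ++c)
        {
            size_t errors = 0;
            for (int label : labels)
                if (label != c) ++errors;
            if (errors < eLeaf) eLeaf = errors;
        }
        return eLeaf <= eSubtree;  /* Delta E <= 0 */
    }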
Prediction is performed by starting at the root node of the tree, testing the features with the test specified by this node, and then moving down the branch corresponding to the outcome of the test for the given example. This process is repeated for the subtree rooted at the new node. The final result of the prediction is the prediction of the simple model at the reached leaf node.
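The following sketch shows what such a traversal can look like for a binary tree with tests of the form f < c. The node layout is a simplified illustration rather than the representation used by any particular library:

    #include <cstddef>
    #include <vector>

    /* Minimal binary decision tree node, for illustration only: a leaf stores a
       constant prediction; an internal node tests feature[featureIndex] < threshold,
       with the left child taken on "true" and the right child on "false". */
    struct Node
    {
        bool        isLeaf;
        double      prediction;    /* class label or regression value (leaf only)   */
        size_t      featureIndex;  /* index of the tested feature f (internal only) */
        double      threshold;     /* the constant c fitted during induction        */
        const Node* left;          /* "true" branch  */
        const Node* right;         /* "false" branch */
    };

    /* Prediction: start at the root, apply the node's test, follow the matching
       branch, and repeat until a leaf is reached. */
    double predict(const Node* root, const std::vector<double>& x)
    {
        const Node* node = root;
        while (!node->isLeaf)
            node = (x[node->featureIndex] < node->threshold) ? node->left : node->right;
        return node->prediction;
    }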
Applications of Decision trees
Decision trees can be used in many real-world applications [6]:
- Agriculture
- Astronomy (e.g. for filtering noise from Hubble Space Telescope images)
- Biomedical Engineering
- Control Systems
- Financial analysis
- Manufacturing and Production
- Medicine
- Molecular biology
- Object recognition
- Pharmacology
- Physics (e.g. for the detection of physical particles)
- Plant diseases (e.g. to assess the hazard of mortality to pine trees)
- Power systems (e.g. power system security assessment and power stability prediction)
- Remote sensing
- Software development (e.g. to estimate the development effort of a given software module)
- Text processing (e.g. medical text classification)
- Personal learning assistants
- Classifying sleep signals
Advantages and disadvantages of Decision trees
Using Decision trees has advantages and disadvantages [7]:
- Advantages
- Simple to understand and interpret. Have a white-box model.
- Able to handle both numerical and categorical data.
- Requires little data preparation.
- Non-statistical approach that makes no assumptions about the training data or prediction residuals; e.g., no distributional, independence, or constant-variance assumptions.
- Performs well even with large datasets.
- Mirrors human decision making more closely than other approaches.
- Robust against collinearity.
- Have built-in feature selection.
- Have value even with small datasets.
- Can be combined with other techniques.
- Disadvantages
- Trees do not tend to be as accurate as other approaches.
- Trees can be very non-robust. A small change in the training data can result in a big change in the tree, and thus a big change in final predictions.
- The problem of learning an optimal decision tree is known to be NP-complete under several aspects of optimality and even for simple concepts. Consequently, practical decision-tree learning algorithms are based on heuristics such as the greedy algorithm where locally-optimal decisions are made at each node.
- Decision-tree learners can create over-complex trees that do not generalize well from the training data. Mechanisms such as pruning are necessary to avoid this problem.
- There are concepts that are hard to learn because decision trees do not express them easily, such as XOR, parity or multiplexer problems. In such cases, the decision tree becomes prohibitively large.
Intel® Data Analytics Acceleration Library
Intel® DAAL is a library consisting of many basic building blocks optimized for data analytics and machine learning. These building blocks are highly optimized for the latest features of the latest Intel® processors. More about Intel® DAAL can be found in [2]. Intel® DAAL provides Decision tree classification and regression algorithms.
Using Decision trees in Intel® Data Analytics Acceleration Library
This section shows how to invoke Decision tree classification and regression using Intel® DAAL.
Do the following steps to invoke the Decision tree classification algorithm from Intel® DAAL:
1. Ensure that Intel® DAAL is installed and the environment is prepared. See details in [8, 9, 10] according to your operating system.
2. Include the header file daal.h into your application:

    #include <daal.h>

3. To simplify usage of Intel® DAAL namespaces, use the following using directives:

    using namespace daal;
    using namespace daal::algorithms;

4. We assume that the training, pruning, and testing datasets are stored in .csv files. Read the first two of them into Intel® DAAL numeric tables:

    const size_t nFeatures = 5; /* Number of features in training and testing data sets */

    /* Initialize FileDataSource<CSVFeatureManager> to retrieve the input data from a .csv file */
    FileDataSource<CSVFeatureManager> trainDataSource("train.csv",
        DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);

    /* Create Numeric Tables for training data and labels */
    NumericTablePtr trainData(new HomogenNumericTable<>(nFeatures, 0, NumericTable::notAllocate));
    NumericTablePtr trainGroundTruth(new HomogenNumericTable<>(1, 0, NumericTable::notAllocate));
    NumericTablePtr mergedData(new MergedNumericTable(trainData, trainGroundTruth));

    /* Retrieve the data from the input file */
    trainDataSource.loadDataBlock(mergedData.get());

    /* Initialize FileDataSource<CSVFeatureManager> to retrieve the pruning input data from a .csv file */
    FileDataSource<CSVFeatureManager> pruneDataSource("prune.csv",
        DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);

    /* Create Numeric Tables for pruning data and labels */
    NumericTablePtr pruneData(new HomogenNumericTable<>(nFeatures, 0, NumericTable::notAllocate));
    NumericTablePtr pruneGroundTruth(new HomogenNumericTable<>(1, 0, NumericTable::notAllocate));
    NumericTablePtr pruneMergedData(new MergedNumericTable(pruneData, pruneGroundTruth));

    /* Retrieve the data from the pruning input file */
    pruneDataSource.loadDataBlock(pruneMergedData.get());

5. Create an algorithm object to train the model:

    const size_t nClasses = 5; /* Number of classes */

    /* Create an algorithm object to train the Decision tree model */
    decision_tree::classification::training::Batch<> algorithm1(nClasses);

6. Pass the training data and labels together with the pruning data and labels to the algorithm:

    /* Pass the training data set, labels, and pruning dataset with labels to the algorithm */
    algorithm1.input.set(classifier::training::data, trainData);
    algorithm1.input.set(classifier::training::labels, trainGroundTruth);
    algorithm1.input.set(decision_tree::classification::training::dataForPruning, pruneData);
    algorithm1.input.set(decision_tree::classification::training::labelsForPruning, pruneGroundTruth);

7. Train the model, where algorithm1 is the variable defined in step 5:

    /* Train the Decision tree model */
    algorithm1.compute();

8. Store the result of training in a variable:

    decision_tree::classification::training::ResultPtr trainingResult = algorithm1.getResult();

9. Read the testing dataset from the corresponding .csv file:

    /* Initialize FileDataSource<CSVFeatureManager> to retrieve the test data from a .csv file */
    FileDataSource<CSVFeatureManager> testDataSource("test.csv",
        DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);

    /* Create Numeric Tables for testing data and labels */
    NumericTablePtr testData(new HomogenNumericTable<>(nFeatures, 0, NumericTable::notAllocate));
    NumericTablePtr testGroundTruth(new HomogenNumericTable<>(1, 0, NumericTable::notAllocate));
    NumericTablePtr testMergedData(new MergedNumericTable(testData, testGroundTruth));

    /* Retrieve the data from the input file */
    testDataSource.loadDataBlock(testMergedData.get());

10. Create an algorithm object to test the model:

    /* Create an algorithm object for Decision tree prediction with the default method */
    decision_tree::classification::prediction::Batch<> algorithm2;

11. Pass the testing data and the trained model to the algorithm:

    /* Pass the testing data set and trained model to the algorithm */
    algorithm2.input.set(classifier::prediction::data, testData);
    algorithm2.input.set(classifier::prediction::model, trainingResult->get(classifier::training::model));

12. Test the model:

    /* Compute prediction results */
    algorithm2.compute();

13. Retrieve the results of the prediction:

    /* Retrieve algorithm results */
    classifier::prediction::ResultPtr predictionResult = algorithm2.getResult();
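To inspect the predicted labels, the prediction result can be queried for its numeric table and read through a block descriptor. The snippet below follows the same namespace assumptions as the steps above; check the result identifier and table-access calls against the DAAL documentation for your version:

    /* Retrieve the numeric table with predicted labels */
    NumericTablePtr predictedLabels = predictionResult->get(classifier::prediction::prediction);

    /* Read the predictions through a block descriptor */
    BlockDescriptor<double> block;
    predictedLabels->getBlockOfRows(0, predictedLabels->getNumberOfRows(), readOnly, block);
    const double* values = block.getBlockPtr();
    /* ... compare values[i] with the corresponding rows of testGroundTruth ... */
    predictedLabels->releaseBlockOfRows(block);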
For Decision tree regression, steps 1-4, 7, 9, and 12 are the same as for classification, while the others are very similar:
1.-4. These steps are identical to steps 1-4 for classification above: ensure that Intel® DAAL is installed and the environment is prepared [8, 9, 10], include daal.h, add the using directives, and read the training and pruning datasets into Intel® DAAL numeric tables.

5. Create an algorithm object to train the model:

    /* Create an algorithm object to train the Decision tree model */
    decision_tree::regression::training::Batch<> algorithm1;

6. Pass the training data and dependent variables together with the pruning data and dependent variables to the algorithm:

    /* Pass the training data set, dependent variables, and pruning dataset with dependent variables to the algorithm */
    algorithm1.input.set(decision_tree::regression::training::data, trainData);
    algorithm1.input.set(decision_tree::regression::training::dependentVariables, trainGroundTruth);
    algorithm1.input.set(decision_tree::regression::training::dataForPruning, pruneData);
    algorithm1.input.set(decision_tree::regression::training::dependentVariablesForPruning, pruneGroundTruth);

7. Train the model, where algorithm1 is the variable defined in step 5:

    /* Train the Decision tree model */
    algorithm1.compute();

8. Store the result of training in a variable:

    decision_tree::regression::training::ResultPtr trainingResult = algorithm1.getResult();

9. Read the testing dataset from the corresponding .csv file:

    /* Initialize FileDataSource<CSVFeatureManager> to retrieve the test data from a .csv file */
    FileDataSource<CSVFeatureManager> testDataSource("test.csv",
        DataSource::notAllocateNumericTable, DataSource::doDictionaryFromContext);

    /* Create Numeric Tables for testing data and dependent variables */
    NumericTablePtr testData(new HomogenNumericTable<>(nFeatures, 0, NumericTable::notAllocate));
    NumericTablePtr testGroundTruth(new HomogenNumericTable<>(1, 0, NumericTable::notAllocate));
    NumericTablePtr testMergedData(new MergedNumericTable(testData, testGroundTruth));

    /* Retrieve the data from the input file */
    testDataSource.loadDataBlock(testMergedData.get());

10. Create an algorithm object to test the model:

    /* Create an algorithm object for Decision tree prediction with the default method */
    decision_tree::regression::prediction::Batch<> algorithm2;

11. Pass the testing data and the trained model to the algorithm:

    /* Pass the testing data set and trained model to the algorithm */
    algorithm2.input.set(decision_tree::regression::prediction::data, testData);
    algorithm2.input.set(decision_tree::regression::prediction::model, trainingResult->get(decision_tree::regression::training::model));

12. Test the model:

    /* Compute prediction results */
    algorithm2.compute();

13. Retrieve the results of the prediction:

    /* Retrieve algorithm results */
    decision_tree::regression::prediction::ResultPtr predictionResult = algorithm2.getResult();
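Similarly, the predicted values of the dependent variable can be read from the regression prediction result. The result identifier below follows the DAAL naming convention for this algorithm and is an assumption to verify against your DAAL version:

    /* Retrieve the numeric table with predicted values of the dependent variable */
    NumericTablePtr predictedValues = predictionResult->get(decision_tree::regression::prediction::prediction);

    /* Read the predictions through a block descriptor */
    BlockDescriptor<double> block;
    predictedValues->getBlockOfRows(0, predictedValues->getNumberOfRows(), readOnly, block);
    const double* values = block.getBlockPtr();
    /* ... compare values[i] with the corresponding rows of testGroundTruth ... */
    predictedValues->releaseBlockOfRows(block);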
Conclusion
The decision tree is a powerful method that can be used for both classification and regression. Intel® DAAL provides an optimized implementation of the decision tree algorithm. By using Intel® DAAL, developers can take advantage of new features in future generations of Intel® Xeon® processors without having to modify their applications; they only need to link their applications to the latest version of Intel® DAAL.
References
1. https://en.wikipedia.org/wiki/Level_of_measurement
2. https://software.intel.com/en-us/blogs/daal
3. Leo Breiman, Jerome H. Friedman, Richard A. Olshen, Charles J. Stone. Classification and Regression Trees. Chapman & Hall, 1984.
4. J. R. Quinlan. Induction of Decision Trees. Machine Learning, Volume 1, Issue 1, pp. 81-106, 1986.
5. J. R. Quinlan. Simplifying Decision Trees. International Journal of Man-Machine Studies, Volume 27, Issue 3, pp. 221-234, 1987.
6. http://www.cbcb.umd.edu/~salzberg/docs/murthy_thesis/survey/node32.html
7. https://en.wikipedia.org/wiki/Decision_tree_learning
8. https://software.intel.com/en-us/get-started-with-daal-for-linux
9. https://software.intel.com/en-us/get-started-with-daal-for-windows
10. https://software.intel.com/en-us/get-started-with-daal-for-macos