Decision Tree Algorithm Parameters

There are a number of parameters for Microsoft Decision Trees. These parameters control tree growth, tree shape, and the input/output attribute settings. By adjusting these settings, you can fine-tune the model's accuracy. The following list describes the decision tree algorithm parameters.

  • Complexity_Penalty controls tree growth. It is a floating-point number in the range [0,1]. When its value is set close to 0, there is a lower penalty for tree growth during model training, so you may see a large tree. When its value is set close to 1, tree growth is penalized heavily, and the final tree is relatively small. Generally speaking, large trees tend to have overtraining issues, whereas small trees may miss some patterns. The recommended way to tune the model is to train multiple trees with different settings and then use a lift chart to verify each model's accuracy on testing data in order to pick the best one (see the DMX sketch following this list). The default setting depends on the number of input attributes: if there are fewer than 10 input attributes, the value is set to 0.5; if there are more than 100 attributes, it is set to 0.99; if there are between 10 and 100 input attributes, it is set to 0.9.
  • Minimum_Support specifies the minimum size of each leaf node in the tree. For example, if this value is set to 20, any tree split that would produce a child node containing fewer than 20 cases is not accepted. The default value for Minimum_Support is 10. Usually, if the training dataset contains many cases, you will need to raise the value of this parameter to avoid oversplitting (overtraining).
  • Score_Method is an integer parameter. It specifies the method for measuring the score of a tree split during tree growth. We have discussed the concept of entropy in this chapter; to use an entropy score for tree growth, set Score_Method = 1. Microsoft Decision Trees supports two other score methods: Bayesian K2 (BK2, Score_Method = 3) and Bayesian Dirichlet Equivalent with Uniform prior (BDEU, Score_Method = 4). BK2 adds a constant for each state of the predictable attribute in a tree node, regardless of the node's level in the tree. BDEU adds weighted support to each predictable state based on the node level: the weight of the root node is higher than that of a leaf node, so the assigned prior (knowledge) is larger there. The default value for Score_Method is 4, the BDEU method. Score_Method = 2 (orthogonal) is no longer supported in SQL Server 2005.
  • Split_Method is an integer parameter. It specifies the tree shape, for example, whether the tree splits are binary or bushy. Split_Method = 1 means the tree is split only in a binary way. For example, Education is an attribute with three states: high school, undergraduate, and graduate. If the tree split is set to be binary, the algorithm may split the tree into two nodes with the criterion "Education = Undergraduate?" If the tree split is set to be complete (Split_Method = 2), the split on the Education attribute produces three nodes, one corresponding to each educational state. When Split_Method is set to 3 (the default setting), the decision tree automatically chooses the better of the first two methods to create the split.
  • Maximum_Input_Attribute is a threshold parameter for feature selection. When the number of input attributes is greater than this parameter value, feature selection is invoked implicitly to select the most significant input attributes.
  • Maximum_Output_Attribute is a threshold parameter for feature selection. When the number of predictable attributes is greater than this parameter value, feature selection is invoked implicitly to select the most significant predictable attributes. A tree is built for each of the selected attributes.
  • Force_Regressor is a parameter for regression trees. It forces the algorithm to use the specified attributes as regressors. Suppose that you have a model to predict Income using Age, IQ, and other attributes. If you specify Force_Regressor = {Age, IQ}, you get regression formulas using Age and IQ at each leaf node of the tree (a sketch of such a model appears after this list).
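
To make the parameter syntax concrete, here is a minimal DMX sketch that sets several of these parameters on one model. The model and column names (CreditRiskTree, Customer ID, Credit Risk, and so on) are illustrative assumptions, not taken from the text above; the USING <algorithm> (<parameter list>) clause is the standard way to pass algorithm parameters in DMX.

    -- Hypothetical classification model; names and parameter values are illustrative.
    CREATE MINING MODEL [CreditRiskTree]
    (
        [Customer ID]  LONG KEY,
        [Education]    TEXT DISCRETE,
        [Age]          LONG CONTINUOUS,
        [Credit Risk]  TEXT DISCRETE PREDICT    -- the predictable attribute
    )
    USING Microsoft_Decision_Trees
    (
        COMPLEXITY_PENALTY = 0.9,  -- stronger growth penalty, hence a smaller tree
        MINIMUM_SUPPORT    = 20,   -- reject splits that leave a node with fewer than 20 cases
        SCORE_METHOD       = 1,    -- 1 = entropy (3 = BK2, 4 = BDEU, the default)
        SPLIT_METHOD       = 3     -- let the algorithm choose binary or complete splits
    )

To tune Complexity_Penalty as recommended above, you would create several such models that differ only in that value, process them on the same training data, and compare them with a lift chart on testing data.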
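
The Force_Regressor parameter is easiest to see on a regression tree. The sketch below again uses invented names (IncomeTree, Customer ID) and forces a single regressor; the REGRESSOR content flag marks columns that are eligible to appear in the leaf-node regression formulas, and the exact syntax for forcing multiple regressors may vary by product version.

    -- Hypothetical regression tree predicting Income; names are illustrative.
    CREATE MINING MODEL [IncomeTree]
    (
        [Customer ID] LONG KEY,
        [Age]         LONG CONTINUOUS REGRESSOR,  -- eligible regressor
        [IQ]          LONG CONTINUOUS REGRESSOR,  -- eligible regressor
        [Income]      LONG CONTINUOUS PREDICT     -- continuous target, so a regression tree
    )
    USING Microsoft_Decision_Trees
    (
        FORCE_REGRESSOR = [Age]   -- force Age into every leaf's regression formula
    )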
