Decision Tree Regression Model (Decision Tree - Regression)

Decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

[Image: http://www.saedsayad.com/images/Decision_tree_r1.png]
Decision Tree Algorithm

The core algorithm for building decision trees, called ID3 by J. R. Quinlan, employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 can be used to build a regression tree by replacing Information Gain with Standard Deviation Reduction.

Standard Deviation
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). We use standard deviation to calculate the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation is zero.
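As a quick sketch of this idea (a minimal Python example with made-up numbers, not code from the article), homogeneity is just the population standard deviation:

```python
import math

def std_dev(values):
    """Population standard deviation of a numeric sample."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# A completely homogeneous sample has a standard deviation of zero.
print(std_dev([40, 40, 40]))  # 0.0
# A spread-out sample gives a positive value.
print(std_dev([25, 30, 52]))
```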
a) Standard deviation for one attribute:

[Image: http://www.saedsayad.com/images/Decision_tree_r2.png]

b) Standard deviation for two attributes (target and predictor), S(T, X) = Σ P(c) · S(c), summed over the values c of attribute X:

[Image: http://www.saedsayad.com/images/Decision_tree_r3.png]
Standard Deviation Reduction

The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute: SDR(T, X) = S(T) − S(T, X). Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).
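Assuming a split on a categorical attribute, the reduction can be sketched as follows (the `sdr` helper and the four-row toy dataset are illustrative, not the article's data):

```python
import math
from collections import defaultdict

def std_dev(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sdr(rows, attribute, target):
    """SDR(T, X) = S(T) minus the weighted sum of branch standard deviations."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    n = len(rows)
    weighted = sum(len(g) / n * std_dev(g) for g in groups.values())
    return std_dev([row[target] for row in rows]) - weighted

# Hypothetical mini-dataset; the column names only echo the article's example.
data = [
    {"Outlook": "Sunny", "Hours": 25},
    {"Outlook": "Sunny", "Hours": 30},
    {"Outlook": "Rainy", "Hours": 46},
    {"Outlook": "Rainy", "Hours": 45},
]
print(sdr(data, "Outlook", "Hours"))  # positive: the split makes branches more homogeneous
```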
Step 1: The standard deviation of the target is calculated.

Standard deviation (Hours Played) = 9.32
Step 2: The dataset is then split on the different attributes. The standard deviation for each branch is calculated. The resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard deviation reduction.

[Image: http://www.saedsayad.com/images/Decision_tree_r4.png]
[Image: http://www.saedsayad.com/images/Decision_tree_r5.png]
Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.

[Image: http://www.saedsayad.com/images/Decision_tree_r6.png]
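Choosing the decision node is then just an argmax over candidate attributes. A self-contained sketch (the toy rows and the `Windy` column are invented for illustration):

```python
import math
from collections import defaultdict

def std_dev(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sdr(rows, attribute, target):
    """Standard deviation reduction achieved by splitting on `attribute`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    n = len(rows)
    weighted = sum(len(g) / n * std_dev(g) for g in groups.values())
    return std_dev([row[target] for row in rows]) - weighted

# Invented toy rows; only the column names echo the article's example.
data = [
    {"Outlook": "Sunny", "Windy": False, "Hours": 25},
    {"Outlook": "Sunny", "Windy": True,  "Hours": 30},
    {"Outlook": "Rainy", "Windy": False, "Hours": 46},
    {"Outlook": "Rainy", "Windy": True,  "Hours": 45},
]
# The decision node is the attribute with the largest reduction.
best = max(["Outlook", "Windy"], key=lambda a: sdr(data, a, "Hours"))
print(best)  # "Outlook": its branches are far more homogeneous than "Windy"'s
```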
Step 4a: The dataset is divided based on the values of the selected attribute.

[Image: http://www.saedsayad.com/images/Decision_tree_r7.png]
Step 4b: A branch set with a standard deviation greater than zero needs further splitting. In practice, we need some termination criterion: for example, stop when the standard deviation of a branch becomes smaller than a certain fraction (e.g., 5%) of the standard deviation of the full dataset, or when too few instances remain in the branch.

[Image: http://www.saedsayad.com/images/Decision_tree_r8.png]
Step 5: The process is run recursively on the non-leaf branches until all data is processed.
When the number of instances at a leaf node is more than one, we calculate the average as the final value for the target.
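The five steps can be combined into a short recursive builder. This is a sketch under stated assumptions, not the article's implementation: `build_tree`, the 5% threshold, the minimum branch size, and the toy dataset are all illustrative.

```python
import math
from collections import defaultdict

def std_dev(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sdr(rows, attribute, target):
    """Standard deviation reduction achieved by splitting on `attribute`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    n = len(rows)
    weighted = sum(len(g) / n * std_dev(g) for g in groups.values())
    return std_dev([row[target] for row in rows]) - weighted

def build_tree(rows, attributes, target, full_std, frac=0.05, min_rows=3):
    values = [row[target] for row in rows]
    # Termination: branch nearly homogeneous (std below a fraction of the
    # full dataset's std), too few instances, or no attributes left.
    # A leaf predicts the average of its instances.
    if not attributes or len(rows) < min_rows or std_dev(values) < frac * full_std:
        return sum(values) / len(values)
    # Split on the attribute with the largest standard deviation reduction.
    best = max(attributes, key=lambda a: sdr(rows, a, target))
    groups = defaultdict(list)
    for row in rows:
        groups[row[best]].append(row)
    remaining = [a for a in attributes if a != best]
    return {(best, value): build_tree(subset, remaining, target, full_std, frac, min_rows)
            for value, subset in groups.items()}

# Invented toy dataset; the column names only echo the article's example.
data = [
    {"Outlook": "Sunny", "Windy": False, "Hours": 25},
    {"Outlook": "Sunny", "Windy": True,  "Hours": 30},
    {"Outlook": "Rainy", "Windy": False, "Hours": 46},
    {"Outlook": "Rainy", "Windy": True,  "Hours": 45},
]
full_std = std_dev([row["Hours"] for row in data])
tree = build_tree(data, ["Outlook", "Windy"], "Hours", full_std)
print(tree)  # (attribute, value) pairs map to subtrees or leaf averages
```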