Decision Tree Regression Model (Decision Tree - Regression)

Decision tree builds regression or classification models in the form of a tree structure. It breaks down a dataset into smaller and smaller subsets while at the same time an associated decision tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes.

[Image: http://www.saedsayad.com/images/Decision_tree_r1.png]
Decision Tree Algorithm

The core algorithm for building decision trees, called ID3 by J. R. Quinlan, employs a top-down, greedy search through the space of possible branches with no backtracking. ID3 can be used to build a regression tree by replacing Information Gain with Standard Deviation Reduction.

Standard Deviation
A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous). We use standard deviation to calculate the homogeneity of a numerical sample. If the numerical sample is completely homogeneous, its standard deviation is zero.
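As a quick sketch of this idea (a minimal Python example with made-up numbers, not code from the article), homogeneity is just the population standard deviation:

```python
import math

def std_dev(values):
    """Population standard deviation of a numeric sample."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

# A completely homogeneous sample has a standard deviation of zero.
print(std_dev([40, 40, 40]))  # 0.0
# A spread-out sample gives a positive value.
print(std_dev([25, 30, 52]))
```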
a) Standard deviation for one attribute:

[Image: http://www.saedsayad.com/images/Decision_tree_r2.png]

b) Standard deviation for two attributes (target and predictor), S(T, X) = Σ P(c) · S(c), summed over the values c of attribute X:

[Image: http://www.saedsayad.com/images/Decision_tree_r3.png]
Standard Deviation Reduction

The standard deviation reduction is based on the decrease in standard deviation after a dataset is split on an attribute: SDR(T, X) = S(T) − S(T, X). Constructing a decision tree is all about finding the attribute that returns the highest standard deviation reduction (i.e., the most homogeneous branches).
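Assuming a split on a categorical attribute, the reduction can be sketched as follows (the `sdr` helper and the four-row toy dataset are illustrative, not the article's data):

```python
import math
from collections import defaultdict

def std_dev(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sdr(rows, attribute, target):
    """SDR(T, X) = S(T) minus the weighted sum of branch standard deviations."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    n = len(rows)
    weighted = sum(len(g) / n * std_dev(g) for g in groups.values())
    return std_dev([row[target] for row in rows]) - weighted

# Hypothetical mini-dataset; the column names only echo the article's example.
data = [
    {"Outlook": "Sunny", "Hours": 25},
    {"Outlook": "Sunny", "Hours": 30},
    {"Outlook": "Rainy", "Hours": 46},
    {"Outlook": "Rainy", "Hours": 45},
]
print(sdr(data, "Outlook", "Hours"))  # positive: the split makes branches more homogeneous
```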
Step 1: The standard deviation of the target is calculated.

Standard deviation (Hours Played) = 9.32
Step 2: The dataset is then split on the different attributes. The standard deviation for each branch is calculated. The resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard deviation reduction.

[Image: http://www.saedsayad.com/images/Decision_tree_r4.png]
[Image: http://www.saedsayad.com/images/Decision_tree_r5.png]
Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node.

[Image: http://www.saedsayad.com/images/Decision_tree_r6.png]
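Choosing the decision node is then just an argmax over candidate attributes. A self-contained sketch (the toy rows and the `Windy` column are invented for illustration):

```python
import math
from collections import defaultdict

def std_dev(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sdr(rows, attribute, target):
    """Standard deviation reduction achieved by splitting on `attribute`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    n = len(rows)
    weighted = sum(len(g) / n * std_dev(g) for g in groups.values())
    return std_dev([row[target] for row in rows]) - weighted

# Invented toy rows; only the column names echo the article's example.
data = [
    {"Outlook": "Sunny", "Windy": False, "Hours": 25},
    {"Outlook": "Sunny", "Windy": True,  "Hours": 30},
    {"Outlook": "Rainy", "Windy": False, "Hours": 46},
    {"Outlook": "Rainy", "Windy": True,  "Hours": 45},
]
# The decision node is the attribute with the largest reduction.
best = max(["Outlook", "Windy"], key=lambda a: sdr(data, a, "Hours"))
print(best)  # "Outlook": its branches are far more homogeneous than "Windy"'s
```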
Step 4a: The dataset is divided based on the values of the selected attribute.

[Image: http://www.saedsayad.com/images/Decision_tree_r7.png]
Step 4b: A branch set with a standard deviation greater than zero needs further splitting. In practice, we need some termination criterion: for example, stop when the standard deviation of a branch becomes smaller than a certain fraction (e.g., 5%) of the standard deviation of the full dataset, or when too few instances remain in the branch.

[Image: http://www.saedsayad.com/images/Decision_tree_r8.png]
Step 5: The process is run recursively on the non-leaf branches until all data is processed.
When the number of instances at a leaf node is more than one, we calculate the average as the final value for the target.
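The five steps can be combined into a short recursive builder. This is a sketch under stated assumptions, not the article's implementation: `build_tree`, the 5% threshold, the minimum branch size, and the toy dataset are all illustrative.

```python
import math
from collections import defaultdict

def std_dev(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def sdr(rows, attribute, target):
    """Standard deviation reduction achieved by splitting on `attribute`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attribute]].append(row[target])
    n = len(rows)
    weighted = sum(len(g) / n * std_dev(g) for g in groups.values())
    return std_dev([row[target] for row in rows]) - weighted

def build_tree(rows, attributes, target, full_std, frac=0.05, min_rows=3):
    values = [row[target] for row in rows]
    # Termination: branch nearly homogeneous (std below a fraction of the
    # full dataset's std), too few instances, or no attributes left.
    # A leaf predicts the average of its instances.
    if not attributes or len(rows) < min_rows or std_dev(values) < frac * full_std:
        return sum(values) / len(values)
    # Split on the attribute with the largest standard deviation reduction.
    best = max(attributes, key=lambda a: sdr(rows, a, target))
    groups = defaultdict(list)
    for row in rows:
        groups[row[best]].append(row)
    remaining = [a for a in attributes if a != best]
    return {(best, value): build_tree(subset, remaining, target, full_std, frac, min_rows)
            for value, subset in groups.items()}

# Invented toy dataset; the column names only echo the article's example.
data = [
    {"Outlook": "Sunny", "Windy": False, "Hours": 25},
    {"Outlook": "Sunny", "Windy": True,  "Hours": 30},
    {"Outlook": "Rainy", "Windy": False, "Hours": 46},
    {"Outlook": "Rainy", "Windy": True,  "Hours": 45},
]
full_std = std_dev([row["Hours"] for row in data])
tree = build_tree(data, ["Outlook", "Windy"], "Hours", full_std)
print(tree)  # (attribute, value) pairs map to subtrees or leaf averages
```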