A first look at Weka 3: building a decision tree with J48 (C4.5) and what each setOptions parameter means_neolone

http://blog.sina.com.cn/u/1693252655


(2014-03-31 10:20:45)

Tags: decision tree, weka, j48, c4.5, it

Category: machine learning

Notes from my first time using Weka 3

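1. Installing and using Weka

After downloading the matching version of Weka from the official site, import the weka.jar found in the root of the installation directory into your Eclipse project and it is ready to use.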

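2. Preparing the dataset

The UCI data is laid out a little oddly, and importing it in C4.5 format did not work properly either. The .names file holds the description of each column and the .data file holds the data. Following the format of a sample ARFF file, I converted it to ARFF by hand (I will write a program to do this automatically later). Note that the ARFF format does not say which attribute is the class, so the class has to be set manually.

In the UCI set, the test file differs from the data file in the class label at the end of each row: the test file has an extra "." there.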
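The trailing "." matters because a label with the dot would otherwise be read as a different class value from the same label without it. Below is a minimal sketch of the cleanup step such a converter could apply to each row. It assumes comma-separated rows whose last field is the class label; the class and method names are made up for illustration and are not part of Weka.

```java
final class UciRowCleaner {
    /**
     * Normalizes one row of a UCI .data/.test file for use in an ARFF @data section.
     * Assumption: the last field is the class label, so a trailing "." can only
     * be the extra dot that the test file appends.
     */
    static String normalizeRow(String line) {
        String row = line.trim();
        if (row.endsWith(".")) {
            // drop the extra dot so train and test labels are spelled the same
            row = row.substring(0, row.length() - 1);
        }
        // remove the spaces around commas so values carry no stray whitespace
        return row.replaceAll("\\s*,\\s*", ",");
    }
}
```

Rows from the training file pass through the same method unchanged apart from the whitespace cleanup, so one converter can handle both files.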

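3. Basic steps for using Weka

- Load the instances used for training and testing
- Preprocess them with a Filter (optional)
- Choose a classifier or clusterer (Classifiers/Clusterer) and train it
- Evaluate the result

Details:

- Load the instances used for training and testing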

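import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffLoader;

ArffLoader atf = new ArffLoader(); // reads a source in ARFF (attribute-relation file format)
File inputFile = new File("adult.arff"); // the training file
atf.setFile(inputFile);
Instances instancesTrain = atf.getDataSet(); // the parsed training data
// Set the index of the class attribute (0-based, so the first attribute is 0);
// numAttributes() returns the total number of attributes, so this picks the last one.
instancesTrain.setClassIndex(instancesTrain.numAttributes() - 1);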

- Preprocess them with a Filter (optional)

For example, to remove a given column:

String[] options = new String[2];
options[0] = "-R"; // "range"
options[1] = "1"; // first attribute
Remove remove = new Remove(); // new instance of filter (weka.filters.unsupervised.attribute.Remove)
remove.setOptions(options); // set options
// inform the filter about the dataset **AFTER** setting options;
// data is an Instances object loaded as above (for example instancesTrain)
remove.setInputFormat(data);
Instances newData = Filter.useFilter(data, remove); // apply filter

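- Choose a classifier or clusterer (Classifiers/Clusterer) and train it

// J48 builds a pruned or unpruned C4.5 decision tree; the variable is declared
// as J48 rather than Classifier so that setOptions is available on it
J48 m_classifier = new J48();
String[] j48Options = new String[3]; // training options
j48Options[0] = "-R"; // use reduced error pruning
j48Options[1] = "-M"; // minimum number of instances per leaf
j48Options[2] = "3"; // the value for -M
m_classifier.setOptions(j48Options); // apply the training options
m_classifier.buildClassifier(instancesTrain); // train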

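- Evaluate the result

// instancesTest is loaded from the test ARFF file the same way as instancesTrain,
// with its class index set as well
Evaluation eval = new Evaluation(instancesTrain); // build the evaluator
eval.evaluateModel(m_classifier, instancesTest); // evaluate m_classifier on the test set
System.out.println(eval.toSummaryString("=== Summary ===\n", false)); // print the summary
System.out.println(eval.toMatrixString("=== Confusion Matrix ===\n")); // print the confusion matrix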
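To read the two printouts together: the share of correctly classified instances is the sum of the diagonal of the confusion matrix divided by the sum of all its cells. A small sketch of that arithmetic on a plain int matrix, independent of Weka (the class name is hypothetical):

```java
final class ConfusionStats {
    /**
     * Accuracy from a square confusion matrix, where m[i][j] counts instances
     * of actual class i that were predicted as class j.
     */
    static double accuracy(int[][] m) {
        long correct = 0;
        long total = 0;
        for (int i = 0; i < m.length; i++) {
            for (int j = 0; j < m[i].length; j++) {
                total += m[i][j];
                if (i == j) {
                    correct += m[i][j]; // diagonal cells are the correct predictions
                }
            }
        }
        return total == 0 ? 0.0 : (double) correct / total;
    }
}
```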

4. Classifier parameters

-U

  Use unpruned tree.

-C

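  Set confidence threshold for pruning.

As I understand it, this corresponds to the threshold test on p0-pg in the tree-building algorithm: when the reduction in entropy after a split is not significant, the node becomes a leaf labeled with the class Cj that has the largest share in D. Note that in J48 itself the value is the confidence level behind the pessimistic error estimate used for pruning, and smaller values prune more.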

  (default 0.25)
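In C4.5-style pruning the confidence value is usually explained as setting how pessimistic the error estimate at a node is. The observed error rate f = E/N is replaced by an upper confidence limit, and a lower confidence gives a larger limit and therefore more pruning. The sketch below uses the common textbook normal-approximation form of that limit. The z value is passed in as a parameter (about 0.69 is the figure usually quoted for a confidence of 0.25), the names are illustrative, and Weka's own computation may differ in detail.

```java
final class PessimisticError {
    /**
     * Upper confidence limit on the error rate of a node.
     *
     * @param errors misclassified training instances at the node (E)
     * @param n      training instances at the node (N)
     * @param z      normal deviate for the chosen confidence (assumed input)
     */
    static double upperBound(double errors, double n, double z) {
        double f = errors / n; // observed error rate
        double z2 = z * z;
        double numerator = f + z2 / (2 * n)
                + z * Math.sqrt(f / n - f * f / n + z2 / (4 * n * n));
        return numerator / (1 + z2 / n);
    }
}
```

With few instances at a node the limit sits well above the observed rate, which is what makes small subtrees easy to prune.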

-M

  Set minimum number of instances per leaf. A leaf that would hold fewer instances than this value is treated as noise or erroneous data and pruned away.

  (default 2)
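As far as I can tell, J48 applies this limit when it considers a split: a candidate split is only usable if at least two of its branches receive the minimum number of instances. The sketch below shows that check on plain branch counts. The class and method names are hypothetical, and this is a simplification, not Weka's code.

```java
final class MinLeafCheck {
    /**
     * Returns true when at least two branches of a candidate split each
     * receive at least minPerLeaf instances.
     */
    static boolean splitAllowed(int[] branchCounts, int minPerLeaf) {
        int bigEnough = 0;
        for (int count : branchCounts) {
            if (count >= minPerLeaf) {
                bigEnough++;
            }
        }
        return bigEnough >= 2;
    }
}
```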

-R

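  Use reduced error pruning. Starting at the leaves, each node is replaced with its most popular class. If the prediction accuracy is not affected then the change is kept. For each candidate node, Reduced-Error Pruning (REP) works as follows:

1: Remove the subtree rooted at this node
2: Make the node a leaf
3: Assign it the most common class among the training data associated with the node
4: Only actually remove the subtree if the pruned tree performs no worse on the validation set than the original tree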
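The four steps can be sketched on a toy tree over nominal attributes. This is an illustration of the idea only, not Weka's implementation. The Node type, the row representation (a map from attribute name to value) and all names are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class RepSketch {
    /** A node is a leaf when it has no children; otherwise it tests one nominal attribute. */
    static final class Node {
        final String attribute;     // attribute tested here (ignored for leaves)
        final String majorityClass; // most common class of the training rows at this node
        Map<String, Node> children = new HashMap<>();

        Node(String attribute, String majorityClass) {
            this.attribute = attribute;
            this.majorityClass = majorityClass;
        }

        boolean isLeaf() {
            return children.isEmpty();
        }
    }

    static String classify(Node node, Map<String, String> row) {
        while (!node.isLeaf()) {
            Node next = node.children.get(row.get(node.attribute));
            if (next == null) {
                break; // unseen value: fall back to this node's majority class
            }
            node = next;
        }
        return node.majorityClass;
    }

    /** Number of validation rows the tree misclassifies. */
    static int errors(Node root, List<Map<String, String>> rows, String classAttr) {
        int wrong = 0;
        for (Map<String, String> row : rows) {
            if (!classify(root, row).equals(row.get(classAttr))) {
                wrong++;
            }
        }
        return wrong;
    }

    /** Bottom-up REP: tentatively turn each inner node into a leaf, keep it if no worse. */
    static void prune(Node node, Node root, List<Map<String, String>> validation, String classAttr) {
        for (Node child : node.children.values()) {
            prune(child, root, validation, classAttr);
        }
        if (node.isLeaf()) {
            return;
        }
        int before = errors(root, validation, classAttr);
        Map<String, Node> saved = node.children;
        node.children = new HashMap<>(); // steps 1-3: drop the subtree, leaf predicts majorityClass
        int after = errors(root, validation, classAttr);
        if (after > before) {
            node.children = saved;       // step 4: the pruned tree is worse, so undo
        }
    }
}
```

Calling prune(root, root, validationRows, classAttributeName) prunes the tree in place. Measuring the error of the whole tree before and after each tentative change is a simple way to express step 4, at the cost of repeated passes over the validation rows.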

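The default is C4.5 pruning, that is, Error-Based Pruning (EBP). It computes three predicted misclassification counts:

Compute the sum of the predicted misclassified samples over all leaves of subtree t, denoted E1
Compute the predicted misclassified samples when subtree t is pruned and replaced by a leaf, denoted E2
Compute the predicted misclassified samples of the largest branch of subtree t, denoted E3

E1, E2 and E3 are then compared as follows:

If E1 is the smallest, do not prune
If E2 is the smallest, prune and replace t with a single leaf
If E3 is the smallest, use the "grafting" strategy, that is, replace t with this largest branch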
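The comparison itself is small enough to write down. The sketch below takes the three estimates as inputs and returns the action to take. The text above does not say what happens on ties, so breaking ties in favor of the simpler tree (leaf first, then grafting) is an assumption of this example, as are all the names.

```java
final class EbpChoice {
    enum Action { KEEP_SUBTREE, REPLACE_WITH_LEAF, GRAFT_LARGEST_BRANCH }

    /**
     * @param e1 estimated errors of subtree t as it stands (sum over its leaves)
     * @param e2 estimated errors if t is replaced by a single leaf
     * @param e3 estimated errors of the largest branch of t
     */
    static Action choose(double e1, double e2, double e3) {
        if (e2 <= e1 && e2 <= e3) {
            return Action.REPLACE_WITH_LEAF;    // E2 is smallest (or tied)
        }
        if (e3 <= e1) {
            return Action.GRAFT_LARGEST_BRANCH; // E3 is smallest (or tied with E1)
        }
        return Action.KEEP_SUBTREE;             // E1 is strictly smallest
    }
}
```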

-N

  Set number of folds for reduced error pruning. Determines the amount of data used for reduced-error pruning. One fold is used for pruning, the rest for growing the tree.

  (default 3)
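With the default of 3 folds, roughly one third of the training data is held back for pruning and two thirds are used to grow the tree. A minimal sketch of such a split follows. It shuffles with a fixed seed and cuts off the last 1/folds of the rows. The names are illustrative, and Weka's exact fold sizes and ordering may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class RepFoldSplit {
    /**
     * Returns a two-element list: index 0 is the growing set, index 1 the pruning set.
     * The input list is left untouched; the same seed always gives the same split.
     */
    static <T> List<List<T>> split(List<T> data, int folds, long seed) {
        List<T> shuffled = new ArrayList<>(data);
        Collections.shuffle(shuffled, new Random(seed));
        int pruneSize = shuffled.size() / folds; // one fold for pruning
        int cut = shuffled.size() - pruneSize;   // the rest for growing
        List<T> growing = new ArrayList<>(shuffled.subList(0, cut));
        List<T> pruning = new ArrayList<>(shuffled.subList(cut, shuffled.size()));
        return List.of(growing, pruning);
    }
}
```

A larger number of folds leaves more data for growing the tree and less for judging which subtrees to prune.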

-B

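  Use binary splits only. Every node makes a two-way split: a nominal attribute with several values is split only with ==X and !=X tests, which makes the tree larger.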

-S

  Don't perform subtree raising.

· Subtree replacement selects a subtree and replaces it with a single leaf.

· Subtree raising selects a subtree and replaces it with one of its children (i.e., a "sub-subtree" replaces its parent).

-L

  Do not clean up after the tree has been built. Whether to save the training data for visualization. I have not found a practical use for this parameter beyond keeping the training data around for display.

-A

  Laplace smoothing for predicted probabilities. Controls whether Laplace smoothing is applied.
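Laplace smoothing adds one to each class count at a leaf before normalizing, so a class that never occurred at that leaf still gets a small non-zero probability: p(c) = (n_c + 1) / (N + k) for k classes. A short sketch of that formula on raw counts (the class name is hypothetical):

```java
final class LaplaceSketch {
    /** Smoothed class probabilities for the class counts observed at one leaf. */
    static double[] smoothed(int[] classCounts) {
        int total = 0;
        for (int count : classCounts) {
            total += count;
        }
        double[] probs = new double[classCounts.length];
        for (int i = 0; i < classCounts.length; i++) {
            // add one to every count; the denominator grows by the number of classes
            probs[i] = (classCounts[i] + 1.0) / (total + classCounts.length);
        }
        return probs;
    }
}
```

For a leaf with counts 3 and 0, the unsmoothed estimates would be 1.0 and 0.0, while the smoothed ones are 0.8 and 0.2.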

-Q

  Seed for random data shuffling (default 1). The seed used for randomizing the data when reduced-error pruning is used.

The source code for the example is available at: http://pan.baidu.com/s/1c1ALRlY
