Classifying Data with a Support Vector Machine

Load the e1071 package:
> library(e1071)
Train an SVM with the svm function, using trainset as the input data set and churn as the class label:
> model=svm(churn~.,data=trainset,kernel="radial",cost=1,gamma=1/ncol(trainset))
Use summary to retrieve the details of the fitted model:
> summary(model)

Call:
svm(formula = churn ~ ., data = trainset, kernel = "radial", 
    cost = 1, gamma = 1/ncol(trainset))

Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.05882353 

Number of Support Vectors:  348
 ( 140 208 )

Number of Classes:  2 
Levels: 
 yes no
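Where that gamma comes from: the call passed gamma=1/ncol(trainset), and the reported value 0.05882353 implies this trainset has 17 columns (a quick base-R check, assuming nothing beyond the output above):

```r
# gamma was set to 1/ncol(trainset); the summary above reports 0.05882353,
# which corresponds to a trainset with 17 columns
gamma <- 1 / 17
round(gamma, 8)  # 0.05882353
```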
Next, read the iris data set and call subset to keep the samples whose class is Iris-setosa or Iris-versicolor, projecting onto the petal.length, petal.width, and class columns:
> iris=read.csv("D://Rdata/iris.csv")
> str(iris)
'data.frame':   150 obs. of  5 variables:
 $ sepal.length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ sepal.width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ petal.length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ petal.width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ class       : Factor w/ 3 levels "Iris-setosa",..: 1 1 1 1 1 1 1 1 1 1 ...
> iris.subset=subset(iris,select=c("petal.length",
+     "petal.width",
+     "class"),
+     class %in% c("Iris-setosa","Iris-versicolor"))
> plot(x=iris.subset$petal.length,y=iris.subset$petal.width,
+     col=iris.subset$class,pch=19)
[Figure: scatter plot of petal.length vs. petal.width, colored by class]
Set the cost (penalty) parameter to 1 and train a linear SVM on iris.subset:
> svm.model=svm(class~.,data=iris.subset,kernel="linear",
+     cost=1,scale=F)
Mark the support vectors with blue circles:
> points(iris.subset[svm.model$index,c(1,2)],col="blue",cex=2)
[Figure: the scatter plot with the support vectors circled in blue]
Add the separating line:
> w=t(svm.model$coefs) %*% svm.model$SV
> b=-svm.model$rho
> abline(a=-b/w[1,2],b=-w[1,1]/w[1,2],col="red",lty=5)
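Why abline gets these coefficients: for a linear SVM the decision boundary satisfies w·x + b = 0, where w is recovered from the support vectors and their coefficients, and b = -rho in e1071's parameterization. Solving for the vertical coordinate gives:

```latex
w_1 x_1 + w_2 x_2 + b = 0
\quad\Longrightarrow\quad
x_2 = -\frac{b}{w_2} - \frac{w_1}{w_2}\, x_1
```

which matches the intercept a = -b/w[1,2] and slope = -w[1,1]/w[1,2] in the abline call.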
[Figure: the scatter plot with the separating line drawn in red]
Now set the cost to 10000 and retrain the SVM classifier:
> plot(x=iris.subset$petal.length,y=iris.subset$petal.width,
+     col=iris.subset$class,pch=19)
> svm.model=svm(class~.,data=iris.subset,kernel="linear",
+     cost=10000,scale=F)
> points(iris.subset[svm.model$index,c(1,2)],col="blue",cex=2)
> w=t(svm.model$coefs) %*% svm.model$SV
> b=-svm.model$rho
> abline(a=-b/w[1,2],b=-w[1,1]/w[1,2],col="red",lty=5)
Visualizing the SVM model
> data(iris)
> model.iris=svm(Species~.,iris)
> plot(model.iris,iris,Petal.Width~Petal.Length,
+     slice=list(Sepal.Width=3,Sepal.Length=4))
[Figure: SVM classification plot of Petal.Width vs. Petal.Length on the iris data]
Call plot on the SVM object model built on the churn data, with total_intl_charge on the x axis and total_day_minutes on the y axis:
> plot(model,trainset,total_day_minutes~total_intl_charge)
[Figure: SVM classification plot of total_day_minutes vs. total_intl_charge on trainset]
Class prediction on the churn data set with the trained SVM model. Drop the churn column from testset and call predict:
> svm.pred=predict(model,testset[,!names(testset) %in% c("churn")])
> svm.table=table(svm.pred,testset$churn)
> svm.table
        
svm.pred yes  no
     yes  24   5
     no   53 418
Call classAgreement to compute agreement coefficients from the classification table:
> classAgreement(svm.table)
$diag
[1] 0.884

$kappa
[1] 0.4024807

$rand
[1] 0.794501

$crand
[1] 0.3443065

Call confusionMatrix (from the caret package) to evaluate prediction performance based on the classification table:
> confusionMatrix(svm.table)
Confusion Matrix and Statistics

        
svm.pred yes  no
     yes  24   5
     no   53 418
                                          
               Accuracy : 0.884           
                 95% CI : (0.8526, 0.9107)
    No Information Rate : 0.846           
    P-Value [Acc > NIR] : 0.009104        
                                          
                  Kappa : 0.4025          
 Mcnemar's Test P-Value : 6.769e-10       
                                          
            Sensitivity : 0.3117          
            Specificity : 0.9882          
         Pos Pred Value : 0.8276          
         Neg Pred Value : 0.8875          
             Prevalence : 0.1540          
         Detection Rate : 0.0480          
   Detection Prevalence : 0.0580          
      Balanced Accuracy : 0.6499          
                                          
       'Positive' Class : yes             
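As a sanity check, the headline statistics in that output can be recomputed by hand from the 2x2 counts (a minimal base-R sketch using only the table printed above, with "yes" as the positive class):

```r
# 2x2 classification table from the output above
# (rows = predicted, columns = actual; "yes" is the positive class)
tab <- matrix(c(24, 53, 5, 418), nrow = 2,
              dimnames = list(pred = c("yes", "no"),
                              actual = c("yes", "no")))

accuracy    <- sum(diag(tab)) / sum(tab)              # (24 + 418) / 500
sensitivity <- tab["yes", "yes"] / sum(tab[, "yes"])  # 24 / 77
specificity <- tab["no", "no"]  / sum(tab[, "no"])    # 418 / 423

round(c(accuracy = accuracy,
        sensitivity = sensitivity,
        specificity = specificity), 4)
# accuracy 0.8840, sensitivity 0.3117, specificity 0.9882
```

The low sensitivity (0.3117) shows the untuned model misses most churners, which motivates the tuning step below.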
Tuning the SVM. Call tune.svm to grid-search over gamma and cost:
> tuned=tune.svm(churn~.,data=trainset,gamma=10^(-6:-1),
+     cost=10^(1:2))
Use summary to inspect the tuning results:
> summary(tuned)

Parameter tuning of ‘svm’:

- sampling method: 10-fold cross validation 

- best parameters:
 gamma cost
  0.01  100

- best performance: 0.09605051 

- Detailed performance results:
   gamma cost      error dispersion
1  1e-06   10 0.15610101 0.04512672
2  1e-05   10 0.15610101 0.04512672
3  1e-04   10 0.15610101 0.04512672
4  1e-03   10 0.15610101 0.04512672
5  1e-02   10 0.10207071 0.02648941
6  1e-01   10 0.11010101 0.01879928
7  1e-06  100 0.15610101 0.04512672
8  1e-05  100 0.15610101 0.04512672
9  1e-04  100 0.15610101 0.04512672
10 1e-03  100 0.12911111 0.03439782
11 1e-02  100 0.09605051 0.02623349
12 1e-01  100 0.12110101 0.01957463
Retrain the SVM with the best parameters found by tune.svm:
> model.tuned=svm(churn~.,data=trainset,gamma=tuned$best.parameters$gamma,
+     cost=tuned$best.parameters$cost)
> summary(model.tuned)

Call:
svm(formula = churn ~ ., data = trainset, gamma = tuned$best.parameters$gamma, 
    cost = tuned$best.parameters$cost)

Parameters:
   SVM-Type:  C-classification 
 SVM-Kernel:  radial 
       cost:  100 
      gamma:  0.01 

Number of Support Vectors:  254
 ( 108 146 )

Number of Classes:  2 
Levels: 
 yes no
Call predict with the retuned SVM model to predict the class labels:
> svm.tuned.pred=predict(model.tuned,testset[,!names(testset) %in% c("churn")])
Build the classification table from the predicted and actual classes of the test set:
> svm.tuned.table=table(svm.tuned.pred,testset$churn)
> svm.tuned.table
              
svm.tuned.pred yes  no
           yes  35  15
           no   42 408
Call classAgreement again to evaluate the tuned model:
> classAgreement(svm.tuned.table)
$diag
[1] 0.886

$kappa
[1] 0.4892473

$rand
[1] 0.7975872

$crand
[1] 0.4171314
> confusionMatrix(svm.tuned.table)
Confusion Matrix and Statistics

              
svm.tuned.pred yes  no
           yes  35  15
           no   42 408
                                          
               Accuracy : 0.886           
                 95% CI : (0.8548, 0.9125)
    No Information Rate : 0.846           
    P-Value [Acc > NIR] : 0.0062932       
                                          
                  Kappa : 0.4892          
 Mcnemar's Test P-Value : 0.0005736       
                                          
            Sensitivity : 0.4545          
            Specificity : 0.9645          
         Pos Pred Value : 0.7000          
         Neg Pred Value : 0.9067          
             Prevalence : 0.1540          
         Detection Rate : 0.0700          
   Detection Prevalence : 0.1000          
      Balanced Accuracy : 0.7095          
                                          
       'Positive' Class : yes             
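The kappa value 0.4892 that both classAgreement and confusionMatrix report for the tuned model can be reproduced from the 2x2 table above, using Cohen's formula kappa = (po - pe)/(1 - pe), where po is the observed agreement and pe the agreement expected by chance from the row and column margins (a base-R sketch):

```r
# 2x2 table for the tuned model (rows = predicted, columns = actual)
tab <- matrix(c(35, 42, 15, 408), nrow = 2)
n  <- sum(tab)                                # 500 test samples
po <- sum(diag(tab)) / n                      # observed agreement: 0.886
pe <- sum(rowSums(tab) * colSums(tab)) / n^2  # chance agreement from margins
kappa <- (po - pe) / (1 - pe)
round(kappa, 7)  # 0.4892473
```

The rise in kappa from 0.4025 to 0.4892 (and in sensitivity from 0.3117 to 0.4545) confirms that tuning gamma and cost improved the model's ability to detect churners.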