R语言做聚类分析Kmeans时确定类的个数_qq的爱

http://blog.sina.com.cn/u/5346271813

首页博文目录关于我

个人资料

微博

加好友发纸条

写留言加关注

博客等级：
博客积分：

博客访问：
关注人气：
获赠金笔：0支
赠出金笔：0支
荣誉徽章：

正文字体大小：大中小

R语言做聚类分析Kmeans时确定类的个数

(2018-06-05 09:57:44)

标签：

python

数据分析

数据分析师

数据科学家

杂谈

R语言做聚类分析Kmeans时确定类的个数

方法一：

K平均算法（k-means）
在下面的误差平方和图中，拐点（bend or elbow）的位置对应的x轴即k-means聚类给出的合适的类的个数。

> n = 100
> g=6
> set.seed(g)
> d <- data.frame(x = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))), y = unlist(lapply(1:g, function(i) rnorm(n/g, runif(1)*i^2))))
> plot(d)
>
http://www.cda.cn/uploadfile/image/20180604/20180604070535_66315.png
> mydata <- d
>
> wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
> for (i in 2:15)
wss[i] <- sum(kmeans(mydata,centers=i)$withinss)
> ###这里的wss(within-cluster sum of squares)是组内平方和
> plot(1:15, wss, type="b", xlab="Number of Clusters",ylab="Within groups sum of squares")
>
http://www.cda.cn/uploadfile/image/20180604/20180604070524_36471.png
由上图可以看出，该方法给出合理的类别个数是4个。
方法二：

K中心聚类算法（K-mediods）
使用fpc包里的pamk函数来估计类的个数：

> library(cluster)
Warning message:
程辑包‘cluster’是用R版本3.2.3 来建造的
> library(fpc)
> pamk.best <- pamk(d)
> cat("number of clusters estimated by optimum average silhouette width:", pamk.best$nc, "\n")
number of clusters estimated by optimum average silhouette width: 4
> plot(pam(d, pamk.best$nc))
http://www.cda.cn/uploadfile/image/20180604/20180604070506_35979.png
sihouette值是用来表示某一个对象和它所属类的凝合力强度以及和其他类分离强度的，值范围为-1到1，值越大表示该对象越匹配所属类以及和邻近类有多不匹配。
所以从上图sihouette plot中可以看出，该方法给出的合理类的个数为4个。
方法三：
基于Calinsky Criterion
> require(vegan)
载入需要的程辑包：vegan
载入需要的程辑包：permute
载入需要的程辑包：lattice
This is vegan 2.4-0
Warning messages:
1: 程辑包‘vegan’是用R版本3.2.5 来建造的
2: 程辑包‘permute’是用R版本3.2.5 来建造的
3: 程辑包‘lattice’是用R版本3.2.3 来建造的
> fit <- cascadeKM(scale(d, center = TRUE, scale = TRUE), 1, 10, iter = 1000)
> plot(fit, sortg = TRUE, grpmts.plot = TRUE)
> calinski.best <- as.numeric(which.max(fit$results[2,]))
> cat("Calinski criterion optimal number of clusters:", calinski.best, "\n")
Calinski criterion optimal number of clusters: 5
>
http://www.cda.cn/uploadfile/image/20180604/20180604070450_62555.png
由上图我们可以看到，根据Calinsky标准，得到类的个数是5个。
方法四：

基于模型假设的聚类，利用的是mclust包：

> library(mclust)
    __ ___________    __ _____________
   / |/ / ____/ /   / / / / ___/_ __/
/ /|_/ / /   / /   / / / /\__ \ / /
/ / / / /___/ /___/ /_/ /___/ // /
/_/ /_/\____/_____/\____//____//_/    version 5.1
Type 'citation("mclust")' for citing this R package in publications.
Warning message:
程辑包‘mclust’是用R版本3.2.4 来建造的
> d_clust <- Mclust(as.matrix(d), G=1:20)
> m.best <- dim(d_clust$z)[2]
> cat("model-based optimal number of clusters:", m.best, "\n")
model-based optimal number of clusters: 4
> plot(d_clust)
Model-based clustering plots:

1: BIC
2: classification
3: uncertainty
4: density
http://www.cda.cn/uploadfile/image/20180604/20180604070437_21992.png
方法五：

基于AP算法的聚类

> library(apcluster)

载入程辑包：‘apcluster’

The following object is masked from ‘package:stats’:

    heatmap

Warning message:
程辑包‘apcluster’是用R版本3.2.5 来建造的
> d.apclus <- apcluster(negDistMat(r=2), d)
> cat("affinity propogation optimal number of clusters:", length(d.apclus@clusters), "\n")
affinity propogation optimal number of clusters: 4
> #4 得出的分类个数
> heatmap(d.apclus)
> plot(d.apclus, d)
>
http://www.cda.cn/uploadfile/image/20180604/20180604070425_61413.png

阅读┊ 收藏 ┊ 喜欢 ▼ ┊打印┊举报/Report

前一篇：Python使用xlwt模块操作Excel的方法详解

后一篇：在Booking当数据科学家是怎样一种体验

新浪BLOG意见反馈留言板　欢迎批评指正