MATLAB K-means Clustering (一切源于自然)

http://blog.sina.com.cn/u/1728802184


(2019-07-31 21:42:17)

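K-means is a hard clustering algorithm and a typical representative of prototype-based, objective-function clustering methods. It takes a distance from the data points to the prototypes as the objective function to optimize, and derives its iterative update rule by finding the extremum of that function. K-means uses Euclidean distance as the similarity measure: starting from an initial vector of cluster centers V, it seeks the partition that minimizes the evaluation index J. The algorithm uses the sum-of-squared-errors criterion function as its clustering criterion.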
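The criterion described above can be written out explicitly. This is a sketch under assumed notation: the post names only V and J, so the symbols $x_i$ (the i-th data point), $C_k$ (the k-th cluster) and $v_k$ (its center, the k-th row of V) are introduced here for illustration.

```latex
J(V) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - v_k \rVert^2 ,
\qquad
\frac{\partial J}{\partial v_k} = 0
\;\Longrightarrow\;
v_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i
```

Setting the derivative of J with respect to each center to zero gives the familiar update rule: every center moves to the mean of the points currently assigned to it.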

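K-means partitions the N-by-P matrix X into K clusters so that the distances between objects within a cluster are as small as possible, while the distances between clusters are as large as possible.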
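To make the partitioning concrete, here is a minimal sketch of the batch (Lloyd) iteration written by hand. It is not the toolbox implementation of kmeans; the data, the seed and the variable names are assumptions chosen for illustration.

```matlab
% Minimal batch k-means (Lloyd iteration) sketch; NOT the toolbox implementation.
rng(0);                                   % fixed seed so the run is repeatable
X = [randn(20,2)+3; randn(20,2)-3];       % hypothetical N-by-P data, two blobs
K = 2;
N = size(X,1);
C = X(randperm(N,K), :);                  % K distinct rows of X as initial centers
idx = zeros(N,1);
for iter = 1:100
    % Assignment step: squared Euclidean distance from every point to every center
    D = zeros(N,K);
    for k = 1:K
        D(:,k) = sum((X - repmat(C(k,:),N,1)).^2, 2);
    end
    [~, newIdx] = min(D, [], 2);
    if isequal(newIdx, idx)
        break;                            % assignments stopped changing
    end
    idx = newIdx;
    % Update step: each non-empty cluster's center moves to the mean of its members
    for k = 1:K
        if any(idx == k)
            C(k,:) = mean(X(idx == k, :), 1);
        end
    end
end
% Sum-of-squared-errors criterion for the final assignment
J = sum(D(sub2ind(size(D), (1:N)', idx)));
fprintf('stopped after %d iterations, J = %.4f\n', iter, J);
```

The two steps alternate until no point changes cluster, which is the point where the within-cluster sum of squared distances can no longer be reduced by reassignment alone.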

kmeans treats NaNs as missing data, so any row of X that contains a NaN is ignored.
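A small sketch of the missing-data behavior, assuming the Statistics and Machine Learning Toolbox is installed. The data are hypothetical, and the comment about the value returned for the incomplete row reflects my understanding of the function rather than something stated in the post, so check it on your release.

```matlab
rng(6);
X = [randn(10,2)+2; randn(10,2)-2];   % hypothetical data
X(3,1) = NaN;                          % make row 3 incomplete
idx = kmeans(X, 2);                    % may warn that rows with missing data were ignored
disp(idx(3));                          % expected to show NaN; verify on your release
disp(sum(~isnan(idx)));                % number of rows that were actually clustered
```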

Usage:

Idx = kmeans(X,K)

[Idx,C] = kmeans(X,K)

[Idx,C,sumD] = kmeans(X,K)

[Idx,C,sumD,D] = kmeans(X,K)

[…] = kmeans(…,'Param1',Val1,'Param2',Val2,…)

Note that MATLAB is case-sensitive: the function name is kmeans, in lower case.

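Input and output arguments:

X: N-by-P data matrix (N points, P variables)

K: number of clusters to partition X into; a positive integer

Idx: N-by-1 vector holding the cluster index of each point

C: K-by-P matrix holding the K cluster centroid locations

sumD: K-by-1 vector holding, for each cluster, the sum of the distances from the points within that cluster to its centroid

D: N-by-K matrix holding the distance from each point to every centroid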
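The following sketch requests all four outputs and checks how they relate to each other. The data and seed are hypothetical; the only function used beyond base MATLAB is kmeans itself.

```matlab
rng(1);
X = [randn(30,2)+2; randn(30,2)-2];   % hypothetical data: N = 60 points, P = 2
K = 2;
[Idx, C, sumD, D] = kmeans(X, K);
disp(size(Idx));   % N-by-1 cluster indices
disp(size(C));     % K-by-P centroid locations
disp(size(D));     % N-by-K point-to-centroid distances
% Each entry of sumD should equal the sum of D over the members of that cluster
for k = 1:K
    fprintf('cluster %d: sumD = %.4f, recomputed = %.4f\n', ...
        k, sumD(k), sum(D(Idx == k, k)));
end
```

With the default distance, the entries of D and sumD are squared Euclidean distances, not plain Euclidean ones.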

[…] = kmeans(…,'Param1',Val1,'Param2',Val2,…)

The parameters Param1, Param2, etc. can mainly be set as follows:

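1. 'Distance' (distance measure)

'sqEuclidean' - squared Euclidean distance (the default)

'cityblock' - sum of absolute differences, also known as the L1 distance

'cosine' - one minus the cosine of the angle between points, treating each point as a vector

'correlation' - one minus the sample correlation between points, treating each point as a sequence of values (for example, a time series)

'Hamming' - proportion of bits that differ; only suitable for binary data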
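A short sketch of switching the distance measure. The data are hypothetical. The remark about medians is my understanding of how the centroid is defined for the city-block distance and is printed for inspection rather than asserted.

```matlab
rng(2);
X = [randn(25,2)+2; randn(25,2)-2];                 % hypothetical real-valued data
[idxE, CE] = kmeans(X, 2, 'Distance', 'sqeuclidean');
[idxC, CC] = kmeans(X, 2, 'Distance', 'cityblock');
disp(CE);                                           % centroids under squared Euclidean
disp(CC);                                           % centroids under city-block
% For city-block, compare a centroid with the component-wise median of its members
disp(median(X(idxC == 1, :), 1));

B = double(rand(40,8) > 0.5);                       % hypothetical binary data
idxH = kmeans(B, 2, 'Distance', 'hamming');         % Hamming needs 0/1 data
```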

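2. 'Start' (method for choosing the initial centroid positions, sometimes called "seeds")

'plus' - the default. Selects K observations from X using the k-means++ algorithm: the first cluster center is chosen uniformly at random from X, and each subsequent center is chosen at random from the remaining data points, with a probability that increases with the point's distance from the nearest center already chosen.

'sample' - selects K observations from X at random as the centroids.

'uniform' - generates K centroids uniformly at random within the range of X. Not valid for the Hamming distance.

'cluster' - performs a preliminary clustering phase on a random 10% subsample of X (this preliminary phase is itself initialized with 'sample').

matrix - a K-by-P matrix of starting locations, used as the set of initial centroids. In this case K can be passed as [], and kmeans infers K from the first dimension of the matrix. A 3-D array can also be supplied, in which case the size of its third dimension implies the value of 'Replicates'.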
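The 'Start' variants can be sketched as follows. The data and the starting matrices C0 and S are hypothetical values picked for illustration.

```matlab
rng(3);
X = [randn(25,2)+2; randn(25,2)-2];            % hypothetical data
C0 = [2 2; -2 -2];                             % K-by-P matrix of initial centroids
[idx1, C1] = kmeans(X, [], 'Start', C0);       % K is inferred from size(C0,1)

% K-by-P-by-3 array: three sets of starting points, so three replicates are run
S = cat(3, C0, [1 1; -1 -1], [0 3; 0 -3]);
[idx2, C2] = kmeans(X, [], 'Start', S);

[idx3, C3] = kmeans(X, 2, 'Start', 'sample');  % K random rows of X as seeds
```

Supplying an explicit matrix makes a run repeatable without fixing the random seed, which can be handy when comparing other settings.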

3. 'Replicates' - number of times to repeat the clustering, each time with a new set of initial centroids. A positive integer; the default is 1.

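4. 'EmptyAction' - action to take if a cluster loses all of its member observations. Choices are:

'singleton' - create a new cluster consisting of the one observation furthest from its centroid.

'error' - treat an empty cluster as an error.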
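A sketch combining 'Replicates' and 'EmptyAction'. The three-blob data set is hypothetical. Because the starting points are random, no particular ordering of the two totals is claimed; the values are only printed for comparison.

```matlab
rng(4);
X = [randn(30,2)+2; randn(30,2)-2; randn(30,2)+repmat([6 -6],30,1)];  % hypothetical data
[~, ~, sumD1]  = kmeans(X, 3, 'Replicates', 1);
[~, ~, sumD10] = kmeans(X, 3, 'Replicates', 10, 'EmptyAction', 'singleton');
% kmeans returns the replicate with the smallest total within-cluster distance
fprintf('total distance, 1 replicate:   %.4f\n', sum(sumD1));
fprintf('total distance, 10 replicates: %.4f\n', sum(sumD10));
```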

5. 'Options' - Options for the iterative algorithm used to minimize the fitting criterion, as created by STATSET. Choices of STATSET parameters are:

'Display' - Level of display output. Choices are 'off' (the default), 'iter', and 'final'.

'MaxIter' - Maximum number of iterations allowed. Default is 100.

'UseParallel' - If true and if a parpool of the Parallel Computing Toolbox is open, compute in parallel. If the Parallel Computing Toolbox is not installed, or a parpool is not open, computation occurs in serial mode. Default is 'default', meaning serial computation.

'UseSubstreams' - Set to true to compute in parallel in a reproducible fashion. Default is false. To compute reproducibly, set Streams to a type allowing substreams: 'mlfg6331_64' or 'mrg32k3a'.

'Streams' - These fields specify whether to perform clustering from multiple 'Start' values in parallel, and how to use random numbers when generating the starting points. For information on these fields see PARALLELSTATS.

NOTE: If 'UseParallel' is TRUE and 'UseSubstreams' is FALSE, then the length of 'Streams' must equal the number of workers used by kmeans. If a parallel pool is already open, this will be the size of the parallel pool. If a parallel pool is not already open, then MATLAB may try to open a pool for you (depending on your installation and preferences). To ensure more predictable results, it is best to use the PARPOOL command and explicitly create a parallel pool prior to invoking kmeans with 'UseParallel' set to TRUE.

6. 'OnlinePhase' - Flag indicating whether kmeans should perform an "on-line update" phase in addition to a "batch update" phase. The on-line phase can be time consuming for large data sets, but guarantees a solution that is a local minimum of the distance criterion, i.e., a partition of the data where moving any single point to a different cluster increases the total sum of distances. 'off' (the default) or 'on'.
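The iteration options and the on-line phase can be tried out as in the sketch below. The data, the seed and the MaxIter value of 200 are hypothetical choices; only statset and kmeans parameters named in the text above are used.

```matlab
rng(5);
X = [randn(40,2)+1; randn(40,2)-1];                    % hypothetical, overlapping blobs
opts = statset('Display', 'final', 'MaxIter', 200);    % report once per run, allow 200 iterations
[idxOff, ~, sumOff] = kmeans(X, 2, 'Options', opts, 'OnlinePhase', 'off');
[idxOn,  ~, sumOn ] = kmeans(X, 2, 'Options', opts, 'OnlinePhase', 'on');
fprintf('batch only:      total distance = %.4f\n', sum(sumOff));
fprintf('batch + on-line: total distance = %.4f\n', sum(sumOn));
```

On a small data set like this the two runs often end at the same partition; the extra phase matters more when the batch phase stalls at a partition that a single-point move could still improve.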

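Example:

% Two Gaussian blobs centered at (1,1) and (-1,-1)
X = [randn(20,2)+ones(20,2); randn(20,2)-ones(20,2)];
opts = statset('Display','final');
% City-block distance, 5 replicates; the best replicate is returned
[cidx, ctrs] = kmeans(X, 2, 'Distance','cityblock', ...
    'Replicates',5, 'Options',opts);
% Cluster 1 in red, cluster 2 in blue, centroids as black crosses
plot(X(cidx==1,1),X(cidx==1,2),'r.', ...
    X(cidx==2,1),X(cidx==2,2),'b.', ctrs(:,1),ctrs(:,2),'kx');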
