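K-means is a hard clustering algorithm and a typical representative of prototype-based, objective-function clustering methods. It takes a distance from the data points to the prototypes as the objective function to optimize, and derives the update rules of the iteration by finding the extremum of that function. K-means uses Euclidean distance as its similarity measure: for a given initial vector of cluster centers V, it seeks the optimal partition that minimizes the evaluation index J. The algorithm uses the sum of squared errors as its clustering criterion function.
K-means partitions the N*P matrix X into K clusters so that the distances between objects within a cluster are as small as possible, while the distances between clusters are as large as possible.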
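Written out, the sum-of-squared-errors criterion can be expressed as follows. The symbols x_i (the i-th row of X), S_k (the set of points assigned to cluster k) and v_k (the k-th cluster center, one row of V) are notation assumed here for illustration; they do not appear in the original text.

```latex
J = \sum_{k=1}^{K} \sum_{x_i \in S_k} \lVert x_i - v_k \rVert^2,
\qquad
v_k = \frac{1}{|S_k|} \sum_{x_i \in S_k} x_i
```

For a fixed assignment of points to clusters, J is smallest when each v_k is the mean of the points in S_k, which is where the name "k-means" comes from.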
kmeans treats NaNs as missing data, so it ignores any rows of X that contain NaNs.
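The iteration that minimizes the squared-error criterion can be sketched in a few lines of plain MATLAB. This is only a minimal illustration of the batch (Lloyd) update under simple assumptions: made-up test data, random rows of X as initial centers, and squared Euclidean distance. It is not the code inside kmeans. Rows containing NaN are dropped first, mirroring the behavior described above.

```matlab
% Minimal sketch of batch k-means (Lloyd's iterations); NOT MATLAB's kmeans.
rng(1);                                   % fixed seed so the run is repeatable
X = [randn(20,2)+3; randn(20,2)-3];       % made-up data: two separated groups
K = 2;
X = X(~any(isnan(X),2), :);               % drop rows that contain NaN
N = size(X,1);
C = X(randperm(N,K), :);                  % assumed init: K random rows of X
idx = zeros(N,1);
for iter = 1:100
    % Assignment step: squared Euclidean distance to every centroid
    D = zeros(N,K);
    for k = 1:K
        D(:,k) = sum((X - repmat(C(k,:),N,1)).^2, 2);
    end
    [~, newIdx] = min(D, [], 2);
    if isequal(newIdx, idx), break; end   % assignments unchanged: converged
    idx = newIdx;
    % Update step: each centroid becomes the mean of its member points
    for k = 1:K
        if any(idx==k)
            C(k,:) = mean(X(idx==k,:), 1);
        end
    end
end
% Criterion J: sum of squared distances from each point to its own centroid
J = 0;
for k = 1:K
    members = X(idx==k,:);
    J = J + sum(sum((members - repmat(C(k,:),size(members,1),1)).^2));
end
```

Each pass can only lower J or leave it unchanged, so the loop stops at a local minimum that depends on the initial centers.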
Usage:
Idx = kmeans(X,K)
[Idx,C] = kmeans(X,K)
[Idx,C,sumD] = kmeans(X,K)
[Idx,C,sumD,D] = kmeans(X,K)
[…] = kmeans(…,'Param1',Val1,'Param2',Val2,…)
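Input and output arguments:
X: the N*P data matrix
K: the number of clusters to partition X into; an integer
Idx: an N*1 vector holding the cluster index of each point
C: a K*P matrix holding the locations of the K cluster centroids
sumD: a K*1 vector holding, for each cluster, the sum of the distances from the points within that cluster to its centroid
D: an N*K matrix holding the distance from each point to every centroid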
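As a quick check of the four outputs described above, the following sketch calls kmeans on made-up data and inspects the result. It assumes the Statistics and Machine Learning Toolbox is installed; the data and the variable names are illustrative only.

```matlab
rng(0);                                   % repeatable random data
X = [randn(30,2)+2; randn(30,2)-2];       % 60 points, P = 2 columns
K = 2;
[idx, C, sumD, D] = kmeans(X, K);

% idx: one cluster label per row of X
% C:   one centroid per row, K rows and P columns
% D:   distance from every point to every centroid, N rows and K columns
disp(size(idx)); disp(size(C)); disp(size(D));

% sumD(k) should equal the sum of D(:,k) over the points assigned to cluster k
check = zeros(K,1);
for k = 1:K
    check(k) = sum(D(idx==k, k));
end
```

Comparing check with sumD(:) is a simple way to confirm how the outputs relate to one another.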
The parameters Param1, Param2, and so on in […] = kmeans(…,'Param1',Val1,'Param2',Val2,…) can mainly be set as follows:
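1. 'Distance' (distance measure)
'sqEuclidean' - squared Euclidean distance (the default)
'cityblock' - sum of absolute differences, also known as L1
'cosine' - one minus the cosine of the included angle between points (treated as vectors)
'correlation' - one minus the sample correlation between points (treated as sequences of values)
'Hamming' - percentage of bits that differ (only suitable for binary data)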
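To make the five measures concrete, the sketch below evaluates each one by hand for a single point x and a single centroid c. The two vectors are made up for illustration, and the formulas follow the definitions listed above as I understand them; they are not taken from the kmeans source code.

```matlab
x = [1 0 1 1];                            % one observation (binary, so Hamming applies)
c = [1 1 0 1];                            % one centroid

dSq   = sum((x - c).^2);                  % 'sqEuclidean': 2
dCity = sum(abs(x - c));                  % 'cityblock':   2
dCos  = 1 - (x*c') / (norm(x)*norm(c));   % 'cosine':      1 - 2/3

xc = x - mean(x);                         % center each vector before correlating
cc = c - mean(c);
dCorr = 1 - (xc*cc') / (norm(xc)*norm(cc));  % 'correlation': 1 - (-1/3)

dHam  = mean(x ~= c);                     % 'Hamming': 2 of 4 bits differ, 0.5
```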
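2. 'Start' (method used to choose the initial centroid positions, sometimes called "seeds")
'plus' - The default. Selects K observations from X according to the k-means++ algorithm: the first cluster center is chosen uniformly at random from X, and each subsequent center is chosen randomly from the remaining data points with probability proportional to its distance from the closest center already chosen.
'sample' - Selects K observations from X at random as the centroids.
'uniform' - Selects K points uniformly at random from the range of X. Not valid for Hamming distance.
'cluster' - Performs a preliminary clustering phase on a random 10% subsample of X. This preliminary phase is itself initialized using 'sample'.
matrix - A K*P matrix of starting locations, used as the set of initial centroids. In this case you can pass [] for K, and kmeans infers K from the first dimension of the matrix. You can also supply a 3-D array, which implies a value for 'Replicates' from the size of the array's third dimension.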
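The matrix form of 'Start' is the least obvious option, so here is a short sketch of both the 2-D and the 3-D variants. The data and the starting centroids C0 and S are hypothetical values chosen for illustration; the Statistics and Machine Learning Toolbox is assumed.

```matlab
rng(0);
X = [randn(30,2)+2; randn(30,2)-2];       % made-up data, two groups

% 2-D case: a K*P matrix of starting centroids; K is passed as []
C0 = [2 2; -2 -2];                        % hypothetical initial centroids
[idx, C] = kmeans(X, [], 'Start', C0);    % K is inferred as size(C0,1) = 2

% 3-D case: each page is one set of starting centroids,
% so two pages imply two replicates
S = cat(3, C0, [0 3; 0 -3]);
[idx2, C2] = kmeans(X, [], 'Start', S);
```

Supplying the starting centroids explicitly makes a run repeatable without relying on the random number generator.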
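3. 'Replicates' (number of times to repeat the clustering)
Each repetition starts from a new set of initial centroids. The value is a positive integer; the default is 1.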
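4. 'EmptyAction' - action to take if a cluster loses all of its member observations. Choices are:
'singleton' - create a new cluster consisting of the one observation furthest from its centroid.
'error' - treat an empty cluster as an error.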
5. 'Options' - Options for the iterative algorithm used to minimize the fitting criterion, as created by STATSET. Choices of STATSET parameters are:
'Display' - Level of display output. Choices are 'off' (the default), 'iter', and 'final'.
'MaxIter' - Maximum number of iterations allowed. Default is 100.
'UseParallel' - If true and if a parpool of the Parallel Computing Toolbox is open, compute in parallel. If the Parallel Computing Toolbox is not installed, or a parpool is not open, computation occurs in serial mode. Default is 'default', meaning serial computation.
'UseSubstreams' - Set to true to compute in parallel in a reproducible fashion. Default is false. To compute reproducibly, set Streams to a type allowing substreams: 'mlfg6331_64' or 'mrg32k3a'.
'Streams' - These fields specify whether to perform clustering from multiple 'Start' values in parallel, and how to use random numbers when generating the starting points. For information on these fields see PARALLELSTATS.
NOTE: If 'UseParallel' is TRUE and 'UseSubstreams' is FALSE, then the length of 'Streams' must equal the number of workers used by kmeans. If a parallel pool is already open, this will be the size of the parallel pool. If a parallel pool is not already open, then MATLAB may try to open a pool for you (depending on your installation and preferences). To ensure more predictable results, it is best to use the PARPOOL command and explicitly create a parallel pool prior to invoking kmeans with 'UseParallel' set to TRUE.
6. 'OnlinePhase' - Flag indicating whether kmeans should perform an "on-line update" phase in addition to a "batch update" phase. The on-line phase can be time consuming for large data sets, but guarantees a solution that is a local minimum of the distance criterion, i.e., a partition of the data where moving any single point to a different cluster increases the total sum of distances. Choices are 'off' (the default) or 'on'.
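The iteration options and the on-line phase can be combined in one call, as in the sketch below. The data are made up, the value 200 for 'MaxIter' is an arbitrary choice for illustration, and the Statistics and Machine Learning Toolbox is assumed. The parallel settings are left out because they also need the Parallel Computing Toolbox.

```matlab
rng(0);
X = [randn(50,2)+1; randn(50,2)-1];       % made-up data, two overlapping groups

% Raise the iteration cap and print a summary of the final iteration
opts = statset('MaxIter',200, 'Display','final');

% Run the batch phase followed by the on-line update phase
[idx, C, sumD] = kmeans(X, 2, 'Options',opts, 'OnlinePhase','on');

total = sum(sumD);                        % total within-cluster distance
```

On a data set this small the on-line phase costs almost nothing; on large data it may be worth timing the call with and without it.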
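Example:
% Two Gaussian clouds centered at (1,1) and (-1,-1)
X = [randn(20,2)+ones(20,2);
     randn(20,2)-ones(20,2)];
% Display the result of the final iteration
opts = statset('Display','final');
% City-block distance, repeated 5 times from new initial centroids
[cidx, ctrs] = kmeans(X, 2, 'Distance','cityblock', ...
    'Replicates',5, 'Options',opts);
% Cluster 1 in red, cluster 2 in blue, centroids as black crosses
plot(X(cidx==1,1),X(cidx==1,2),'r.', ...
     X(cidx==2,1),X(cidx==2,2),'b.', ...
     ctrs(:,1),ctrs(:,2),'kx');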