模式识别与人工智能之五.pptx_文客久久网wenke99.com

资源描述

1、Pattern Recognition & artificial IntelligenceLecture 5: 聚类算法（一）,主要内容,聚类的定义聚类算法分类典型聚类算法讲解,聚类的定义,聚类的定义,典型的非监督式机器学习数据类别不被事先标识通过学习模型推断出数据的一些内在结构，进而进行聚类。,聚类算法分类,聚类算法分类,划分方法：首先得到初始的K个划分的集合。如K-平均、K-中心点、CLARANS以及对它们的改进。层次方法：创建给定数据对象集合的一个层次性的分解。根据层次分解的过程可以分为凝聚（自底向上）或分裂（自顶向下）。基于密度的方法：根据密度的概念来聚类对象，如DBSCAN、DENC

2、LUE、OPTICS。,聚类算法分类,基于网格的方法：首先将对象空间量化为有限数目的单元，形成网格结构，然后在网格结构上进行聚类，如STING、CLIQUE、WaveCluster。基于模型的方法：为每个簇假设一个模型，发现数据对模型的最好匹配，如COBWEB、CLASSIT，GCM, AutoClass, SOM。基于降维的方法：如Spectral clustering，Ncut等,聚类算法分类,划分聚类法 K-means,k-means算法，亦称k-均值或k-平均，是一种基于质心的启发式聚类算法。基本思想：通过迭代把数据集划分为不同的类别（或称簇），使得评价聚类性能的准则函数达到最优，使得

3、每个聚类类内紧凑，类间独立。对于连续型属性具有较好的聚类效果，不适合处理离散型属性。,划分聚类法 K-means,算法概述,平方误差和准则函数即SSE（sum of the squared error） SSE是数据库中所有对象的平方误差总和，其中为数据对象；为簇的平均值。这个准则函数使得生成的簇尽可能的紧凑和独立。,划分聚类法 K-means,准则函数,划分聚类法 K-means,算法流程,迭代计算中心点,收敛！,选择初始中心点,划分聚类法 K-means,Simple Example,划分聚类法 K-means,算法实现的具体流程,影响聚类效果！,一般采用欧氏距离、曼哈顿距离或者

4、名考斯距离的一种，作为样本间的相似性度量,划分聚类法 K-means,算法特点：影响主要因素,划分聚类法 K-means,不足：k-means算法只有在簇的平均值被定义的情况下才能使用。k-means算法的不足之处在于它要多次扫描数据库。k-means算法只能找出球形的类，而不能发现任意形状的类。初始质心的选择对聚类结果有较大的影响。k-means算法对于噪声和孤立点数据是敏感的，少量的该类数据能够对平均值产生极大的影响。,问题描述：如图所示，一只遥望大海的小狗。此图为100100像素的JPG图片，每个像素可以表示为三维向量（分别对应红绿蓝三基色）。要求使用k-means算法，将图片分割为

5、合适的背景区域（三个）和前景区域（小狗）。,划分聚类法 K-means,应用实例,Matlab程序实现,function M, j, e = kmeans(X, K, Max_Its)N,D=size(X);I=randperm(N);M=X(I(1:K),:);Mo = M;for n=1:Max_Its for k=1:K Dist(:,k) = sum(X - repmat(M(k,:),N,1).2,2); end i, j=min(Dist, , 2); for k=1:K if size(find(j=k)0 M(k, :) = mean(X(find(j=k), :); end

6、end,划分聚类法 K-means,Z = zeros(N,K); for m=1:N Z(m,j(m) = 1; end e = sum(sum(Z.*Dist)./N); fprintf(%d Error = %fn, n, e); Mo = M;end,Matlab程序实现（续）,划分聚类法 K-means,close all;clear all; clc;C_Segments=5;img_original = imread(dog.png);%读入图像figure,imshow(img_original),title(原始图像); %显示原图像m,n,depth=size(img_or

7、iginal ); % 获取图像的长宽% 将图像进行RGB3通道分解% 将RGB分量各转为kmeans使用的数据格式n行，一样本A = reshape(img_original(:, :, 1), m*n, 1); B = reshape(img_original(:, :, 2), m*n, 1);C = reshape(img_original(:, :, 3), m*n, 1);dat = A B C; % r g b分量组成样本的特征，每个样本有三个属性值，共width*height个样本cRGB = kmeans(double(dat), C_Segments, 20);rRGB =

8、 reshape(cRGB, m, n); % 反向转化为图片形式figure, imshow(label2rgb(rRGB),),title(分类结果); % 显示分割结果,划分聚类法 K-means,分割后的效果,应用实例,划分聚类法 K-means,划分聚类法 K-means,应用实例,注：聚类中心个数为5，最大迭代次数为10。,思路将聚类问题中的类定义为模糊集合，用模糊集的隶属度函数定量描述样本点与类之间的从属关系，并通过寻找使目标函数最小化的隶属度函数，实现聚类。算法关键点隶属度函数的数学定义模糊类中心的更新,划分聚类法模糊C均值聚类 fuzzy c-means,变量定义数据集X=

9、x1, x2, , xnc个模糊类样本xk对第i类的模糊隶属度为uik，满足条件隶属度矩阵U=uik 第i类的类中心为vi聚类中心矩阵为V=v1, v2, , vc建立基于隶属度矩阵U和聚类中心矩阵V的目标函数Jm(U,V),划分聚类法模糊C均值聚类 fuzzy c means,目标函数最小化求解,划分聚类法模糊C均值聚类 fuzzy c means,这里m1,是隶属度的加权指数; 为第i个聚类中心与第k个数据样本之间的欧几里得距离;,限定条件：,最小化上述函数可以用拉格朗日乘子法求解,目标函数最小化求解对上式进行求导，使其达到最小的必要条件为：,划分聚类法模糊C均值聚类 fuzzy c

10、 means,公式（1）,公式（2）,模糊C均值聚类算法具体步骤,划分聚类法模糊C均值聚类 fuzzy c-means (FCM),确定聚类类别数目c、加权指标m，用01的随机值初始化隶属矩阵U(0) ，并满足令迭代次数为b，b=0,1,2bmax根据公式（2）计算各个类的中心Vi(b);根据公式（1）更新U(b)为U(b+1)；比较U(b)和U(b+1)之间的差别，如果或者迭代达到最大次数，则聚类结束；否则，置b=b+1并返回第3步。,划分聚类法模糊C均值聚类 fuzzy c means,MATLAB中提供了FCM函数：center, U, obj_fcn = fcm(data, cl

11、uster_n, options);% 输入：% data - nxm矩阵,表示n个样本,每个样本具有m的维特征值% N_cluster - 标量,表示聚合中心数目,即类别数% options - 4x1矩阵，其中% options(1): 隶属度矩阵U的指数，1 (缺省值: 2.0)% options(2): 最大迭代次数 (缺省值: 100)% options(3): 隶属度最小变化量,迭代终止条件 (缺省值: 1e-5)% options(4): 每次迭代是否输出信息标志 (缺省值: 1)% 输出：% center - 聚类中心% U - 隶属度矩阵% obj_fcn - 目标函数值,划

12、分聚类法模糊C均值聚类 fuzzy c means,应用实例,close all;clear all; clc;C_Segments=4;img_original = imread(pepper.png);%读入图像figure,imshow(img_original),title(原始图像); %显示原图像m,n,p=size(img_original ); % 获取图像的长宽% 将图像进行RGB3通道分解% 将RGB分量各转为kmeans使用的数据格式n行，一样本A = reshape(img_original(:, :, 1), m*n, 1); B = reshape(img_ori

13、ginal(:, :, 2), m*n, 1);C = reshape(img_original(:, :, 3), m*n, 1);dat = A B C; % r g b分量组成样本的特征，每个样本有三个属性值，共width*height个样本,聚类法模糊C均值聚类 fuzzy c means,应用实例,center,U,fct = fcm (double(dat),C_Segments);, label = max(U, ,1);figure; LAB = reshape(label,m,n);imshow(LAB,)figure; map = 0 0 0; center(1,1)/2

14、55, center(1,2)/255,center(1,3)/255;imshow(LAB=1),colormap(map)figure; map = 0 0 0; center(2,1)/255, center(2,2)/255,center(2,3)/255;imshow(LAB=2),colormap(map)figure; map = 0 0 0; center(3,1)/255, center(3,2)/255,center(3,3)/255;imshow(LAB=3),colormap(map)figure; map = 0 0 0; center(4,1)/255, cente

15、r(4,2)/255,center(4,3)/255;imshow(LAB=4),colormap(map),聚类法模糊C均值聚类 fuzzy c means,应用实例,分割结果,划分聚类法 K-medoids,k-中心点(k-Medoids): 不采用簇中对象的平均值作为参照点, 而是选用簇中位置最中心的对象, 即中心点(medoid)作为参照点.,划分聚类法 K-medoids,基本思想:找聚类中的代表对象(中心点)PAM (Partitioning Around Medoids)首先为每个簇随意选择一个代表对象, 剩余的对象根据其与代表对象的距离分配给最近的一个簇; 然后反复地用非代表

16、对象来替代代表对象，以改进聚类的质量 PAM 对于较小的数据集非常有效, 但不能很好地扩展到大型数据集。,划分聚类法 K-medoids,为了判定一个非代表对象Orandom 是否是当前一个代表对象Oj的好的替代, 对于每一个非代表对象p,考虑下面的四种情况：第一种情况：p当前隶属于代表对象 Oj. 如果Oj被Orandom所代替, 且p离Oi最近, ij, 那么p被重新分配给Oi 第二种情况：p当前隶属于代表对象 Oj. 如果Oj 被Orandom代替, 且p离Orandom最近, 那么p被重新分配给Orandom,划分聚类法 K-medoids,第三种情况：p当前隶属于Oi，ij。如果O

17、j被Orandom代替，而p仍然离Oi最近，那么对象的隶属不发生变化第四种情况：p当前隶属于Oi，ij。如果Oj被Orandom代替，且p离Orandom最近，那么p被重新分配给Orandom,划分聚类法 K-medoids,算法: k-中心点(1) 随机选择k个对象作为初始的代表对象；(2) repeat(3) 指派每个剩余的对象给离它最近的代表对象所代表的簇；(4) 随意地选择一个非代表对象Orandom；(5) 计算用Orandom代替Oj的总距离E, 如果E比取代前下降则则用Orandom替换Oj，形成新的k个代表对象的集合，返回（4）；(6) until 不发生变化(7) 如果所

18、有非代表对象都无法取代已存在的簇中心，则结束替代过程，并输出结果,划分聚类法 K-medoids,0,1,2,3,4,5,6,7,8,9,10,0,1,2,3,4,5,6,7,8,9,10,K=2,Arbitrary choose k object as initial medoids,Assign each remaining object to nearest medoids,Randomly select a nonmedoid object,Oramdom,Compute total cost of swapping,Total Cost = 26,Swapping O and Ora

19、mdom If quality is improved.,Do loopUntil no change,划分聚类法 K-medoids,当存在噪音和孤立点时, PAM 比 k-平均方法更健壮. 这是因为中心点不象平均值那么容易被极端数据影响 PAM对于小数据集工作得很好, 但不能很好地用于大数据集每次迭代O(k(n-k)2 )其中 n 是数据对象数目, k 是聚类数,CLARA (Clustering LARge Applications),Improvement over PAMFinds medoids in a sample from the datasetIdea: If the s

20、amples are sufficiently random, the medoids of the sample approximate the medoids of the datasetHeuristics: 5 samples of size 40+2k gives satisfactory resultsWorks well for large datasets (n=1000, k=10),划分聚类法 K-medoids,CLARA (Clustering LARge Applications),划分聚类法 K-medoids,Draw multiple samples of th

21、e datasetSample should be able to represent the datasetApply PAM to each sampleReturn the best clustering,Set mincost to MAXIMUM;Repeat q times / draws q samplesCreate S by drawing s objects randomly from D;Generate the set of medoids M from S by applying the PAM algorithm;Compute cost(M,D)If cost(M

22、,D) mincost Mincost = cost(M, D); Bestset = M;End if;End repeat;Return Bestset;,Algorithms:,CLARA (Clustering LARge Applications),划分聚类法 K-medoids,s: the size of the samplek: number of clustersn:number of objects,Complexity of each iteration is: O(ks2+k(n-k),CLARA (Clustering LARge Applications),划分聚类

23、法 K-medoids,PAM find the best K medoids from a given dataset;CLARA finds the best kMedoids from several selected samples.,Problems:The best k-medoids may not be selected during the sampling Process, in this case, CLARA will never find the best clusteringIf the sampling is biased, we can not have a g

24、ood clustering.Trade off-of efficiency.,CLARANS (Clustering Large Applications based on Randomized Search),划分聚类法 K-medoids,It was proposed to improve the quality and scalability of CLARAIt combines sampling techniques with PAMIt does not confine itself to any sample at a given timeIt draws a sample

25、with some randomness in each step of the search,CLARANS draws sample in solution space dynamicallyA solution is a set of k medoidsThe solutions space contains Cnk solutions in totalThe solution space can be represented by a graph where every node is a potential solution, i.e., a set of k medoids,CLA

26、RANS,划分聚类法 K-medoids,Every node is a potential solution (k-medoid)Every node is associated with a squared errorTwo nodes are adjacent if they differ by one medoidEvery node has k(nk) adjacent nodes,O1,O2,Ok,Ok+1,O2,Ok,Ok+n,O2,Ok,n-k neighbors for one medoid,k(n k) neighbors for one node,划分聚类法 K-medo

27、ids: CLARANS,Start with a randomly selected node, check at most m neighbors randomlyIf a better adjacent node is found, moves to node and continue; otherwise, current node is local optimum; re-starts with another randomly selected node to search for another local optimumWhen h local optimum have bee

28、n found, returns best result as overall result,划分聚类法 K-medoids: CLARANS,Compare no more than maxneighbor times,Best Node,划分聚类法 K-medoids: CLARANS,Algorithms:,Set mincost to MAXIMUM; For i=1 to h do / find h local optimumRandomly select a node as the current node C in the graph;J = 1; / counter of ne

29、ighborsRepeatRandomly select a neighbor N of C;If Cost(N,D) mUpdate mincost with Cost(C,D) if applicable End for;End ForReturn bestnode;,划分聚类法 K-medoids: CLARANS,Algorithms:,Notes:,Each vertex is a set of k-representative objects (means, modes, medoids)Each iteration produces a new set of k-represen

30、tative objects with lower overall dissimilarityIterations correspond to a hill descent process in a landscape (graph) of vertices,划分聚类法 K-medoids: CLARANS,Advantages:CLARANS is more effective than both PAM and CLARAHandles outliersDisadvantages:Computational complexity of CLARANS is O(N2). The clustering quality depends on the sampling method,小结,掌握划分聚类算法的基本思想和局限性掌握常用的划分聚类的思想和优缺点以及适用范围 k-means, FCM, k-medoids (PAM, CLARA, CLARANS) 能够利用代码实现上述方法，并且结合实际应用设计相应的输入输出,

展开阅读全文