March 01

Notes on Spectral Clustering

References: "Spectral Graph Theory" by Fan R.K. Chung; "Spectra of Graphs"; "The Laplacian Spectrum of Graphs"; "A Tutorial on Spectral Clustering"

一、Spectra of Graphs

Let G be an undirected graph. The unnormalized graph Laplacian matrix of G is defined as L = D - A, where D is the diagonal degree matrix whose entry D(i,i) is the degree of vertex $v_i$ in G, and A is the adjacency matrix.
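The definition is easy to check on a concrete example; here is a minimal numpy sketch on a 3-vertex path graph (the graph itself is just made up for illustration):

```python
import numpy as np

# The definition L = D - A on a 3-vertex path graph 1 - 2 - 3.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))   # degree matrix: degrees are 1, 2, 1
L = D - A                    # unnormalized graph Laplacian
# L equals [[1, -1, 0], [-1, 2, -1], [0, -1, 1]]
```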

A few simple results, for now just copied from the spectral clustering tutorial; more can be added later when needed:

For any vector $f\in \mathbb{R}^n$ (a vector f in n-dimensional real space), we have

$f'Lf=\frac{1}{2} \sum\limits_{i,j=1}^n w_{ij}(f_i-f_j)^2$    (Eq. 1)

Proof: $f'Lf=f'Df-f'Af=\sum_{i=1}^n d_i f_i^2-\sum_{i,j=1}^n f_i f_j w_{ij}$. Factor out a 1/2 and split $\sum_{i=1}^n d_if_i^2$ into two copies, renaming the index of one copy from i to j, which gives $f'Lf = \frac{1}{2}\left(\sum_{i=1}^n d_i f_i^2 - 2\sum_{i,j=1}^n f_i f_j w_{ij} + \sum_{j=1}^n d_j f_j^2\right)$. Since $d_i=\sum_j w_{ij}$, we have $\sum_{i=1}^n d_i f_i^2 = \sum_i \sum_j w_{ij} f_i^2$ (and symmetrically for the j-indexed sum), so the expression equals $\frac{1}{2} \sum_{i,j=1}^n w_{ij}(f_i-f_j)^2$.

Corollary 1: By the definition of a positive semidefinite matrix ($f'Lf \ge 0$ for all f), Eq. 1 implies that L is positive semidefinite: by the definition of the adjacency matrix $w_{ij} \ge 0$, and $(f_i-f_j)^2$ is a square, so L satisfies $f'Lf \ge 0$.

Corollary 2: L has n non-negative eigenvalues, the smallest of which is 0. Non-negativity follows from positive semidefiniteness; the eigenvalue 0 is attained because every row of L sums to zero (since $d_i = \sum_j w_{ij}$), so the constant one vector $\mathbb{1}$ satisfies $L\mathbb{1}=0$.
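Eq. 1 and both corollaries are easy to verify numerically; a small sketch on a random weighted graph (the variable names are my own, not from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
W = rng.random((n, n))
W = (W + W.T) / 2            # symmetric non-negative weights
np.fill_diagonal(W, 0.0)     # no self-loops
L = np.diag(W.sum(axis=1)) - W

# Eq. 1: f'Lf = 1/2 * sum_ij w_ij (f_i - f_j)^2
f = rng.standard_normal(n)
lhs = f @ L @ f
rhs = 0.5 * sum(W[i, j] * (f[i] - f[j]) ** 2
                for i in range(n) for j in range(n))
assert np.isclose(lhs, rhs)

# Corollary 2: all eigenvalues >= 0 and the smallest is 0,
# because every row of L sums to zero (L @ 1 = 0).
vals = np.linalg.eigvalsh(L)       # ascending order
assert vals[0] > -1e-10
assert np.allclose(L @ np.ones(n), 0.0)
```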

The references listed at the top contain many more results. The Laplacian matrix also shows up in lots of papers, e.g. [NIPS'07] Learning with Local and Global Consistency, but its precise physical meaning is still hard to grasp; I'll dig into it when I actually need to.

二、Spectral Clustering Algorithms

In practice, spectral clustering turns out to be surprisingly simple. As the tutorial paper notes, it is not necessarily better than other approximation algorithms for graph partitioning, but "its popularity is mainly due to the fact that it results in a standard linear algebra problem which is simple to solve." The concrete steps are as follows:

I have not tried this in practice, but judging from the description it should be straightforward; the procedure should be:

Suppose we want to cluster n points into k clusters. Stack the first k eigenvectors of L (those with the smallest eigenvalues) as columns to form an n*k matrix U. Each row $y_i$ of U is then a point in k-dimensional space; just run k-means on these rows and we are done!
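The whole pipeline above fits in a few lines; here is a minimal sketch (the function name and the deterministic farthest-point k-means initialization are my own choices to keep it self-contained, not part of the tutorial's algorithm):

```python
import numpy as np

def spectral_clustering(W, k, n_iter=100):
    """Cluster n points, given a symmetric affinity matrix W, into k clusters."""
    L = np.diag(W.sum(axis=1)) - W       # unnormalized Laplacian
    _, vecs = np.linalg.eigh(L)          # eigenvectors, ascending eigenvalues
    U = vecs[:, :k]                      # n x k: first k eigenvectors as columns

    # plain k-means on the rows y_i of U, farthest-point initialization
    centers = [U[0]]
    for _ in range(1, k):
        d = np.min([((U - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(U[int(np.argmax(d))])
    centers = np.array(centers)

    for _ in range(n_iter):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        new = np.array([U[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels
```

On a graph made of two disconnected cliques, the rows of U are constant within each clique, so k-means recovers the two components exactly.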

三、Why Spectral Clustering Works?

1.Graph cut point of view

The goal of clustering is: we want to find a partition of the graph such that the edges between different groups have a very low weight (which means that points in different clusters are dissimilar from each other) and the edges within a group have high weight (which means that points within the same cluster are similar to each other). So the problem essentially amounts to solving the mincut problem on the graph.

For k = 2 the mincut problem apparently has practical algorithms, while for k > 2 two balanced objectives are commonly used: RatioCut (Hagen and Kahng 1992) and the normalized cut Ncut (Shi and Malik 2000).
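For reference, the two objectives from the tutorial, with $W(A,B) = \sum_{i \in A, j \in B} w_{ij}$, $\bar{A_i}$ the complement of $A_i$, and $\mathrm{vol}(A) = \sum_{i \in A} d_i$:

$\mathrm{RatioCut}(A_1,\dots,A_k) = \frac{1}{2}\sum_{i=1}^k \frac{W(A_i,\bar{A_i})}{|A_i|}, \qquad \mathrm{Ncut}(A_1,\dots,A_k) = \frac{1}{2}\sum_{i=1}^k \frac{W(A_i,\bar{A_i})}{\mathrm{vol}(A_i)}$

Both divide the cut weight by a measure of cluster size (number of vertices for RatioCut, volume for Ncut), which is exactly the balancing condition discussed next.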

Unfortunately, introducing balancing conditions makes the previously simple-to-solve mincut problem NP hard. In other words, naively optimizing the mincut objective often yields a result where many clusters contain only a single object; once we require the number of objects per cluster to be balanced, the problem becomes NP hard. So spectral clustering is a way to solve relaxed versions of those problems, i.e. it is only an approximate solution to the objectives above.

2.Random walks point of view

Just a few sentences here, enough to know what the conclusion is from the random-walk point of view:

A balanced partition with a low cut will also have the property that the random walk does not have many opportunities to jump between clusters.

We actually look for a cut through the graph such that a random walk seldom transitions from A to ~A and vice versa.
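This can be made concrete with a tiny experiment (the graph and the 0.1 bridge weight are made up for illustration): on a balanced low-cut partition, the per-state probability of crossing the cut is small. The tutorial relates this walk to the normalized Laplacian via $L_{rw} = I - D^{-1}W$.

```python
import numpy as np

# Two dense 3-vertex cliques joined by one weak edge.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
W[2, 3] = W[3, 2] = 0.1      # weak bridge across the cut

P = W / W.sum(axis=1, keepdims=True)   # random-walk transition matrix D^{-1} W

# per-state probability of jumping to the other cluster
cross = np.concatenate([P[:3, 3:].sum(axis=1), P[3:, :3].sum(axis=1)])
# only the two bridge endpoints can cross, each with probability 0.1/2.1 < 5%
```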