我无法完全理解 K-Means++ 算法.我对第一个 k 质心是如何选择的很感兴趣,即初始化,其余的就像原始 K 均值算法.
I am having trouble fully understanding the K-Means++ algorithm. I am interested exactly how the first k centroids are picked, namely the initialization as the rest is like in the original K-Means algorithm.
我将欣赏一步一步的解释和一个例子.维基百科中的那个不够清楚.此外,一个非常好的注释源代码也会有所帮助.如果您使用的是 6 个数组,请告诉我们哪个是用于什么的.
I will appreciate a step by step explanation and an example. The one in Wikipedia is not clear enough. Also a very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.
推荐答案有趣的问题.感谢您让我注意到这篇论文 - K-Means++:The优点细心播种
Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding
简单来说,簇中心最初是从一组输入观察向量中随机选择的,如果x不接近,则选择向量x的概率很高任何先前选择的中心.
In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.
这是一个一维的例子.我们的观察结果是 [0, 1, 2, 3, 4].设第一个中心 c1 为 0.下一个聚类中心 c2 为 x 的概率与 ||c1-x||^2.所以,P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, 其中a = 1/(1+4+9+16).
Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).
假设 c2=4.那么,P(c3 = 1) = 1a,P(c3 = 2) = 4a,P(c3 = 3) = 1a,其中a = 1/(1+4+1).
Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).
我已经用 Python 编写了初始化过程;不知道对你有没有帮助.
I've coded the initialization procedure in Python; I don't know if this helps you.
def initialize(X, K): C = [X[0]] for k in range(1, K): D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X]) probs = D2/D2.sum() cumprobs = probs.cumsum() r = scipy.rand() for j,p in enumerate(cumprobs): if r < p: i = j break C.append(X[i]) return C编辑并澄清:cumsum 的输出为我们提供了划分区间 [0,1] 的边界.这些分区的长度等于相应点被选为中心的概率.那么,由于 r 在 [0,1] 之间统一选择,它将恰好落入这些区间之一(因为 break).for 循环检查r 所在的分区.
EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals (because of break). The for loop checks to see which partition r is in.
示例:
probs = [0.1, 0.2, 0.3, 0.4] cumprobs = [0.1, 0.3, 0.6, 1.0] if r < cumprobs[0]: # this event has probability 0.1 i = 0 elif r < cumprobs[1]: # this event has probability 0.2 i = 1 elif r < cumprobs[2]: # this event has probability 0.3 i = 2 elif r < cumprobs[3]: # this event has probability 0.4 i = 3更多推荐
如何实现 K
发布评论