如何实现 K

编程入门行业动态更新时间:2024-10-05 07:25:50

本文介绍了如何实现 K-Means++ 算法?的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述

我无法完全理解 K-Means++ 算法.我对第一个 k 质心是如何选择的很感兴趣，即初始化，其余的就像原始 K 均值算法.

I am having trouble fully understanding the K-Means++ algorithm. I am interested exactly how the first k centroids are picked, namely the initialization as the rest is like in the original K-Means algorithm.

使用的概率函数是基于距离还是基于高斯?

同时选择距离最远的点(来自其他质心)作为新质心.

我将欣赏一步一步的解释和一个例子.维基百科中的那个不够清楚.此外，一个非常好的注释源代码也会有所帮助.如果您使用的是 6 个数组，请告诉我们哪个是用于什么的.

I will appreciate a step by step explanation and an example. The one in Wikipedia is not clear enough. Also a very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.

推荐答案

有趣的问题.感谢您让我注意到这篇论文 - K-Means++:The优点细心播种

Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding

简单来说，簇中心最初是从一组输入观察向量中随机选择的，如果x不接近，则选择向量x的概率很高任何先前选择的中心.

In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

这是一个一维的例子.我们的观察结果是 [0, 1, 2, 3, 4].设第一个中心 c1 为 0.下一个聚类中心 c2 为 x 的概率与 ||c1-x||^2.所以，P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, 其中a = 1/(1+4+9+16).

Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

假设 c2=4.那么，P(c3 = 1) = 1a，P(c3 = 2) = 4a，P(c3 = 3) = 1a，其中a = 1/(1+4+1).

Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).

我已经用 Python 编写了初始化过程；不知道对你有没有帮助.

I've coded the initialization procedure in Python; I don't know if this helps you.

def initialize(X, K): C = [X[0]] for k in range(1, K): D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X]) probs = D2/D2.sum() cumprobs = probs.cumsum() r = scipy.rand() for j,p in enumerate(cumprobs): if r < p: i = j break C.append(X[i]) return C

编辑并澄清:cumsum 的输出为我们提供了划分区间 [0,1] 的边界.这些分区的长度等于相应点被选为中心的概率.那么，由于 r 在 [0,1] 之间统一选择，它将恰好落入这些区间之一(因为 break).for 循环检查r 所在的分区.

EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals (because of break). The for loop checks to see which partition r is in.

示例:

probs = [0.1, 0.2, 0.3, 0.4] cumprobs = [0.1, 0.3, 0.6, 1.0] if r < cumprobs[0]: # this event has probability 0.1 i = 0 elif r < cumprobs[1]: # this event has probability 0.2 i = 1 elif r < cumprobs[2]: # this event has probability 0.3 i = 2 elif r < cumprobs[3]: # this event has probability 0.4 i = 3

更多推荐

如何实现 K

本文发布于:2023-11-29 02:14:21，感谢您对本站的认可！

本文链接:https://www.elefans.com/category/jswz/34/1644890.html