如何实现 K

编程入门 行业动态 更新时间:2024-10-05 07:25:50
本文介绍了如何实现 K-Means++ 算法?的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述

我无法完全理解 K-Means++ 算法.我对第一个 k 质心是如何选择的很感兴趣,即初始化,其余的就像原始 K 均值算法.

I am having trouble fully understanding the K-Means++ algorithm. I am interested exactly how the first k centroids are picked, namely the initialization as the rest is like in the original K-Means algorithm.

  • 使用的概率函数是基于距离还是基于高斯?
  • 同时选择距离最远的点(来自其他质心)作为新质心.
  • 我将欣赏一步一步的解释和一个例子.维基百科中的那个不够清楚.此外,一个非常好的注释源代码也会有所帮助.如果您使用的是 6 个数组,请告诉我们哪个是用于什么的.

    I will appreciate a step by step explanation and an example. The one in Wikipedia is not clear enough. Also a very well commented source code would also help. If you are using 6 arrays then please tell us which one is for what.

    推荐答案

    有趣的问题.感谢您让我注意到这篇论文 - K-Means++:The优点细心播种

    Interesting question. Thank you for bringing this paper to my attention - K-Means++: The Advantages of Careful Seeding

    简单来说,簇中心最初是从一组输入观察向量中随机选择的,如果x不接近,则选择向量x的概率很高任何先前选择的中心.

    In simple terms, cluster centers are initially chosen at random from the set of input observation vectors, where the probability of choosing vector x is high if x is not near any previously chosen centers.

    这是一个一维的例子.我们的观察结果是 [0, 1, 2, 3, 4].设第一个中心 c1 为 0.下一个聚类中心 c2 为 x 的概率与 ||c1-x||^2.所以,P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, 其中a = 1/(1+4+9+16).

    Here is a one-dimensional example. Our observations are [0, 1, 2, 3, 4]. Let the first center, c1, be 0. The probability that the next cluster center, c2, is x is proportional to ||c1-x||^2. So, P(c2 = 1) = 1a, P(c2 = 2) = 4a, P(c2 = 3) = 9a, P(c2 = 4) = 16a, where a = 1/(1+4+9+16).

    假设 c2=4.那么,P(c3 = 1) = 1a,P(c3 = 2) = 4a,P(c3 = 3) = 1a,其中a = 1/(1+4+1).

    Suppose c2=4. Then, P(c3 = 1) = 1a, P(c3 = 2) = 4a, P(c3 = 3) = 1a, where a = 1/(1+4+1).

    我已经用 Python 编写了初始化过程;不知道对你有没有帮助.

    I've coded the initialization procedure in Python; I don't know if this helps you.

    def initialize(X, K): C = [X[0]] for k in range(1, K): D2 = scipy.array([min([scipy.inner(c-x,c-x) for c in C]) for x in X]) probs = D2/D2.sum() cumprobs = probs.cumsum() r = scipy.rand() for j,p in enumerate(cumprobs): if r < p: i = j break C.append(X[i]) return C

    编辑并澄清:cumsum 的输出为我们提供了划分区间 [0,1] 的边界.这些分区的长度等于相应点被选为中心的概率.那么,由于 r 在 [0,1] 之间统一选择,它将恰好落入这些区间之一(因为 break).for 循环检查r 所在的分区.

    EDIT with clarification: The output of cumsum gives us boundaries to partition the interval [0,1]. These partitions have length equal to the probability of the corresponding point being chosen as a center. So then, since r is uniformly chosen between [0,1], it will fall into exactly one of these intervals (because of break). The for loop checks to see which partition r is in.

    示例:

    probs = [0.1, 0.2, 0.3, 0.4] cumprobs = [0.1, 0.3, 0.6, 1.0] if r < cumprobs[0]: # this event has probability 0.1 i = 0 elif r < cumprobs[1]: # this event has probability 0.2 i = 1 elif r < cumprobs[2]: # this event has probability 0.3 i = 2 elif r < cumprobs[3]: # this event has probability 0.4 i = 3

    更多推荐

    如何实现 K

    本文发布于:2023-11-29 02:14:21,感谢您对本站的认可!
    本文链接:https://www.elefans.com/category/jswz/34/1644890.html
    版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
    本文标签:如何实现

    发布评论

    评论列表 (有 0 条评论)
    草根站长

    >www.elefans.com

    编程频道|电子爱好者 - 技术资讯及电子产品介绍!