我正在尝试实现以下(分裂)聚类算法(以下简称为算法),完整描述可用 here ):
I am trying to implement the following (divisive) clustering algorithm (below is presented short form of the algorithm, the full description is available here):
从样本x开始,i = 1,...,n被视为单个n个数据点的簇和为所有对点定义的不相似矩阵D。修正阈值T以决定是否拆分集群。
Start with a sample x, i = 1, ..., n regarded as a single cluster of n data points and a dissimilarity matrix D defined for all pairs of points. Fix a threshold T for deciding whether or not to split a cluster.
首先确定所有数据点对和选择一个最大距离(Dmax)的对。
First determine the distance between all pairs of data points and choose a pair with the largest distance (Dmax) between them.
将Dmax与T比较。如果Dmax> T,则使用所选的作为两个新集群中的第一个元素。剩下的n-2个数据点放在两个新的集群之一中。如果D(x_i,x_l) Compare Dmax to T. If Dmax > T then divide single cluster in two by using the selected pair as the first elements in two new clusters. The remaining n - 2 data points are put into one of the two new clusters. x_l is added to the new cluster containing x_i if D(x_i, x_l) < D(x_j, x_l), otherwise is added to new cluster containing x_i. 在第二阶段,D(x_i,x_j)在群集中找到两个最大距离Dmax的两个新群集之一。如果Dmax < T,集群停止的划分,另一个集群被考虑。然后,该过程将重复从此迭代生成的集群。 At the second stage, the values D(x_i, x_j) are found within one of two new clusters to find the pair in the cluster with the largest distance Dmax between them. If Dmax < T, the division of the cluster stops and the other cluster is considered. Then the procedure repeats on the clusters generated from this iteration. 输出是集群数据记录的层次结构。我请教一个如何实现聚类算法的建议。 Output is a hierarchy of clustered data records. I kindly ask for an advice how to implement the clustering algorithm. 编辑1:我附加了Python函数,它定义了距离(相关系数)和在数据矩阵中找到最大距离的函数。 EDIT 1: I attach Python function which defines distance (correlation coefficient) and function which finds maximal distance in data matrix. 编辑2:是Dschoni的答案的功能。 EDIT 2: Pasted below are functions from Dschoni's answer. 编辑3: 当我运行@Dschoni提供的代码时,算法按预期工作。然后我修改了 create_distance_list 函数,所以我们可以计算多变量数据点之间的距离。我使用欧几里得距离。对于玩具示例,我加载 iris 数据。我只聚集数据集的前50个实例。 EDIT 3:
When I run the code provided by @Dschoni the algorithm works as expected. Then I modified the create_distance_list function so we can compute distance between multivariate data points. I use euclidean distance. For toy example I load iris data. I cluster only the first 50 instances of the dataset. 结果如下: / p> The result is as follows: [[24],[17],[4],[7],[40],[13],[14] [15],[26,27,38],[3,16, 39],[25],[42],[18,20,45],[43],[1,2,11 ,46],[12,37,41], [5],[21],[22],[10,23,28,29],[6,34,48],[0,8 ,33,36,44], [31],[32],[19],[30],[35],[9,47]]
[[24], [17], [4], [7], [40], [13], [14], [15], [26, 27, 38], [3, 16,
39], [25], [42], [18, 20, 45], [43], [1, 2, 11, 46], [12, 37, 41],
[5], [21], [22], [10, 23, 28, 29], [6, 34, 48], [0, 8, 33, 36, 44],
[31], [32], [19], [30], [35], [9, 47]]
某些数据点仍然聚集在一起。我通过在 sort 函数中的实际字典中添加少量的数据噪音来解决这个问题: Some data points are still clustered together. I solve this problem by adding small amount of data noise to actual dictionary in the sort function: 任何想法如何正确解决这个问题? Any idea how to solve this problem properly? 欧几里德距离的正确工作示例: A proper working example for the euclidean distance: 我编写的代码是为了可读性,有很多东西可以做得更快,更多可靠和漂亮。这只是给你一个如何做的想法。 I wrote the code for readability, there is a lot of stuff that can made faster, more reliable and prettier. This is just to give you an idea of how it can be done.
更多推荐
用于聚类算法的编程结构
发布评论