用于聚类算法的编程结构

编程入门行业动态更新时间:2024-10-19 21:26:15

本文介绍了用于聚类算法的编程结构的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述

我正在尝试实现以下（分裂）聚类算法（以下简称为算法），完整描述可用 here ）：

I am trying to implement the following (divisive) clustering algorithm (below is presented short form of the algorithm, the full description is available here):

从样本x开始，i = 1，...，n被视为单个n个数据点的簇和为所有对点定义的不相似矩阵D。修正阈值T以决定是否拆分集群。

Start with a sample x, i = 1, ..., n regarded as a single cluster of n data points and a dissimilarity matrix D defined for all pairs of points. Fix a threshold T for deciding whether or not to split a cluster.

首先确定所有数据点对和选择一个最大距离（Dmax）的对。

First determine the distance between all pairs of data points and choose a pair with the largest distance (Dmax) between them.

将Dmax与T比较。如果Dmax> T，则使用所选的作为两个新集群中的第一个元素。剩下的n-2个数据点放在两个新的集群之一中。如果D（x_i，x_l）

Compare Dmax to T. If Dmax > T then divide single cluster in two by using the selected pair as the first elements in two new clusters. The remaining n - 2 data points are put into one of the two new clusters. x_l is added to the new cluster containing x_i if D(x_i, x_l) < D(x_j, x_l), otherwise is added to new cluster containing x_i.

在第二阶段，D（x_i，x_j）在群集中找到两个最大距离Dmax的两个新群集之一。如果Dmax < T，集群停止的划分，另一个集群被考虑。然后，该过程将重复从此迭代生成的集群。

At the second stage, the values D(x_i, x_j) are found within one of two new clusters to find the pair in the cluster with the largest distance Dmax between them. If Dmax < T, the division of the cluster stops and the other cluster is considered. Then the procedure repeats on the clusters generated from this iteration.

输出是集群数据记录的层次结构。我请教一个如何实现聚类算法的建议。

Output is a hierarchy of clustered data records. I kindly ask for an advice how to implement the clustering algorithm.

编辑1：我附加了Python函数，它定义了距离（相关系数）和在数据矩阵中找到最大距离的函数。

EDIT 1: I attach Python function which defines distance (correlation coefficient) and function which finds maximal distance in data matrix.

# Read data from GitHub import pandas as pd df = pd.read_csv('raw.githubusercontent/nico/collectiveintelligence-book/master/blogdata.txt', sep = '\t', index_col = 0) data = df.values.tolist() data = data[1:10] # Define correlation coefficient as distance of choice def pearson(v1, v2): # Simple sums sum1 = sum(v1) sum2 = sum(v2) # Sums of the squares sum1Sq = sum([pow(v, 2) for v in v1]) sum2Sq = sum([pow(v, 2) for v in v2]) # Sum of the products pSum=sum([v1[i] * v2[i] for i in range(len(v1))]) # Calculate r (Pearson score) num = pSum - (sum1 * sum2 / len(v1)) den = sqrt((sum1Sq - pow(sum1,2) / len(v1)) * (sum2Sq - pow(sum2, 2) / len(v1))) if den == 0: return 0 return num / den # Find largest distance dist={} max_dist = pearson(data[0], data[0]) # Loop over upper triangle of data matrix for i in range(len(data)): for j in range(i + 1, len(data)): # Compute distance for each pair dist_curr = pearson(data[i], data[j]) # Store distance in dict dist[(i, j)] = dist_curr # Store max distance if dist_curr > max_dist: max_dist = dist_curr

编辑2：是Dschoni的答案的功能。

EDIT 2: Pasted below are functions from Dschoni's answer.

# Euclidean distance def euclidean(x,y): x = numpy.array(x) y = numpy.array(y) return numpy.sqrt(numpy.sum((x-y)**2)) # Create matrix def dist_mat(data): dist = {} for i in range(len(data)): for j in range(i + 1, len(data)): dist[(i, j)] = euclidean(data[i], data[j]) return dist # Returns i & k for max distance def my_max(dict): return max(dict) # Sort function list1 = [] list2 = [] def sort (rcd, i, k): list1.append(i) list2.append(k) for j in range(len(rcd)): if (euclidean(rcd[j], rcd[i]) < euclidean(rcd[j], rcd[k])): list1.append(j) else: list2.append(j)

编辑3：当我运行@Dschoni提供的代码时，算法按预期工作。然后我修改了 create_distance_list 函数，所以我们可以计算多变量数据点之间的距离。我使用欧几里得距离。对于玩具示例，我加载 iris 数据。我只聚集数据集的前50个实例。

EDIT 3: When I run the code provided by @Dschoni the algorithm works as expected. Then I modified the create_distance_list function so we can compute distance between multivariate data points. I use euclidean distance. For toy example I load iris data. I cluster only the first 50 instances of the dataset.

import pandas as pd df = pd.read_csv('archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', header = None, sep = ',') df = df.drop(4, 1) df = df[1:50] data = df.values.tolist() idl=range(len(data)) dist = create_distance_list(data) print sort(dist, idl)

结果如下： / p>

The result is as follows:

[[24]，[17]，[4]，[7]，[40]，[13]，[14] [15]，[26,27,38]，[3,16， 39]，[25]，[42]，[18，20，45]，[43]，[1，2，11 ，46]，[12,37,41]， [5]，[21]，[22]，[10,23,28,29]，[6,34,48]，[0,8 ，33,36,44]， [31]，[32]，[19]，[30]，[35]，[9，47]]

[[24], [17], [4], [7], [40], [13], [14], [15], [26, 27, 38], [3, 16, 39], [25], [42], [18, 20, 45], [43], [1, 2, 11, 46], [12, 37, 41], [5], [21], [22], [10, 23, 28, 29], [6, 34, 48], [0, 8, 33, 36, 44], [31], [32], [19], [30], [35], [9, 47]]

某些数据点仍然聚集在一起。我通过在 sort 函数中的实际字典中添加少量的数据噪音来解决这个问题：

Some data points are still clustered together. I solve this problem by adding small amount of data noise to actual dictionary in the sort function:

# Add small random noise for key in actual: actual[key] += np.random.normal(0, 0.005)

任何想法如何正确解决这个问题？

Any idea how to solve this problem properly?

推荐答案

欧几里德距离的正确工作示例：

A proper working example for the euclidean distance:

import numpy as np #For random number generation def create_distance_list(l): '''Create a distance list for every unique tuple of pairs''' dist={} for i in range(len(l)): for k in range(i+1,len(l)): dist[(i,k)]=abs(l[i]-l[k]) return dist def maximum(distance_dict): '''Returns the key of the maximum value if unique or a random key with the maximum value.''' maximum = max(distance_dict.values()) max_key = [key for key, value in distance_dict.items() if value == maximum] if len(max_key)>1: random_key = np.random.random_integers(0,len(max_key)-1) return (max_key[random_key],) else: return max_key def construct_new_dict(distance_dict,index_list): '''Helper function to create a distance map for a subset of data points.''' new={} for i in range(len(index_list)): for k in range(i+1,len(index_list)): m = index_list[i] n = index_list[k] new[(m,n)]=distance_dict[(m,n)] return new def sort(distance_dict,idl,threshold=4): result=[idl] i=0 try: while True: if len(result[i])>=2: actual=construct_new_dict(dist,result[i]) act_max=maximum(actual) if distance_dict[act_max[0]]>threshold: j = act_max[0][0] k = act_max[0][1] result[i].remove(j) result[i].remove(k) l1=[j] l2=[k] for iterr in range(len(result[i])): s = result[i][iterr] if s>j: c1=(j,s) else: c1=(s,j) if s>k: c2=(k,s) else: c2=(s,k) if actual[c1]<actual[c2]: l1.append(s) else: l2.append(s) result.remove(result[i]) #What to do if distance is equal? l1.sort() l2.sort() result.append(l1) result.append(l2) else: i+=1 else: i+=1 except: return result #This is the dataset a = [1,2,2.5,5] #Giving each entry a unique ID idl=range(len(a)) dist = create_distance_list(a) print sort(dist,idl)

我编写的代码是为了可读性，有很多东西可以做得更快，更多可靠和漂亮。这只是给你一个如何做的想法。

I wrote the code for readability, there is a lot of stuff that can made faster, more reliable and prettier. This is just to give you an idea of how it can be done.

更多推荐

用于聚类算法的编程结构

本文发布于:2023-11-29 16:52:11，感谢您对本站的认可！

本文链接:https://www.elefans.com/category/jswz/34/1647003.html