R: K-means clustering vs. community detection algorithms (weighted correlation network)


Question


My data looks like this: https://imgur.com/a/1hOsFpF

The first dataset is a standard-format dataset containing a list of people and their financial attributes.

The second dataset contains "relationships" between these people: how much they paid each other and how much they owe each other.

I am interested in learning more about network- and graph-based clustering, but I am trying to better understand what types of situations actually require it, i.e. I don't want to use graph clustering where it is not needed (to avoid a "square peg, round hole" situation).

Using R, I first created some fake data:

library(corrr)
library(dplyr)
library(igraph)
library(visNetwork)
library(stats)

# create first data set
Personal_Information <- data.frame(
  "name"   = c("John", "Jack", "Jason", "Jim", "Julian", "Jack", "Jake", "Joseph"),
  "age"    = c("41", "33", "24", "66", "21", "66", "29", "50"),
  "salary" = c("50000", "20000", "18000", "66000", "77000", "0", "55000", "40000"),
  "debt"   = c("10000", "5000", "4000", "0", "20000", "5000", "0", "1000")
)


Personal_Information$age = as.numeric(Personal_Information$age)
Personal_Information$salary = as.numeric(Personal_Information$salary)
Personal_Information$debt = as.numeric(Personal_Information$debt)
# create second data set
Relationship_Information <- data.frame(

"name_a" = c("John","John","John","Jack","Jack","Jack","Jason","Jason","Jim","Jim","Jim","Julian","Jake","Joseph","Joseph"),
"name_b" = c("Jack", "Jason", "Joseph", "John", "Julian","Jim","Jim", "Joseph", "Jack", "Julian", "John", "Joseph", "John", "Jim", "John"),
"how_much_they_owe_each_other" = c("10000","20000","60000","10000","40000","8000","0","50000","6000","2000","10000","10000","50000","12000","0"),
"how_much_they_paid_each_other" = c("5000","40000","120000","20000","20000","8000","0","20000","12000","0","0","0","50000","0","0")
)

Relationship_Information$how_much_they_owe_each_other = as.numeric(Relationship_Information$how_much_they_owe_each_other)
Relationship_Information$how_much_they_paid_each_other = as.numeric(Relationship_Information$how_much_they_paid_each_other)

Then, I ran a standard K-means clustering algorithm on the first dataset and plotted the results:

# Method 1: simple K-means analysis with 2 clusters on the Personal_Information dataset
cl <- kmeans(Personal_Information[, c(2:4)], 2)
plot(Personal_Information[, c(2:4)], col = cl$cluster)
points(cl$centers, col = 1:2, pch = 8, cex = 2)
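
A small variant worth comparing (not part of the original question): kmeans is scale-sensitive, and salary and debt sit on a much larger numeric scale than age, so a sketch with standardized columns may be informative:

# same K-means call, but with each column standardized so no single variable dominates
set.seed(1)   # make the random cluster assignment reproducible
cl_scaled <- kmeans(scale(Personal_Information[, c(2:4)]), centers = 2)
cl_scaled$cluster   # cluster label per person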

This is how I would normally have treated this problem. Now I want to see whether I can use graph clustering for this type of problem.

First, I created a weighted correlation network using the first dataset (following http://www.sthda.com/english/articles/33-social-network-analysis/136-network-analysis-and-manipulation-using-r/):

# correlate, keep one triangle, reshape into an edge list, and keep strong correlations (r >= 0.8)
res.cor <- Personal_Information[, c(2:4)] %>%
  t() %>%
  correlate() %>%
  shave(upper = TRUE) %>%
  stretch(na.rm = TRUE) %>%
  filter(r >= 0.8)

graph <- graph.data.frame(res.cor, directed = FALSE)
graph <- simplify(graph)
plot(graph)

Then, I ran the graph clustering algorithm:

# run graph clustering (also called community detection) on the correlation network
fc <- fastgreedy.community(graph)
V(graph)$community <- fc$membership
nodes <- data.frame(id = V(graph)$name, title = V(graph)$name, group = V(graph)$community)
nodes <- nodes[order(nodes$id, decreasing = FALSE), ]
edges <- get.data.frame(graph, what = "edges")[1:2]

visNetwork(nodes, edges) %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)

This seems to work, but I am not sure it is the optimal way to approach this problem.

Can someone provide some advice? Have I overcomplicated this problem?

Thanks

Answer

(First, some background to frame the nature of the problem as you describe it.) You have two datasets and therefore two data structures: Personal_Information and Relationship_Information. You have a set of entities that appear to be unique, since there are no repeated names in Personal_Information. If you also have information about how these entities connect to one another, you can treat them as nodes in a network, and their interconnectivity produces a network containing communities that a community detection algorithm can uncover/allocate/detect. So:

Personal_Information describes each person (the nodes), and Relationship_Information describes their connections/relationships (the edges).

In the example usage in your code, you appear to use only the graph built from Personal_Information (res.cor <- Personal_Information[, c(2:4)] %>% ...) and not Relationship_Information. This means you are building relationships between the variables that are intrinsic to each person as a node, rather than using the data produced by their interactions with each other. To see what you are doing here, your direction is like saying: I am going to build a network between people's personality traits and ignore the associations between the people themselves, even though I have that data; I am going to look at how those traits correlate with one another, and then see which groups of feature values follow each other (correlate in groups).

So finding correlations between the features of nodes (persons) across multiple people is fine, producing a matrix of that information is fine, and producing a graph/network from it is also fine. The result you obtain from that graph (which you call graph), via fc <- fastgreedy.community(graph), is which groups of variables are co-correlated across people. For example, var1 and var2 are strongly positively correlated, but var2 and var3 are strongly negatively correlated, so the edge between var2 and var3 pushes them into separate communities and also pushes var1 into a different community from var3, because var1 is tied strongly to var2 (a close friend). How is this information useful? It can help you understand how the variables behave as groups, so that if a new person has a low value of var2 and you don't know var1 or var3, you would expect var1 to be low as well and var3 to be high. If you took the covariance of the person data, you could take its eigenvectors and effectively do PCA, which gives you vectors carrying information of this nature.
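
A minimal sketch of the PCA idea mentioned above, using base R on the numeric person-level columns (the column selection here is an assumption based on the fake data created earlier):

# covariance of the standardized person-level variables and its eigenvectors
pi_num <- scale(Personal_Information[, c("age", "salary", "debt")])
eig <- eigen(cov(pi_num))
eig$vectors   # directions along which the variables co-vary
eig$values    # how much variance each direction carries
# equivalently, prcomp() performs the same decomposition in one call
prcomp(Personal_Information[, c("age", "salary", "debt")], scale. = TRUE)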

But this does not use the information about network edges that you actually observed/measured in your Relationship_Information data, which describes the relational (community) information rather than the node data. That dataset looks like an adjacency list: a data structure whose first two columns give the source node (col1) and the destination node (col2), with the edge weight in col3. If the same pair of node names also appears with col1 and col2 swapped and the same edge weight, the network has symmetric edges (it is undirected); otherwise it is directed. Since your data has two edge-weight columns (col3 and col4), you can either produce one network from col1, col2, col3 and another from col1, col2, col4, or you can produce a single network with

adj_list1 = (name_a, name_b, how_much_they_paid_each_other - how_much_they_owe_each_other)

or

adj_list2 = (name_a, name_b, how_much_they_paid_each_other / how_much_they_owe_each_other)
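
A minimal sketch of how these two candidate edge lists could be built in R (the pmax() guard against dividing by zero is my addition, since some owed amounts in the fake data are 0):

library(dplyr)

# net flow between each pair: paid minus owed
adj_list1 <- Relationship_Information %>%
  mutate(weight = how_much_they_paid_each_other - how_much_they_owe_each_other) %>%
  select(name_a, name_b, weight)

# ratio of paid to owed, guarding against owed == 0
adj_list2 <- Relationship_Information %>%
  mutate(weight = how_much_they_paid_each_other /
                  pmax(how_much_they_owe_each_other, 1)) %>%
  select(name_a, name_b, weight)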

How you define an edge from those values is up to you. You then want to produce a network from adj_list1 or adj_list2 and apply community detection to that network. Think of the payments in that dataset as interactions connecting people, much like likes and mentions connect people on social media. The community results here give the labels of the groups of people that are economically tied together, according to whichever edge definition you use, and you can apply an algorithm such as the Louvain algorithm to do so.
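
A sketch of that step, assuming adj_list1 from above (modularity-based methods such as Louvain generally assume non-negative weights, so the absolute net flow is used here, and zero-weight pairs are dropped; both are purely illustrative choices):

library(igraph)
library(dplyr)

edges_pay <- adj_list1 %>%
  mutate(weight = abs(weight)) %>%
  filter(weight > 0)   # drop pairs with no net flow

g_pay <- graph_from_data_frame(edges_pay, directed = FALSE)
g_pay <- simplify(g_pay, edge.attr.comb = list(weight = "sum"))  # merge duplicate pairs

lv <- cluster_louvain(g_pay)   # Louvain community detection on the payment network
membership(lv)                 # community label for each person
plot(lv, g_pay)                # colour nodes by community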

But this does not use both the nodes' data and the edge data (the person data and the exchange data) together. They answer different questions.

Applying K-means to the node feature data answers a different question from the community detection algorithm.

K-means says that the values of these variables across people are not uniformly distributed: they concentrate in K dense regions, with sparse samples in the regions in between. Community detection, on the other hand, ignores what each person's features are and lets people group together through their interactions, so you can see how many groups there are; if people exchange money among themselves, they concentrate into a sub-group.

So those questions, addressed with clustering and community detection respectively, are independent, because they use independently collected datasets: the spreadsheets do not depend on each other, and neither does the data. That does not mean the two datasets carry no overlapping information; you can have those features affect the edges. But when presenting it, you have two separate investigations.

(Another answer mentions fusion-based approaches that analyze the node data and the edge data together, but that does not appear to be your question. Are you trying to use both datasets together? If so, the easiest route is an approach with good existing implementations, and "graph neural networks" such as SGC, the simple graph convolutional network, are a good recommendation. Although it sounds intimidating, you would feed it the adjacency matrix built from the payment network you construct, plus the node attributes/features. Python's DGL library is great for this, and you can run it unsupervised on scaled data if you want.)
