Grouping many-to-many relationships from a two-column map

Updated: 2024-10-28 04:21:26

Problem description

I have a SQL table that maps, say, authors and books. I would like to group linked authors and books (books written by the same author, and authors who co-wrote a book) together and ascertain how big these groups get. For example, if J.K. Rowling co-wrote with Junot Diaz, and Junot Diaz co-wrote a book with Zadie Smith, then I would want all three authors in the same group.

Here's a toy data set (h/t Matthew Dowle) with some of the relationships I am talking about:

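    library(data.table)

    set.seed(1)
    # number of authors (1 to 3) for each of 100 books
    authors <- replicate(100, sample(1:3, 1))
    book_id <- rep(1:100, times = authors)
    # draw that many distinct author ids for each book
    author_id <- c(lapply(authors, sample, x = 1:100, replace = FALSE), recursive = TRUE)
    aubk <- data.table(author_id = author_id, book_id = book_id)
    aubk[order(book_id, author_id), ]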

Here one sees that authors 27 and 36 co-wrote book 2, so they should be in the same group. The same goes for authors 63 and 100 on book 3, for D, F and L on book 4, and so on.

I can't think of a good way to do this other than a for-loop, which (as you can guess) is slow. I tried a bit of data.table to avoid unnecessary copying. Is there a better way of doing it?

    aubk$group <- integer(dim(aubk)[1])
    library(data.table)
    aubk <- data.table(aubk)
    #system.time({
    for (x in 1:dim(aubk)[1]) {
      # x is an integer, so compare against 1L; identical(x, 1) is always FALSE
      if (identical(x, 1L)) {
        value <- 1L
      } else {
        sb <- aubk[1:(x - 1), ]
        # look for the same author in the earlier rows
        index <- match(aubk[x, author_id], sb[, author_id])
        if (identical(index, NA_integer_)) {
          # otherwise look for the same book in the earlier rows
          index <- match(aubk[x, book_id], sb[, book_id])
          if (identical(index, NA_integer_)) {
            value <- x
          } else {
            value <- aubk[index, group]
          }
        } else {
          value <- aubk[index, group]
        }
      }
      aubk[x, group := value]
    }
    #})

EDIT: As mentioned by @Josh O'Brien and @thelatemail, my problem can also be worded as looking for the connected components of a graph from a two-column list where every edge is a row, and the two columns are the nodes connected.
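One way to make the graph framing concrete is a small union-find over the rows of the table. This is only a sketch under stated assumptions, not the method used in the answer: the `edges` data frame is hypothetical toy data mirroring the three-author example, `find_root` is a helper name made up here, and author and book ids are prefixed with "a" and "b" because both columns use the same integer range. It works directly from the edge list, so no adjacency matrix is built.

```r
# Hypothetical toy mapping: each row is one edge between an author and a book.
# Authors 1-3 are chained through books 1 and 2; author 4 wrote book 3 alone.
edges <- data.frame(
  author_id = c(1, 2, 2, 3, 4),
  book_id   = c(1, 1, 2, 2, 3)
)

# Prefix the ids so that author 1 and book 1 are different nodes.
a <- paste0("a", edges$author_id)
b <- paste0("b", edges$book_id)
nodes <- unique(c(a, b))

# parent[i] is the index of node i's parent; a root is its own parent.
parent <- seq_along(nodes)

# Follow parent links until a root is reached (no path compression here).
find_root <- function(i) {
  while (parent[i] != i) i <- parent[i]
  i
}

# Union the two endpoints of every edge, keeping the smaller root index.
ai <- match(a, nodes)
bi <- match(b, nodes)
for (k in seq_along(ai)) {
  ra <- find_root(ai[k])
  rb <- find_root(bi[k])
  if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
}

# Label every node with its root, then label each row by its author's root.
root <- vapply(seq_along(nodes), find_root, integer(1))
edges$group <- root[ai]
print(edges)
```

Rows that end up with the same `group` value belong to the same connected component. For a table with hundreds of thousands of rows, a plain R loop like this may itself be slow, so treat it as an illustration of the idea rather than a tuned implementation.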

Solution

Converting 500K nodes into an adjacency matrix was too much for my computer's memory, so I couldn't use igraph. The RBGL package isn't updated for R version 2.15.1, so that was out as well.

After writing a lot of dumb code that doesn't seem to work, I think the following gets me to the right answer.

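    # start with one group label per author
    aubk[, grp := author_id]
    num.grp.old <- aubk[, length(unique(grp))]
    iterations <- 0
    repeat {
      # push the smallest label across all rows of the same author, then the same book
      aubk[, grp := min(grp), by = author_id]
      aubk[, grp := min(grp), by = book_id]
      num.grp.new <- aubk[, length(unique(grp))]
      # stop once a pass no longer reduces the number of groups
      if (num.grp.new == num.grp.old) {break}
      num.grp.old <- num.grp.new
      iterations <- iterations + 1
    }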
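The question also asks how big the groups get. As a minimal sketch, assuming a small hypothetical table in place of the real data, the same label propagation can be run and the groups then counted by distinct authors and distinct books. The names `sizes`, `n_authors`, `n_books` and `n_old` are illustrative only, and the loop uses the same stopping rule as the code above.

```r
library(data.table)

# Hypothetical toy mapping: authors 1-3 are chained through books 1 and 2,
# author 4 wrote book 3 alone.
aubk <- data.table(author_id = c(1L, 2L, 2L, 3L, 4L),
                   book_id   = c(1L, 1L, 2L, 2L, 3L))

# Start with one label per author, then repeatedly push the smallest label
# across rows sharing an author and rows sharing a book.
aubk[, grp := author_id]
repeat {
  n_old <- aubk[, uniqueN(grp)]
  aubk[, grp := min(grp), by = author_id]
  aubk[, grp := min(grp), by = book_id]
  if (aubk[, uniqueN(grp)] == n_old) break
}

# Size of each group, in distinct authors and distinct books.
sizes <- aubk[, .(n_authors = uniqueN(author_id),
                  n_books   = uniqueN(book_id)), by = grp]
print(sizes)
```

On this toy table the three chained authors share one label and the solo author keeps their own, so `sizes` has two rows.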

Published: 2023-10-27 18:37:23

Article link: https://www.elefans.com/category/jswz/34/1534201.html