Grouping many-to-many relationships from a two-column map

Updated: 2024-10-28 04:21:26

Problem description

I have a SQL table that maps, say, authors and books. I would like to group linked authors and books (books written by the same author, and authors who co-wrote a book) together and ascertain how big these groups get. For example, if J.K. Rowling co-wrote with Junot Diaz, and Junot Diaz co-wrote a book with Zadie Smith, then I would want all three authors in the same group.

Here's a toy data set (h/t Matthew Dowle) with some of the relationships I am talking about:

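    library(data.table)

    set.seed(1)
    # Number of authors (1 to 3) for each of 100 books
    authors <- replicate(100, sample(1:3, 1))
    book_id <- rep(1:100, times = authors)
    # Draw that many distinct author ids for each book
    author_id <- c(lapply(authors, sample, x = 1:100, replace = FALSE), recursive = TRUE)
    aubk <- data.table(author_id = author_id, book_id = book_id)
    aubk[order(book_id, author_id), ]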

Here one sees that authors 27 and 36 co-wrote book 2, so they should be in the same group. The same goes for authors 63 and 100 on book 3, and for D, F and L on book 4. And so on.

I can't think of a good way to do this other than a for-loop, which (as you can guess) is slow. I tried a bit of data.table to avoid unnecessary copying. Is there a better way of doing it?

    library(data.table)
    aubk <- data.table(aubk)
    aubk$group <- integer(nrow(aubk))
    # system.time({
    for (x in seq_len(nrow(aubk))) {
      if (x == 1L) {
        value <- 1L
      } else {
        sb <- aubk[1:(x - 1), ]  # rows that already have a group
        index <- match(aubk[x, author_id], sb[, author_id])
        if (is.na(index)) {
          index <- match(aubk[x, book_id], sb[, book_id])
          if (is.na(index)) {
            value <- x           # unseen author and unseen book: new group
          } else {
            value <- aubk[index, group]
          }
        } else {
          value <- aubk[index, group]
        }
      }
      aubk[x, group := value]
    }
    # })

EDIT: As mentioned by @Josh O'Brien and @thelatemail, my problem can also be worded as looking for the connected components of a graph from a two-column list where every edge is a row, and the two columns are the nodes connected.
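
To make the connected-components framing concrete, here is a minimal base-R sketch using union-find. It is not from the original post: `find_groups`, the `a:`/`b:` prefixes, the book names and the extra row "Other Author" are all hypothetical names chosen for illustration, and the loop is not tuned for 500K rows.

```r
# Hypothetical helper (not from the original post): connected components
# of an author-book edge list via union-find, using base R only.
find_groups <- function(author_id, book_id) {
  # Prefix the ids so an author and a book sharing a number stay distinct nodes.
  a <- paste0("a:", author_id)
  b <- paste0("b:", book_id)
  nodes <- unique(c(a, b))
  parent <- seq_along(nodes)  # every node starts as its own root
  ia <- match(a, nodes)
  ib <- match(b, nodes)

  # Follow parent pointers to the root, shortening the path on the way.
  find_root <- function(i) {
    while (parent[i] != i) {
      parent[i] <<- parent[parent[i]]
      i <- parent[i]
    }
    i
  }

  # Each row is an edge: merge the two trees it connects.
  for (k in seq_along(ia)) {
    ra <- find_root(ia[k])
    rb <- find_root(ib[k])
    if (ra != rb) parent[max(ra, rb)] <- min(ra, rb)
  }

  # One group label per input row, numbered in order of first appearance.
  roots <- vapply(ia, find_root, integer(1))
  match(roots, unique(roots))
}

edges <- data.frame(
  author = c("J.K. Rowling", "Junot Diaz", "Junot Diaz", "Zadie Smith", "Other Author"),
  book   = c("book1", "book1", "book2", "book2", "book3")
)
edges$group <- find_groups(edges$author, edges$book)
group_sizes <- table(edges$group)
```

In this toy input the three co-authoring writers land in one group and the unrelated row gets its own, which is the behaviour asked for in the question.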

Solution

Converting 500K nodes into an adjacency matrix was too much for my computer's memory, so I couldn't use igraph. The RBGL package isn't updated for R version 2.15.1, so that was out as well.

After writing a lot of dumb code that doesn't seem to work, I think the following gets me to the right answer.

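    aubk[, grp := author_id]
    # Labels only ever decrease, so an unchanged sum means no row changed.
    # Comparing only the number of distinct groups can stop a pass too early.
    grp.sum.old <- aubk[, sum(as.numeric(grp))]
    iterations <- 0
    repeat {
      aubk[, grp := min(grp), by = author_id]
      aubk[, grp := min(grp), by = book_id]
      grp.sum.new <- aubk[, sum(as.numeric(grp))]
      if (grp.sum.new == grp.sum.old) {break}
      grp.sum.old <- grp.sum.new
      iterations <- iterations + 1
    }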
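
For readers who want to trace the min-label idea by hand, here is a small base-R sketch of the same propagation using `ave()`, followed by a group-size summary. The seven-row chain and its ids are hypothetical, and this is an illustration under those assumptions rather than the answer's exact code. It stops only when a full pass changes no label; on this particular chain, comparing just the number of distinct groups between passes would stop after the third pass with two groups left.

```r
# Toy edge list (hypothetical ids): one chain of four books, so every row
# belongs to a single connected component.
aubk <- data.frame(
  author_id = c(1, 9, 9, 5, 5, 8, 8),
  book_id   = c(1, 1, 2, 2, 3, 3, 4)
)

aubk$grp <- aubk$author_id
repeat {
  grp.old <- aubk$grp
  # Pull every row of an author, then of a book, down to the smallest label.
  aubk$grp <- ave(aubk$grp, aubk$author_id, FUN = min)
  aubk$grp <- ave(aubk$grp, aubk$book_id, FUN = min)
  # Labels only ever decrease, so an unchanged vector means convergence.
  if (identical(aubk$grp, grp.old)) break
}

# Size of each group: distinct authors and distinct books per label.
group_sizes <- aggregate(
  cbind(author_id, book_id) ~ grp, data = aubk,
  FUN = function(x) length(unique(x))
)
```

The number of passes grows with the length of the longest chain, since the smallest label travels roughly one book further per pass.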

Published: 2023-10-27 18:37:23
Link: https://www.elefans.com/category/jswz/34/1534201.html