R: Checking for duplicates is painfully slow, even with mclapply

Problem Description

I've got some data involving repeated sales for a bunch of cars with unique Ids. A car can be sold more than once.

Some of the Ids are erroneous, however, so I'm checking, for each Id, whether the size is recorded as the same over multiple sales. If it isn't, then I know that the Id is erroneous.

I'm trying to do this with the following code:

library("doMC") Data <- data.frame(ID=c(15432,67325,34623,15432,67325,34623),Size=c("Big","Med","Small","Big","Med","Big")) compare <- function(v) all(sapply( as.list(v[-1]), FUN=function(z) {isTRUE(all.equal(z, v[1]))})) IsGoodId = function(Id){ Sub = Data[Data$ID==Id,] if (length(Sub[,1]) > 1){ return(compare(Sub[,"Size"])) }else{ return(TRUE) } } WhichAreGood = mclapply(unique(Data$ID),IsGoodId)

But it's painfully, awfully, terribly slow on my quad-core i5.

Can anyone see where the bottleneck is? I'm a newbie to R optimisation.

Thanks, -N

Recommended Answer

Looks like your algorithm makes N^2 comparisons: IsGoodId subsets the full data frame once per unique Id, so every row is scanned for every Id. Maybe something like the following will scale better. We find the duplicate sales, thinking that this is a small subset of the total.

dups = unique(Data$ID[duplicated(Data$ID)])
DupData = Data[Data$ID %in% dups, , drop = FALSE]
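In the six-row toy data every Id happens to recur, so this filter keeps all the rows; on real data, where repeat sales are a small fraction of the total, DupData should be much smaller than Data:

dups
# [1] 15432 67325 34623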

The %in% operator scales very well. Then split the Size column by Id, checking for Ids with more than one size:

tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)
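On the toy data this returns (note that tapply orders the result by sorted Id):

# 15432 34623 67325
# FALSE  TRUE FALSE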

This gives a named logical vector, with TRUE indicating that there is more than one size per Id. This scales approximately linearly with the number of duplicate sales; there are clever ways to make this go fast, so if your duplicated data is itself big...
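The answer leaves those "clever ways" unspecified; as one illustrative sketch (my assumption, not the answerer's suggestion), the same check can be done in a single grouped pass with the data.table package:

library(data.table)

DT <- as.data.table(Data)   # reusing the toy Data from above
# uniqueN() counts distinct values per group, flagging Ids sold under more than one Size
bad_ids <- DT[, .(multi_size = uniqueN(Size) > 1L), by = ID][multi_size == TRUE, ID]
bad_ids
# [1] 34623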

Hmm, thinking about it a bit more, I guess

u = unique(Data)
u$ID[duplicated(u$ID)]

does the trick.
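On the toy data this returns just the erroneous Id; unique() collapses exact (ID, Size) repeats first, so any Id still duplicated afterwards must have been recorded with more than one size:

u$ID[duplicated(u$ID)]
# [1] 34623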
