I've got some data involving repeated sales for a bunch of cars with unique IDs. A car can be sold more than once.
Some of the IDs are erroneous, however, so for each ID I'm checking whether the size is recorded as the same across multiple sales. If it isn't, then I know that the ID is erroneous.
I'm trying to do this with the following code:
```r
library("doMC")

Data <- data.frame(ID   = c(15432, 67325, 34623, 15432, 67325, 34623),
                   Size = c("Big", "Med", "Small", "Big", "Med", "Big"))

compare <- function(v) all(sapply(as.list(v[-1]),
                                  FUN = function(z) isTRUE(all.equal(z, v[1]))))

IsGoodId <- function(Id) {
  Sub <- Data[Data$ID == Id, ]
  if (length(Sub[, 1]) > 1) {
    return(compare(Sub[, "Size"]))
  } else {
    return(TRUE)
  }
}

WhichAreGood <- mclapply(unique(Data$ID), IsGoodId)
```

But it's painfully, awfully, terribly slow on my quad-core i5.
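One way to see where the time goes is to time the per-ID subsetting step in isolation. A minimal sketch, not from the original post, using base R's `system.time` on a larger synthetic version of the example data:

```r
# Synthetic data in the same shape as the example, but bigger,
# so that the cost of the per-ID subsetting is measurable
set.seed(1)
n <- 10000
Data <- data.frame(ID   = sample(1:2000, n, replace = TRUE),
                   Size = sample(c("Big", "Med", "Small"), n, replace = TRUE))

# Data[Data$ID == Id, ] scans the whole frame once per unique ID,
# so the total work is roughly length(unique(Data$ID)) * nrow(Data)
t_subset <- system.time(
  for (Id in unique(Data$ID)) invisible(Data[Data$ID == Id, ])
)
print(t_subset)
```

If this loop alone accounts for most of the runtime, the bottleneck is the repeated full scans of `Data`, not the comparison function.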
Can anyone see where the bottleneck is? I'm a newbie to R optimisation.

Thanks,
-N
Answer:

Looks like your algorithm makes N^2 comparisons. Maybe something like the following will scale better. We find the duplicate sales, reasoning that these are a small subset of the total.
```r
dups <- unique(Data$ID[duplicated(Data$ID)])
DupData <- Data[Data$ID %in% dups, , drop = FALSE]
```

The `%in%` operator scales very well. Then split the Size column by ID, checking for IDs with more than one size:
```r
tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)
```

This gives a named logical vector, with TRUE indicating that there is more than one size per ID. This scales approximately linearly with the number of duplicate sales; there are clever ways to make this go fast, so if your duplicated data is itself big...
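Run on the example data from the question, the approach above flags exactly the inconsistent ID (a self-contained sketch; `bad` is an assumed name):

```r
Data <- data.frame(ID   = c(15432, 67325, 34623, 15432, 67325, 34623),
                   Size = c("Big", "Med", "Small", "Big", "Med", "Big"))

# Restrict to IDs that appear more than once
dups    <- unique(Data$ID[duplicated(Data$ID)])
DupData <- Data[Data$ID %in% dups, , drop = FALSE]

# TRUE where an ID carries more than one distinct size
bad <- tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)
print(bad)
# Only 34623 is TRUE: it is recorded as both "Small" and "Big"
```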
Hmm, on second thought:
```r
u <- unique(Data)
u$ID[duplicated(u$ID)]
```

does the trick.
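Putting that refinement together on the example data (a self-contained sketch; `bad_ids` is an assumed name):

```r
Data <- data.frame(ID   = c(15432, 67325, 34623, 15432, 67325, 34623),
                   Size = c("Big", "Med", "Small", "Big", "Med", "Big"))

# Dropping exact duplicate rows first means any ID that is still
# duplicated must carry at least two different sizes -- i.e. the bad IDs
u <- unique(Data)
bad_ids <- u$ID[duplicated(u$ID)]
print(bad_ids)
# 34623, the ID sold as both "Small" and "Big"
```

This does one pass to deduplicate and one pass to find repeats, so it also avoids the N^2 comparisons of the original approach.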