I've got some data involving repeated sales for a bunch of cars with unique IDs. A car can be sold more than once.
Some of the IDs are erroneous, however, so for each ID I'm checking whether the size is recorded as the same across multiple sales. If it isn't, then I know that the ID is erroneous.
I'm trying to do this with the following code:
```r
library("doMC")

Data <- data.frame(ID   = c(15432, 67325, 34623, 15432, 67325, 34623),
                   Size = c("Big", "Med", "Small", "Big", "Med", "Big"))

compare <- function(v) all(sapply(as.list(v[-1]),
                                  FUN = function(z) isTRUE(all.equal(z, v[1]))))

IsGoodId <- function(Id) {
  Sub <- Data[Data$ID == Id, ]
  if (length(Sub[, 1]) > 1) {
    return(compare(Sub[, "Size"]))
  } else {
    return(TRUE)
  }
}

WhichAreGood <- mclapply(unique(Data$ID), IsGoodId)
```

But it's painfully, awfully, terribly slow on my quad-core i5.
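One way to see where the time goes is to time the per-ID subsetting step in isolation. A minimal sketch, not from the original post, using base R's `system.time` on a larger synthetic version of the example data:

```r
# Synthetic data in the same shape as the example, but bigger,
# so that the cost of the per-ID subsetting is measurable
set.seed(1)
n <- 10000
Data <- data.frame(ID   = sample(1:2000, n, replace = TRUE),
                   Size = sample(c("Big", "Med", "Small"), n, replace = TRUE))

# Data[Data$ID == Id, ] scans the whole frame once per unique ID,
# so the total work is roughly length(unique(Data$ID)) * nrow(Data)
t_subset <- system.time(
  for (Id in unique(Data$ID)) invisible(Data[Data$ID == Id, ])
)
print(t_subset)
```

If this loop alone accounts for most of the runtime, the bottleneck is the repeated full scans of `Data`, not the comparison function.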
Can anyone see where the bottleneck is? I'm a newbie to R optimisation.

Thanks,
-N
Answer:

Looks like your algorithm makes N^2 comparisons. Maybe something like the following will scale better. We find the duplicate sales, reasoning that these are a small subset of the total.
```r
dups <- unique(Data$ID[duplicated(Data$ID)])
DupData <- Data[Data$ID %in% dups, , drop = FALSE]
```

The `%in%` operator scales very well. Then split the Size column by ID, checking for IDs with more than one size:
```r
tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)
```

This gives a named logical vector, with TRUE indicating that there is more than one size per ID. This scales approximately linearly with the number of duplicate sales; there are clever ways to make this go fast, so if your duplicated data is itself big...
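Run on the example data from the question, the approach above flags exactly the inconsistent ID (a self-contained sketch; `bad` is an assumed name):

```r
Data <- data.frame(ID   = c(15432, 67325, 34623, 15432, 67325, 34623),
                   Size = c("Big", "Med", "Small", "Big", "Med", "Big"))

# Restrict to IDs that appear more than once
dups    <- unique(Data$ID[duplicated(Data$ID)])
DupData <- Data[Data$ID %in% dups, , drop = FALSE]

# TRUE where an ID carries more than one distinct size
bad <- tapply(DupData$Size, DupData$ID, function(x) length(unique(x)) != 1)
print(bad)
# Only 34623 is TRUE: it is recorded as both "Small" and "Big"
```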
Hmm, on second thought:
```r
u <- unique(Data)
u$ID[duplicated(u$ID)]
```

does the trick.
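Putting that refinement together on the example data (a self-contained sketch; `bad_ids` is an assumed name):

```r
Data <- data.frame(ID   = c(15432, 67325, 34623, 15432, 67325, 34623),
                   Size = c("Big", "Med", "Small", "Big", "Med", "Big"))

# Dropping exact duplicate rows first means any ID that is still
# duplicated must carry at least two different sizes -- i.e. the bad IDs
u <- unique(Data)
bad_ids <- u$ID[duplicated(u$ID)]
print(bad_ids)
# 34623, the ID sold as both "Small" and "Big"
```

This does one pass to deduplicate and one pass to find repeats, so it also avoids the N^2 comparisons of the original approach.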