将来自两个数据帧的信息与dplyr组合在一起(Combine information from two data frames with dplyr)

我需要一些dplyr的帮助。我有两个数据框 - 一个是巨大的，有几个时间序列A,B,...在那里（ LargeDF ），第二个（ Categories ）有时间间隔（左边界和右边界）。

我想向LargeDF添加另一个列，标记为leftBoundary ，包含适当的边界值，如下所示：

LargeDF ts timestamp signal # left_boundary 1 A 0.3209338 10.43279 # 0 2 A 1.4791524 10.34295 # 1 3 A 2.6007494 10.71601 # 2

和

Categories ts left right 1 A 0 1 2 A 1 2 3 A 2 3

我想出的代码是

LargeDF %>% group_by(ts) %>% do(myFUN(., Categories)) # calls this ... myFUN <- function(Large, Categ) { CategTS <- Categ %>% filter(ts == Large[1, "ts"][[1]]) Large %>% group_by(timestamp) %>% # this is bothering me... mutate(left_boundary = CategTS$left[CategTS$left < timestamp & timestamp < CategTS$right]) }

但是对于大型时间序列来说它是超级慢的。我真的想失去group_by(timestamp) ，因为它们在每个ts都是独一无二的。

有人看到更好的解决方案吗？非常感谢。

# Code for making the example data frames ... library("dplyr") n <- 10; series <- c("A", "B", "C") LargeDF <- data.frame( ts = rep(series, each = n) , timestamp = runif(n*length(series), max = 4) , signal = runif(n*length(series), min = 10, max = 11) ) %>% group_by(ts) %>% arrange(timestamp) m <- 7 Categories <- data.frame( ts = rep(series, each = m) , left = rep(seq(1 : m) - 1, length(series)) , right = rep(seq(1 : m), length(series)) )

更新（data.table和我略微修改的模型）

所以，我首先尝试了来自@DavidArenburg的快速/脏模型示例的建议，但是遇到了一些时间戳被分箱两次（进入连续的类别/间隔）的问题。

> foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2")) left right value timestamp timestamp2 1: 0.9 1.9 0.1885459 1 1 2: 0.9 1.9 0.0542375 2 2 # binned here 3: 1.9 2.9 0.0542375 2 2 # and here as well 13: 19.9 25.9 0.4579986 20 20

然后我读到minoverlap = 1L作为默认值，并意识到正常的时间戳是>> 1 。

> as.numeric(Sys.time()) [1] 1429022267

因此，如果我将所有内容都移到更大的值（例如下面示例中的n <- 10 ），一切都很顺利。

left right value timestamp timestamp2 1: 9 19 0.64971126 10 10 2: 19 29 0.75994751 20 20 3: 29 99 0.98276462 30 30 9: 199 259 0.89816165 200 200

凭借我的真实数据，一切进展顺利，再次感谢。

## Code for my data.table example ----- n <- 1 d <- data.table( value = runif(9), timestamp = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n, timestamp2 = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n) c <- data.table(left = c(0.9, 1.9, 2.9, 9.9, 19.9, 25.9)*n, right = c(1.9, 2.9, 9.9, 19.9, 25.9, 33.9)*n) setkey(c, left, right) foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))

更新2（在dplyr中加入，然后是FILTER）

我测试了@aosmith的建议，使用dplyr函数left_join()创建一个（非常）大的DF，然后再次使用filter() 。很快，我遇到了内存问题：

Error: std::bad_alloc

对于较小的表，这种方法可能是一个好主意 - 因为语法非常好（但这又是个人偏好）。在这种情况下，我将使用data.table解决方案。再次感谢所有建议。

I need some help with dplyr. I have two data frames - one huge, with several time series A,B,... in there (LargeDF), and a second one (Categories) with time intervals (left and right boundaries).

I would like to add another column to LargeDF, labeled leftBoundary, containing the appropriate boundary value, like so:

LargeDF ts timestamp signal # left_boundary 1 A 0.3209338 10.43279 # 0 2 A 1.4791524 10.34295 # 1 3 A 2.6007494 10.71601 # 2

and

Categories ts left right 1 A 0 1 2 A 1 2 3 A 2 3

My code I came up with is

LargeDF %>% group_by(ts) %>% do(myFUN(., Categories)) # calls this ... myFUN <- function(Large, Categ) { CategTS <- Categ %>% filter(ts == Large[1, "ts"][[1]]) Large %>% group_by(timestamp) %>% # this is bothering me... mutate(left_boundary = CategTS$left[CategTS$left < timestamp & timestamp < CategTS$right]) }

but it is super slow for large time series. I would really like to lose the group_by(timestamp), as they are unique within each ts anyways.

Does someone see a better solution? That would be much appreciated.

# Code for making the example data frames ... library("dplyr") n <- 10; series <- c("A", "B", "C") LargeDF <- data.frame( ts = rep(series, each = n) , timestamp = runif(n*length(series), max = 4) , signal = runif(n*length(series), min = 10, max = 11) ) %>% group_by(ts) %>% arrange(timestamp) m <- 7 Categories <- data.frame( ts = rep(series, each = m) , left = rep(seq(1 : m) - 1, length(series)) , right = rep(seq(1 : m), length(series)) )

Update (data.table and my slightly modified mockup)

So, I tried the suggestions from @DavidArenburg on a quick/dirty mockup-example first, but had the problem that some timestamps were binned twice (into successive categories/intervals).

> foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2")) left right value timestamp timestamp2 1: 0.9 1.9 0.1885459 1 1 2: 0.9 1.9 0.0542375 2 2 # binned here 3: 1.9 2.9 0.0542375 2 2 # and here as well 13: 19.9 25.9 0.4579986 20 20

I then read about minoverlap = 1L as a default and realized that a normal timestamp is >> 1.

> as.numeric(Sys.time()) [1] 1429022267

Therefore, if I shifted everything to larger values (e.g. n <- 10 in the example below), everything went fine.

left right value timestamp timestamp2 1: 9 19 0.64971126 10 10 2: 19 29 0.75994751 20 20 3: 29 99 0.98276462 30 30 9: 199 259 0.89816165 200 200

With my real data, everything went smoothly, so thanks again.

## Code for my data.table example ----- n <- 1 d <- data.table( value = runif(9), timestamp = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n, timestamp2 = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n) c <- data.table(left = c(0.9, 1.9, 2.9, 9.9, 19.9, 25.9)*n, right = c(1.9, 2.9, 9.9, 19.9, 25.9, 33.9)*n) setkey(c, left, right) foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))

Update 2 (JOIN, then FILTER, within dplyr)

I tested the suggestion from @aosmith to use the dplyr function left_join() to create one (very) large DF, then filter() this again. Very quickly, I ran into memory issues:

Error: std::bad_alloc

Probably, this approach would be a good idea for smaller tables - as the syntax is very nice (but this, again, is personal preference). I'll go for the data.table solution in this case. Thanks again for all suggestions.

最满意答案

dplyr不适合这样的操作，请尝试使用data.table的foverlaps函数

library(data.table) class(LargeDF) <- "data.frame" ## Removing all the dplyr classes setDT(LargeDF)[, `:=`(left = timestamp, right = timestamp)] # creating min and max boundaries in the large table setkey(setDT(Categories)) # keying by all columns (necessary for `foverlaps` to work) LargeDF[, left_boundary := foverlaps(LargeDF, Categories)$left][] # Creating left_boundary # ts timestamp signal left right left_boundary # 1: A 0.46771516 10.72175 0.46771516 0.46771516 0 # 2: A 0.58841492 10.35459 0.58841492 0.58841492 0 # 3: A 1.14494484 10.50301 1.14494484 1.14494484 1 # 4: A 1.18298225 10.82431 1.18298225 1.18298225 1 # 5: A 1.69822678 10.04780 1.69822678 1.69822678 1 # 6: A 1.83189609 10.75001 1.83189609 1.83189609 1 # 7: A 1.90947475 10.94715 1.90947475 1.90947475 1 # 8: A 2.73305266 10.14449 2.73305266 2.73305266 2 # 9: A 3.02371968 10.17724 3.02371968 3.02371968 3 # ...

dplyr isn't suitable for such operations, try data.tables foverlaps functions instead

library(data.table) class(LargeDF) <- "data.frame" ## Removing all the dplyr classes setDT(LargeDF)[, `:=`(left = timestamp, right = timestamp)] # creating min and max boundaries in the large table setkey(setDT(Categories)) # keying by all columns (necessary for `foverlaps` to work) LargeDF[, left_boundary := foverlaps(LargeDF, Categories)$left][] # Creating left_boundary # ts timestamp signal left right left_boundary # 1: A 0.46771516 10.72175 0.46771516 0.46771516 0 # 2: A 0.58841492 10.35459 0.58841492 0.58841492 0 # 3: A 1.14494484 10.50301 1.14494484 1.14494484 1 # 4: A 1.18298225 10.82431 1.18298225 1.18298225 1 # 5: A 1.69822678 10.04780 1.69822678 1.69822678 1 # 6: A 1.83189609 10.75001 1.83189609 1.83189609 1 # 7: A 1.90947475 10.94715 1.90947475 1.90947475 1 # 8: A 2.73305266 10.14449 2.73305266 2.73305266 2 # 9: A 3.02371968 10.17724 3.02371968 3.02371968 3 # ...

更多推荐

将来自两个数据帧的信息与dplyr组合在一起(Combine information from two data frames with dplyr)

最满意答案

发布评论取消回复

最近发表

热门文章

标签列表