将来自两个数据帧的信息与dplyr组合在一起(Combine information from two data frames with dplyr)

编程入门 行业动态 更新时间:2024-10-24 04:44:47
将来自两个数据帧的信息与dplyr组合在一起(Combine information from two data frames with dplyr)

我需要一些dplyr的帮助。 我有两个数据框 - 一个是巨大的,有几个时间序列A,B,...在那里( LargeDF ),第二个( Categories )有时间间隔(左边界和右边界)。

我想向LargeDF添加另一个列,标记为leftBoundary ,包含适当的边界值,如下所示:

LargeDF ts timestamp signal # left_boundary 1 A 0.3209338 10.43279 # 0 2 A 1.4791524 10.34295 # 1 3 A 2.6007494 10.71601 # 2

Categories ts left right 1 A 0 1 2 A 1 2 3 A 2 3

我想出的代码是

LargeDF %>% group_by(ts) %>% do(myFUN(., Categories)) # calls this ... myFUN <- function(Large, Categ) { CategTS <- Categ %>% filter(ts == Large[1, "ts"][[1]]) Large %>% group_by(timestamp) %>% # this is bothering me... mutate(left_boundary = CategTS$left[CategTS$left < timestamp & timestamp < CategTS$right]) }

但是对于大型时间序列来说它是超级慢的。 我真的想失去group_by(timestamp) ,因为它们在每个ts都是独一无二的。

有人看到更好的解决方案吗? 非常感谢。

# Code for making the example data frames ... library("dplyr") n <- 10; series <- c("A", "B", "C") LargeDF <- data.frame( ts = rep(series, each = n) , timestamp = runif(n*length(series), max = 4) , signal = runif(n*length(series), min = 10, max = 11) ) %>% group_by(ts) %>% arrange(timestamp) m <- 7 Categories <- data.frame( ts = rep(series, each = m) , left = rep(seq(1 : m) - 1, length(series)) , right = rep(seq(1 : m), length(series)) )

更新(data.table和我略微修改的模型)

所以,我首先尝试了来自@DavidArenburg的快速/脏模型示例的建议,但是遇到了一些时间戳被分箱两次(进入连续的类别/间隔)的问题。

> foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2")) left right value timestamp timestamp2 1: 0.9 1.9 0.1885459 1 1 2: 0.9 1.9 0.0542375 2 2 # binned here 3: 1.9 2.9 0.0542375 2 2 # and here as well 13: 19.9 25.9 0.4579986 20 20

然后我读到minoverlap = 1L作为默认值,并意识到正常的时间戳是>> 1 。

> as.numeric(Sys.time()) [1] 1429022267

因此,如果我将所有内容都移到更大的值(例如下面示例中的n <- 10 ),一切都很顺利。

left right value timestamp timestamp2 1: 9 19 0.64971126 10 10 2: 19 29 0.75994751 20 20 3: 29 99 0.98276462 30 30 9: 199 259 0.89816165 200 200

凭借我的真实数据,一切进展顺利,再次感谢。

## Code for my data.table example ----- n <- 1 d <- data.table( value = runif(9), timestamp = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n, timestamp2 = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n) c <- data.table(left = c(0.9, 1.9, 2.9, 9.9, 19.9, 25.9)*n, right = c(1.9, 2.9, 9.9, 19.9, 25.9, 33.9)*n) setkey(c, left, right) foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))

更新2(在dplyr中加入,然后是FILTER)

我测试了@aosmith的建议,使用dplyr函数left_join()创建一个(非常)大的DF,然后再次使用filter() 。 很快,我遇到了内存问题:

Error: std::bad_alloc

对于较小的表,这种方法可能是一个好主意 - 因为语法非常好(但这又是个人偏好)。 在这种情况下,我将使用data.table解决方案。 再次感谢所有建议。

I need some help with dplyr. I have two data frames - one huge, with several time series A,B,... in there (LargeDF), and a second one (Categories) with time intervals (left and right boundaries).

I would like to add another column to LargeDF, labeled leftBoundary, containing the appropriate boundary value, like so:

LargeDF ts timestamp signal # left_boundary 1 A 0.3209338 10.43279 # 0 2 A 1.4791524 10.34295 # 1 3 A 2.6007494 10.71601 # 2

and

Categories ts left right 1 A 0 1 2 A 1 2 3 A 2 3

My code I came up with is

LargeDF %>% group_by(ts) %>% do(myFUN(., Categories)) # calls this ... myFUN <- function(Large, Categ) { CategTS <- Categ %>% filter(ts == Large[1, "ts"][[1]]) Large %>% group_by(timestamp) %>% # this is bothering me... mutate(left_boundary = CategTS$left[CategTS$left < timestamp & timestamp < CategTS$right]) }

but it is super slow for large time series. I would really like to lose the group_by(timestamp), as they are unique within each ts anyways.

Does someone see a better solution? That would be much appreciated.

# Code for making the example data frames ... library("dplyr") n <- 10; series <- c("A", "B", "C") LargeDF <- data.frame( ts = rep(series, each = n) , timestamp = runif(n*length(series), max = 4) , signal = runif(n*length(series), min = 10, max = 11) ) %>% group_by(ts) %>% arrange(timestamp) m <- 7 Categories <- data.frame( ts = rep(series, each = m) , left = rep(seq(1 : m) - 1, length(series)) , right = rep(seq(1 : m), length(series)) )

Update (data.table and my slightly modified mockup)

So, I tried the suggestions from @DavidArenburg on a quick/dirty mockup-example first, but had the problem that some timestamps were binned twice (into successive categories/intervals).

> foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2")) left right value timestamp timestamp2 1: 0.9 1.9 0.1885459 1 1 2: 0.9 1.9 0.0542375 2 2 # binned here 3: 1.9 2.9 0.0542375 2 2 # and here as well 13: 19.9 25.9 0.4579986 20 20

I then read about minoverlap = 1L as a default and realized that a normal timestamp is >> 1.

> as.numeric(Sys.time()) [1] 1429022267

Therefore, if I shifted everything to larger values (e.g. n <- 10 in the example below), everything went fine.

left right value timestamp timestamp2 1: 9 19 0.64971126 10 10 2: 19 29 0.75994751 20 20 3: 29 99 0.98276462 30 30 9: 199 259 0.89816165 200 200

With my real data, everything went smoothly, so thanks again.

## Code for my data.table example ----- n <- 1 d <- data.table( value = runif(9), timestamp = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n, timestamp2 = c(1, 2, 3, 5, 7, 10, 15, 18, 20)*n) c <- data.table(left = c(0.9, 1.9, 2.9, 9.9, 19.9, 25.9)*n, right = c(1.9, 2.9, 9.9, 19.9, 25.9, 33.9)*n) setkey(c, left, right) foverlaps(d, c, type="any", by.x = c("timestamp", "timestamp2"))

Update 2 (JOIN, then FILTER, within dplyr)

I tested the suggestion from @aosmith to use the dplyr function left_join() to create one (very) large DF, then filter() this again. Very quickly, I ran into memory issues:

Error: std::bad_alloc

Probably, this approach would be a good idea for smaller tables - as the syntax is very nice (but this, again, is personal preference). I'll go for the data.table solution in this case. Thanks again for all suggestions.

最满意答案

dplyr不适合这样的操作,请尝试使用data.table的foverlaps函数

library(data.table) class(LargeDF) <- "data.frame" ## Removing all the dplyr classes setDT(LargeDF)[, `:=`(left = timestamp, right = timestamp)] # creating min and max boundaries in the large table setkey(setDT(Categories)) # keying by all columns (necessary for `foverlaps` to work) LargeDF[, left_boundary := foverlaps(LargeDF, Categories)$left][] # Creating left_boundary # ts timestamp signal left right left_boundary # 1: A 0.46771516 10.72175 0.46771516 0.46771516 0 # 2: A 0.58841492 10.35459 0.58841492 0.58841492 0 # 3: A 1.14494484 10.50301 1.14494484 1.14494484 1 # 4: A 1.18298225 10.82431 1.18298225 1.18298225 1 # 5: A 1.69822678 10.04780 1.69822678 1.69822678 1 # 6: A 1.83189609 10.75001 1.83189609 1.83189609 1 # 7: A 1.90947475 10.94715 1.90947475 1.90947475 1 # 8: A 2.73305266 10.14449 2.73305266 2.73305266 2 # 9: A 3.02371968 10.17724 3.02371968 3.02371968 3 # ...

dplyr isn't suitable for such operations, try data.tables foverlaps functions instead

library(data.table) class(LargeDF) <- "data.frame" ## Removing all the dplyr classes setDT(LargeDF)[, `:=`(left = timestamp, right = timestamp)] # creating min and max boundaries in the large table setkey(setDT(Categories)) # keying by all columns (necessary for `foverlaps` to work) LargeDF[, left_boundary := foverlaps(LargeDF, Categories)$left][] # Creating left_boundary # ts timestamp signal left right left_boundary # 1: A 0.46771516 10.72175 0.46771516 0.46771516 0 # 2: A 0.58841492 10.35459 0.58841492 0.58841492 0 # 3: A 1.14494484 10.50301 1.14494484 1.14494484 1 # 4: A 1.18298225 10.82431 1.18298225 1.18298225 1 # 5: A 1.69822678 10.04780 1.69822678 1.69822678 1 # 6: A 1.83189609 10.75001 1.83189609 1.83189609 1 # 7: A 1.90947475 10.94715 1.90947475 1.90947475 1 # 8: A 2.73305266 10.14449 2.73305266 2.73305266 2 # 9: A 3.02371968 10.17724 3.02371968 3.02371968 3 # ...

更多推荐

本文发布于:2023-08-07 23:48:00,感谢您对本站的认可!
本文链接:https://www.elefans.com/category/jswz/34/1466416.html
版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
本文标签:组合   两个   数据   信息   frames

发布评论

评论列表 (有 0 条评论)
草根站长

>www.elefans.com

编程频道|电子爱好者 - 技术资讯及电子产品介绍!