Improve efficiency of loops in R for date comparison and dataset creation

Updated: 2024-10-28 00:21:31

I have a dataset named DateTime containing a column of IDs, a column with the start date of each visit and a column with the end date of each visit. I want to create a dataset with two columns, where the first gives the date and hour of the day and the second gives the ID that is present. So if two IDs are present at a certain hour of a certain date, this produces two rows. To do this I created the data frame Presence to store these two columns and gave the date column the right format. I also have a vector Dates containing all the possible dates and hours between the first start date and the last end date.

I wrote an outer for loop over every ID and an inner for loop over every date; if a date falls strictly between the start and end of a visit, a row is stored in Presence. However, I have to run this over a dataset containing 60 000 IDs and 11 000 possible date-hours. It has now been running for over 4 hours. This doesn't really surprise me, but there must be a faster way to implement this.

    # Preallocate the result with 5,000,000 empty rows
    Presence <- data.frame(matrix(vector(), 5000000, 2), stringsAsFactors = FALSE)
    Presence <- data.frame(Date = Presence[, 1], ID = Presence[, 2])
    Presence$Date <- as.POSIXct(strptime(Presence$Date, format = "%Y-%m-%d %H:%M:%S"), tz = "Europe/Brussels")

    k <- 1
    for (i in seq_along(DateTime$ID)) {      # every visit
      for (j in seq_along(Dates)) {          # every date-hour
        if ((DateTime$START_DATE[i] < Dates[j]) && (DateTime$END_DATE[i] > Dates[j])) {
          Presence$Date[k] <- as.POSIXct(strptime(Dates[j], "%Y-%m-%d %H:%M:%S"), tz = "Europe/Brussels")
          Presence$ID[k] <- DateTime$ID[i]
          k <- k + 1
        }
      }
    }

Can someone help me with this? I'm no R expert, so I might be going about this in an unnecessarily roundabout way. Thanks!
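For readers who want to reproduce the setup, an hourly vector like Dates could be built along the following lines. This is only a sketch: the toy DateTime values are made up, and the helper names first_hour and last_hour are not from the original post.

```r
# Toy stand-in for the DateTime data frame described above (values are made up)
tz <- "Europe/Brussels"
DateTime <- data.frame(
  ID = c(1, 2),
  START_DATE = as.POSIXct(c("2023-01-01 08:30:00", "2023-01-01 09:15:00"), tz = tz),
  END_DATE   = as.POSIXct(c("2023-01-01 11:10:00", "2023-01-01 10:45:00"), tz = tz)
)

# Round the first start down to the whole hour, and push the last end
# to the following whole hour so the grid covers every visit
first_hour <- as.POSIXct(trunc(min(DateTime$START_DATE), units = "hours"))
last_hour  <- as.POSIXct(trunc(max(DateTime$END_DATE), units = "hours")) + 3600

# One entry per hour between the two bounds (inclusive)
Dates <- seq(first_hour, last_hour, by = "hour")
```

If the last end date already falls exactly on the hour, this adds one extra hour at the end, which is harmless for the comparison.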

Best answer

The operation you are attempting to perform is known as an overlap join, for which the data.table::foverlaps function is an efficient implementation in R. The following should produce what you want:

    library(data.table)

    # Time points to test; substitute your own hourly vector here if needed
    UniqueDates <- sort(unique(c(DateTime$START_DATE, DateTime$END_DATE)))

    # foverlaps() joins intervals, so each time point becomes a zero-length interval
    Dates <- data.table(Date = UniqueDates, Date1 = UniqueDates, Date2 = UniqueDates)

    # The interval columns must be the last key columns of the second table
    setDT(DateTime)
    setkey(DateTime, START_DATE, END_DATE)

    # Intervals are treated as closed, so a time point equal to a start or end date also matches
    Presence <- foverlaps(Dates, DateTime, by.x = c("Date1", "Date2"),
                          type = "within", mult = "all", nomatch = 0L)
    setDF(Presence)
    Presence <- Presence[, c("Date", "ID")]

You will likely need to modify the input date vector to suit your needs. If your available memory does not allow running this on the full dataset, you may have to apply the above to subsets of your input data.frame and combine the results afterwards.
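As a rough illustration of the overlap join and of the "subsets, then combine" advice, here is a small self-contained sketch on made-up data. The helper presence_for, the table name grid and the chunk size of 2 are assumptions for the example, not part of the original answer.

```r
library(data.table)

tz <- "Europe/Brussels"

# Made-up visits
DateTime <- data.table(
  ID = c(1L, 2L, 3L),
  START_DATE = as.POSIXct(c("2023-01-01 08:30:00", "2023-01-01 09:15:00",
                            "2023-01-01 12:20:00"), tz = tz),
  END_DATE   = as.POSIXct(c("2023-01-01 11:10:00", "2023-01-01 10:45:00",
                            "2023-01-01 12:40:00"), tz = tz)
)

# Hourly grid, each hour stored as a zero-length interval
Dates <- seq(as.POSIXct("2023-01-01 08:00:00", tz = tz),
             as.POSIXct("2023-01-01 13:00:00", tz = tz), by = "hour")
grid <- data.table(Date = Dates, Date1 = Dates, Date2 = Dates)

# Hypothetical helper: overlap join of the hourly grid against a set of visits
presence_for <- function(visits, grid) {
  visits <- copy(visits)                  # leave the caller's table untouched
  setkey(visits, START_DATE, END_DATE)    # interval columns must be keyed
  out <- foverlaps(grid, visits, by.x = c("Date1", "Date2"),
                   type = "within", mult = "all", nomatch = 0L)
  out[, .(Date, ID)]
}

# Whole table at once
Presence <- presence_for(DateTime, grid)

# Same result in chunks of 2 visits, combined afterwards
chunk_id <- ceiling(seq_len(nrow(DateTime)) / 2)
PresenceChunked <- rbindlist(lapply(unique(chunk_id), function(k) {
  presence_for(DateTime[chunk_id == k], grid)
}))
```

On this toy data, ID 1 is present at 09:00, 10:00 and 11:00, ID 2 at 10:00, and ID 3 at no whole hour, so both Presence and PresenceChunked have four rows. In practice the chunk size would be chosen to fit available memory.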

Published: 2023-07-31 13:45:00

Source: https://www.elefans.com/category/jswz/34/1344961.html