Problem description
I have a large dataset (2.6M rows) with two zip codes and the corresponding latitudes and longitudes, and I am trying to compute the distance between them. I am primarily using the package geosphere to calculate the Vincenty ellipsoid distance between the zip codes, but it is taking a massive amount of time for my dataset. What would be a fast way to implement this?
My attempt
library(tidyverse)
library(geosphere)
zipdata <- select(fulldata, originlat, originlong, destlat, destlong)
## Very basic approach
for (i in seq_len(nrow(zipdata))) {
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i], zipdata$originlong[i]),
                            c(zipdata$destlat[i], zipdata$destlong[i]),
                            fun = distVincentyEllipsoid)
}
## Tidyverse approach
zipdata <- zipdata %>%
  mutate(dist2 = distm(cbind(originlat, originlong), cbind(destlat, destlong),
                       fun = distHaversine))
Both of these methods are extremely slow. I understand that 2.1M rows will never be a "fast" calculation, but I think it can be made faster. I tried the following approach on a smaller test dataset, without any luck:
library(doParallel)
cores <- 15
cl <- makeCluster(cores)
registerDoParallel(cl)
test <- select(head(fulldata, n = 1000), originlat, originlong, destlat, destlong)
foreach(i = seq_len(nrow(test))) %dopar% {
  library(geosphere)
  zipdata$dist1[i] <- distm(c(zipdata$originlat[i], zipdata$originlong[i]),
                            c(zipdata$destlat[i], zipdata$destlong[i]),
                            fun = distVincentyEllipsoid)
}
stopCluster(cl)
Can anyone help me out with either the correct way to use doParallel with geosphere or a better way to handle this?
Benchmark of (some of) the responses
## benchmark
library(dplyr)      # sample_n()
library(geosphere)  # distHaversine()
library(geodist)    # geodist()
library(microbenchmark)
zipsamp <- sample_n(zip, size = 1000000)
microbenchmark(
  dave = {
    # Dave2e: vectorized geosphere call
    zipsamp$dist1 <- distHaversine(cbind(zipsamp$patlong, zipsamp$patlat),
                                   cbind(zipsamp$faclong, zipsamp$faclat))
  },
  geohav = {
    zipsamp$dist2 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "haversine")
  },
  geovin = {
    zipsamp$dist3 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "vincenty")
  },
  geocheap = {
    zipsamp$dist4 <- geodist(cbind(long = zipsamp$patlong, lat = zipsamp$patlat),
                             cbind(long = zipsamp$faclong, lat = zipsamp$faclat),
                             paired = TRUE, measure = "cheap")
  },
  unit = "s", times = 100)
# Unit: seconds
#      expr        min         lq       mean     median         uq        max neval cld
#      dave 0.28289613 0.32010753 0.36724810 0.32407858 0.32991396 2.52930556   100   d
#    geohav 0.15820531 0.17053853 0.18271300 0.17307864 0.17531687 1.14478521   100   b
#    geovin 0.23401878 0.24261274 0.26612401 0.24572869 0.24800670 1.26936889   100   c
#  geocheap 0.01910599 0.03094614 0.03142404 0.03126502 0.03203542 0.03607961   100   a
A simple all.equal test showed that for my dataset the haversine method is equal to the vincenty method, but has a "Mean relative difference: 0.01002573" with the "cheap" method from the geodist package.
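For readers who want to reproduce this kind of comparison, here is a minimal sketch. The coordinates are randomly generated stand-ins (an assumption, not the original data), confined to a small region so that the "cheap" approximation is used in the range it is meant for. The last line recomputes by hand the statistic that all.equal reports, which is the mean absolute difference divided by the mean absolute value of the first argument, taken over the elements that differ.

```r
library(geodist)

# Hypothetical stand-in data: 1,000 coordinate pairs in a small lon/lat box
set.seed(42)
n <- 1000
from <- cbind(lon = runif(n, -75, -73), lat = runif(n, 40, 42))
to   <- cbind(lon = runif(n, -75, -73), lat = runif(n, 40, 42))

# paired = TRUE gives one distance per row (row i of `from` to row i of `to`)
d_hav   <- geodist(from, to, paired = TRUE, measure = "haversine")
d_vin   <- geodist(from, to, paired = TRUE, measure = "vincenty")
d_cheap <- geodist(from, to, paired = TRUE, measure = "cheap")

# TRUE if equal within tolerance, otherwise a "Mean relative difference" message
all.equal(d_hav, d_vin)
all.equal(d_hav, d_cheap)

# The same relative difference computed by hand
rel_diff <- sum(abs(d_hav - d_cheap)) / sum(abs(d_hav))
rel_diff
```

The size of the difference depends on the distances involved, so the value obtained here will not match the figure quoted above.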
Recommended answer
R is a vectorized language, so the function operates over all elements of its input vectors at once. Since you are calculating the distance between the origin and destination for each row, the loop is unnecessary. The vectorized approach is approximately 1000x faster than the loop.
Calling distVincentyEllipsoid (or distHaversine, etc.) directly and bypassing the distm function should also improve performance.
Without any sample data this snippet is untested.
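library(dplyr)      # select()
library(geosphere)
zipdata <- select(fulldata, originlat, originlong, destlat, destlong)
## Vectorized approach: one call over all rows.
## cbind() builds two-column (longitude, latitude) matrices, one row per point.
zipdata$dist1 <- distVincentyEllipsoid(cbind(zipdata$originlong, zipdata$originlat),
                                       cbind(zipdata$destlong, zipdata$destlat))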
Note: for most geosphere functions to work correctly, the coordinates must be given as longitude first, then latitude.
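If parallelism is still wanted, the vectorized call can be combined with doParallel by giving each worker a block of rows rather than a single row. The sketch below is one way to do that, not the only one. The data frame is a randomly generated stand-in, and the worker count of 2 is arbitrary. Two points differ from the attempt in the question: results are returned as the value of the foreach body and combined with .combine, because assignments made inside %dopar% only change the worker's copy of the data, and geosphere is loaded on the workers through .packages.

```r
library(geosphere)
library(doParallel)  # also attaches foreach and parallel

# Hypothetical stand-in for the real data: 1,000 random coordinate pairs
set.seed(1)
n <- 1000
zipdata <- data.frame(
  originlong = runif(n, -120, -70), originlat = runif(n, 25, 50),
  destlong   = runif(n, -120, -70), destlat   = runif(n, 25, 50)
)

n_workers <- 2  # arbitrary choice for this sketch
cl <- parallel::makeCluster(n_workers)
registerDoParallel(cl)

# One contiguous block of row indices per worker
chunks <- parallel::splitIndices(nrow(zipdata), n_workers)

dist_parallel <- foreach(idx = chunks, .combine = c,
                         .packages = "geosphere") %dopar% {
  # Vectorized over the whole block; columns are longitude, then latitude
  distVincentyEllipsoid(
    cbind(zipdata$originlong[idx], zipdata$originlat[idx]),
    cbind(zipdata$destlong[idx], zipdata$destlat[idx])
  )
}
stopCluster(cl)

# Reference result from a single vectorized call
dist_single <- distVincentyEllipsoid(
  cbind(zipdata$originlong, zipdata$originlat),
  cbind(zipdata$destlong, zipdata$destlat)
)
all.equal(as.numeric(dist_parallel), as.numeric(dist_single))
```

Given that the benchmark above handles one million rows in well under a second with a single vectorized call, the overhead of starting a cluster may outweigh any gain for data of this size.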
The tidyverse approach listed above is slow because the distm function calculates the distance between every origin and every destination, which would result in a 2 million by 2 million element matrix.
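A small illustration of the difference, using three made-up coordinate pairs (the values are hypothetical): distm returns the full origin-by-destination matrix, while calling the distance function directly returns one value per row. The paired distances are simply the diagonal of that matrix, so everything off the diagonal is wasted work for this problem.

```r
library(geosphere)

# Hypothetical (longitude, latitude) coordinates for three origin/destination pairs
origin <- cbind(lon = c(-73.99, -87.63, -122.42), lat = c(40.73, 41.88, 37.77))
dest   <- cbind(lon = c(-118.24, -95.37, -80.19), lat = c(34.05, 29.76, 25.76))

full   <- distm(origin, dest, fun = distHaversine)  # every origin against every destination
paired <- distHaversine(origin, dest)               # row i against row i only

dim(full)       # number of origins by number of destinations
length(paired)  # one distance per row

# The paired result is the diagonal of the full matrix
all.equal(diag(full), as.numeric(paired))

# Memory for a dense 2e6 x 2e6 matrix of 8-byte doubles, in terabytes
matrix_tb <- 2e6 * 2e6 * 8 / 1e12
matrix_tb
```

At 8 bytes per double, a dense matrix of that size would need about 32 TB, which is why the distm version cannot finish on the full dataset.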