Finding the nearest point to every point in a dataset

This post looks at how to find, for every point in a dataset, the distance to its nearest neighboring point in Python.

Question

I have a dataset as follows:

Id  Latitude  Longitude
1   25.42     55.47
2   25.39     55.47
3   24.48     54.38
4   24.51     54.54
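For reference, the snippets that follow assume this table has been loaded into a pandas DataFrame named data. A minimal way to build it (column names matching the code in the question) would be:

import pandas as pd

# The sample data from the question, as a pandas DataFrame
data = pd.DataFrame({
    'Id':        [1, 2, 3, 4],
    'Latitude':  [25.42, 25.39, 24.48, 24.51],
    'Longitude': [55.47, 55.47, 54.38, 54.54],
})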

I want to find the nearest distance for every point in the dataset. I found the following distance function on the internet:

from math import radians, cos, sin, asin, sqrt

def distance(lon1, lat1, lon2, lat2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a))
    km = 6367 * c
    return km
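For example, applied to the first two sample points above (same longitude, 0.03 degrees apart in latitude), a call would look like this and return roughly 3.3 km:

# Distance between points 1 and 2 from the sample data
d = distance(55.47, 25.42, 55.47, 25.39)   # arguments are lon1, lat1, lon2, lat2
print(d)                                   # approximately 3.33 (km)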

I am using the following code:

shortest_distance = []
for i in range(1, len(data)):
    distance1 = []
    for j in range(1, len(data)):
        distance1.append(distance(data['Longitude'][i], data['Latitude'][i],
                                  data['Longitude'][j], data['Latitude'][j]))
    shortest_distance.append(min(distance1))

But this code loops over the whole dataset for every entry, so it performs n^2 iterations, and as a result it is very slow. My dataset contains nearly 1 million records, and looping through all the elements for each one is becoming very costly.

I want to find a better way to work out the nearest point for each row. Can anybody help me find a way to solve this in Python?

Thanks

Answer

The brute force method of finding the nearest of N points to a given point is O(N) -- you'd have to check each point. In contrast, if the N points are stored in a KD-tree, then finding the nearest point is on average O(log(N)). There is also the additional one-time cost of building the KD-tree, which requires O(N) time.

If you need to repeat this process N times, then the brute force method is O(N**2) and the kd-tree method is O(N*log(N)). Thus, for large enough N, the KD-tree will beat the brute force method.
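For a rough sense of scale with the asker's ~1 million records, here is a back-of-the-envelope count that ignores constant factors and the cost of building the tree:

import math

N = 1_000_000                 # roughly the size of the asker's dataset
brute_force = N ** 2          # pairwise distance evaluations
kdtree = N * math.log2(N)     # one O(log N) query per point

print(f"{brute_force:.1e} vs {kdtree:.1e}")   # 1.0e+12 vs 2.0e+07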

See here for more on nearest neighbor algorithms (including KD-tree).

Below (in the function using_kdtree) is a way to compute the great circle arclengths of nearest neighbors using scipy.spatial.kdtree.

scipy.spatial.kdtree uses the Euclidean distance between points, but there is a formula for converting Euclidean chord distances between points on a sphere to great circle arclengths (given the radius of the sphere). So the idea is to convert the latitude/longitude data into Cartesian coordinates, use the KDTree to find the nearest neighbors, and then apply the great circle distance formula to obtain the desired result.
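As a minimal sketch of just those two conversions, using the four sample points from the question (the full version, including the KDTree query, is in using_kdtree below):

import numpy as np

R = 6367  # sphere radius in km, matching the answer's code

# The four sample points from the question, in decimal degrees
lat_deg = np.array([25.42, 25.39, 24.48, 24.51])
lon_deg = np.array([55.47, 55.47, 54.38, 54.54])

# Step 1: latitude/longitude -> 3D Cartesian coordinates on a sphere of radius R
phi = np.deg2rad(lat_deg)
theta = np.deg2rad(lon_deg)
xyz = np.column_stack([R * np.cos(phi) * np.cos(theta),
                       R * np.cos(phi) * np.sin(theta),
                       R * np.sin(phi)])

# Step 2: Euclidean chord length between two such points -> great circle arc length
def chord_to_arclength(chord):
    return R * 2 * np.arcsin(chord / (2.0 * R))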

Here are some benchmarks. Using N = 100, using_kdtree is 39x faster than the orig (brute force) method.

In [180]: %timeit using_kdtree(data)
100 loops, best of 3: 18.6 ms per loop

In [181]: %timeit using_sklearn(data)
1 loop, best of 3: 214 ms per loop

In [179]: %timeit orig(data)
1 loop, best of 3: 728 ms per loop

For N = 10000:

In [5]: %timeit using_kdtree(data)
1 loop, best of 3: 2.78 s per loop

In [6]: %timeit using_sklearn(data)
1 loop, best of 3: 1min 15s per loop

In [7]: %timeit orig(data)  # untested; too slow

Since using_kdtree is O(N log(N)) and orig is O(N**2), the factor by which using_kdtree is faster than orig will grow as N, the length of data, grows.

import numpy as np
import scipy.spatial as spatial
import pandas as pd
import sklearn.neighbors as neighbors
from math import radians, cos, sin, asin, sqrt

R = 6367

def using_kdtree(data):
    "Based on stackoverflow/q/43020919/190597"
    def dist_to_arclength(chord_length):
        """
        en.wikipedia/wiki/Great-circle_distance
        Convert Euclidean chord length to great circle arc length
        """
        central_angle = 2*np.arcsin(chord_length/(2.0*R))
        arclength = R*central_angle
        return arclength

    phi = np.deg2rad(data['Latitude'])
    theta = np.deg2rad(data['Longitude'])
    data['x'] = R * np.cos(phi) * np.cos(theta)
    data['y'] = R * np.cos(phi) * np.sin(theta)
    data['z'] = R * np.sin(phi)
    tree = spatial.KDTree(data[['x', 'y', 'z']])
    distance, index = tree.query(data[['x', 'y', 'z']], k=2)
    return dist_to_arclength(distance[:, 1])

def orig(data):
    def distance(lon1, lat1, lon2, lat2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat/2.0)**2 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2
        c = 2 * asin(sqrt(a))
        km = R * c
        return km

    shortest_distance = []
    for i in range(len(data)):
        distance1 = []
        for j in range(len(data)):
            if i == j:
                continue
            distance1.append(distance(data['Longitude'][i], data['Latitude'][i],
                                      data['Longitude'][j], data['Latitude'][j]))
        shortest_distance.append(min(distance1))
    return shortest_distance

def using_sklearn(data):
    """
    Based on stackoverflow/a/45127250/190597 (Jonas Adler)
    """
    def distance(p1, p2):
        """
        Calculate the great circle distance between two points
        on the earth (specified in decimal degrees)
        """
        lon1, lat1 = p1
        lon2, lat2 = p2
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
        c = 2 * np.arcsin(np.sqrt(a))
        km = R * c
        return km

    points = data[['Longitude', 'Latitude']]
    nbrs = neighbors.NearestNeighbors(n_neighbors=2, metric=distance).fit(points)
    distances, indices = nbrs.kneighbors(points)
    result = distances[:, 1]
    return result

np.random.seed(2017)
N = 1000
data = pd.DataFrame({'Latitude': np.random.uniform(-90, 90, size=N),
                     'Longitude': np.random.uniform(0, 360, size=N)})

expected = orig(data)
for func in [using_kdtree, using_sklearn]:
    result = func(data)
    assert np.allclose(expected, result)
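As a usage sketch for the asker's full dataset (the CSV filename and exact column names here are assumptions, not part of the original answer):

# Hypothetical usage -- 'points.csv' is assumed to contain Id, Latitude and Longitude columns
df = pd.read_csv('points.csv')
df['nearest_km'] = using_kdtree(df)   # adds x/y/z helper columns, returns nearest-neighbor arclengths in km
print(df[['Id', 'nearest_km']].head())

For a dataset of around 1 million rows, scipy.spatial.cKDTree (the C implementation of the same interface) is usually a faster drop-in replacement on older SciPy versions; in recent SciPy releases KDTree itself is backed by the C implementation.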
