Need to quickly subset a very big ncdf using a set of coordinates

I have a netcdf file which contains a float array (21600, 43200). I don't want to read the entire array into RAM because it's too large, so I'm using the Dataset object from the netCDF4 library to read the array.

I would like to calculate the mean of a subset of this array using two 1D numpy arrays (x_coords, y_coords) of 300-400 coordinates.

I don't think I can use basic indexing, because the coordinates I have aren't contiguous. What I'm currently doing is just feeding the arrays directly into the object, like so:

    ncdf_data = Dataset(file, 'r')
    mean = np.mean(ncdf_data.variables['q'][x_coords, y_coords])

The above code takes far too long for my liking (~3-4 seconds depending on the coordinates I'm using), and I'd like to speed this up somehow. Is there a pythonic way that I can use to directly work out the mean from such a subset without triggering fancy indexing?

Accepted answer

I know h5py warns about the slow speed of fancy indexing (docs.h5py.org/en/latest/high/dataset.html#fancy-indexing), and netCDF4 probably has the same problem.

Can you load a contiguous slice that contains all of the values, and then apply the faster numpy advanced indexing to that subset? Or you may have to work in chunks.
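
A minimal sketch of that bounding-box idea, assuming the variable is called 'q' as in the question, that x_coords and y_coords are 1D integer numpy arrays, and that each (x_coords[i], y_coords[i]) pair names one point; the helper name subset_mean is made up for illustration:

    import numpy as np
    from netCDF4 import Dataset

    def subset_mean(path, x_coords, y_coords):
        """Mean of 'q' at the given points, read via one contiguous slice."""
        with Dataset(path, 'r') as nc:
            var = nc.variables['q']
            # Find the bounding box of the requested points and read it as one
            # contiguous slice: a single seek-and-read instead of hundreds of
            # scattered ones. Only worthwhile if the box fits in RAM.
            x0, x1 = int(x_coords.min()), int(x_coords.max()) + 1
            y0, y1 = int(y_coords.min()), int(y_coords.max()) + 1
            block = var[x0:x1, y0:y1]        # basic slicing -> fast
        # numpy advanced indexing on the in-memory block; this pairs the
        # coordinates element-wise.
        return np.mean(block[x_coords - x0, y_coords - y0])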

numpy advanced indexing is slower than its basic slicing, but it is still quite a bit faster than fancy indexing directly off the file.

However you do it, np.mean will be operating on data in memory, not directly on data in the file. The slowness of fancy indexing is because it has to access data scattered throughout the file. Loading the data into an array in memory isn't the slow part; the slow part is seeking and reading from the file.
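
If you want to confirm that, a rough timing sketch, assuming file, x_coords and y_coords are the objects from the question:

    import time
    import numpy as np
    from netCDF4 import Dataset

    nc = Dataset(file, 'r')

    t0 = time.perf_counter()
    sub = nc.variables['q'][x_coords, y_coords]   # scattered reads from the file
    t1 = time.perf_counter()
    mean = np.mean(sub)                           # pure in-memory work
    t2 = time.perf_counter()

    print(f"indexing the file: {t1 - t0:.3f} s, np.mean: {t2 - t1:.3f} s")
    nc.close()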

Putting the file on a faster drive (e.g. a solid state one) might help.
