For loops with Dask arrays and/or h5py

Updated: 2024-10-28 00:17:50

I have a time series with over a hundred million rows of data. I am trying to reshape it to include a time window. My sample data is of shape (79499, 9) and I am trying to reshape it to (79979, 10, 9). The following for loop works fine in numpy.

    import numpy as np

    def munge(data, backprop_window):
        result = []
        for index in range(len(data) - backprop_window):
            result.append(data[index: index + backprop_window])
        return np.array(result)

    X_train = munge(X_train, backprop_window)
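For reference, the loop returns `len(data) - backprop_window` windows, so the first dimension shrinks by the window length. The sketch below checks this on a zero-filled stand-in for the sample data; the window length of 10 is an assumption inferred from the target shape, and `sample` is a hypothetical placeholder rather than the real data.

```python
import numpy as np


def munge(data, backprop_window):
    # Same loop as in the question: one window per valid start row.
    result = []
    for index in range(len(data) - backprop_window):
        result.append(data[index: index + backprop_window])
    return np.array(result)


# Hypothetical stand-in with the sample shape from the question.
sample = np.zeros((79499, 9), dtype="float32")

# Assumed window length of 10 (taken from the target shape's middle axis).
windows = munge(sample, backprop_window=10)
print(windows.shape)  # (79489, 10, 9)
```

With these inputs a (79499, 9) array comes out as (79489, 10, 9), that is, (rows - window, window, features).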

I have tried a few variations with dask, but all of them seem to hang without giving any error messages, including this one:

    import h5py
    import dask.array as da

    f1 = h5py.File("data.hdf5")
    X_train = f1.create_dataset('X_train',data = X_train, dtype='float32')
    x = da.from_array(X_train, chunks=(10000, d.shape[1]))
    result = x.compute(munge(x, backprop_window))

Any wise thoughts appreciated.

Best answer

This doesn't necessarily solve your dask issue, but as a much faster alternative to munge, you could instead use numpy's stride_tricks to create a rolling view into your data (based on example here).

    def munge_strides(data, backprop_window):
        """ take a rolling view into array by manipulating strides """
        from numpy.lib.stride_tricks import as_strided
        new_shape = (data.shape[0] - backprop_window, backprop_window, data.shape[1])
        new_strides = (data.strides[0], data.strides[0], data.strides[1])
        return as_strided(data, shape=new_shape, strides=new_strides)

    X_train = np.arange(100).reshape(20, 5)
    np.array_equal(munge(X_train, backprop_window=3), munge_strides(X_train, backprop_window=3))
    Out[112]: True

as_strided needs to be used very carefully: it is an 'advanced' feature, and incorrect shape or stride parameters can easily lead to segfaults. See its docstring.
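As a possibly safer variant of the same idea, NumPy 1.20 and later provide `sliding_window_view`, which validates the window against the array shape instead of trusting hand-written strides. The sketch below is not part of the original answer: `munge_view` and `iter_window_chunks` are hypothetical helper names, and the chunked helper assumes the data source supports `len()` and basic row slicing that returns something `np.asarray` can convert (a NumPy array does; an h5py dataset is assumed to behave the same way).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # NumPy >= 1.20


def munge_view(data, backprop_window):
    """Read-only rolling view with the same shape and contents as munge()."""
    # Sliding along axis 0 appends the window axis last:
    # shape is (n - w + 1, features, w).
    windows = sliding_window_view(data, backprop_window, axis=0)
    # munge() stops one window early, so drop the last window, then
    # reorder the axes to (n - w, w, features).
    return windows[:-1].transpose(0, 2, 1)


def iter_window_chunks(data, backprop_window, chunk_rows=10_000):
    """Yield munge()-style windows for successive blocks of start rows.

    Assumption: `data` supports len() and slicing such as data[a:b].
    Only one block of rows is held in memory at a time.
    """
    n_windows = len(data) - backprop_window
    for start in range(0, n_windows, chunk_rows):
        stop = min(start + chunk_rows, n_windows)
        # Every window starting in [start, stop) lies within rows
        # start .. stop + backprop_window - 2 (slice end is exclusive).
        block = np.asarray(data[start: stop + backprop_window - 1])
        view = sliding_window_view(block, backprop_window, axis=0)
        yield view.transpose(0, 2, 1)


X_train = np.arange(100).reshape(20, 5)
view = munge_view(X_train, backprop_window=3)
print(view.shape)  # (17, 3, 5)
```

The result is a read-only view that shares memory with the input, so wrap it in `np.array(...)` if a writable copy is needed. The chunked generator is one way to keep memory bounded for very long series without Dask; whether it performs well against an HDF5 file depends on the chunk size and the file's layout.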

Published: 2023-07-09 14:52:00