Can memmap a pandas Series. What about a DataFrame?


It seems that I can memmap the underlying data for a pandas Series by creating a mmap'd ndarray and using it to initialize the Series.

import numpy as np
import pandas as pd

filename = "arr.dat"  # backing file for the memmap

def assert_readonly(iloc):
    try:
        iloc[0] = 999  # Should be non-editable
        raise Exception("MUST BE READ ONLY (1)")
    except ValueError as e:
        assert "read-only" in str(e)  # e.message does not exist in Python 3

# Original ndarray
n = 1000
_arr = np.arange(0, n, dtype=float)

# Convert it to a memmap
mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
mm[:] = _arr[:]
del _arr
mm.flush()
mm.flags['WRITEABLE'] = False  # Make immutable!

# Wrap as a Series
s = pd.Series(mm, name="a")
assert_readonly(s.iloc)

Success! It seems that s is backed by a read-only mem-mapped ndarray. Can I do the same for a DataFrame? The following fails:

df = pd.DataFrame(s, copy=False, columns=['a'])
assert_readonly(df["a"].iloc)  # Fails

The following succeeds, but only for one column:

df = pd.DataFrame(mm.reshape((len(mm), 1)), columns=['a'], copy=False)
assert_readonly(df["a"].iloc)  # Succeeds

... so I can make a DF without copying. However, this only works for one column, and I want many. The methods I've found for combining 1-column DFs (pd.concat(..., copy=False), pd.merge(copy=False), and so on) all result in copies.
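For what it's worth, one way to probe that copying claim is numpy's np.shares_memory, which compares underlying buffers. The following is my own sketch, not from the original post, and it assumes a pre-2.0 pandas where pd.concat still accepts copy=False:

import numpy as np
import pandas as pd

a = np.arange(10, dtype=float)
df1 = pd.DataFrame(a.reshape((len(a), 1)), columns=['a'], copy=False)
df2 = pd.DataFrame(a.reshape((len(a), 1)), columns=['b'], copy=False)

# concat consolidates the two float blocks into one, which copies:
out = pd.concat([df1, df2], axis=1, copy=False)
print(np.shares_memory(out['a'].values, a))  # expected: False (a copy), matching the claim above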

I have thousands of large columns stored as data files, of which I only ever need a few at a time. I was hoping to be able to place their mmap'd representations in a DataFrame as above. Is it possible?

The pandas documentation makes it a little difficult to guess what's going on under the hood here, although it does say a DataFrame "Can be thought of as a dict-like container for Series objects." I'm beginning to think this is no longer the case.

I'd prefer not to need HDF5 to solve this.

Best answer

You will get the behavior you want if you change the DataFrame constructor call to add the parameter copy=False. https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html

Edit: Additionally, you'll want to use the underlying ndarray (rather than the pandas Series).

OK ... after a lot of digging, here's what's going on. Pandas' DataFrame uses the BlockManager class to organize the data internally. Contrary to the docs, DataFrame is NOT a collection of Series but a collection of similarly-dtyped matrices. BlockManager groups all the float columns together, all the int columns together, and so on, and their memory (from what I can tell) is kept together.
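You can see this grouping directly by peeking at the internals. This is a sketch relying on private API (an assumption on my part: the manager attribute is _mgr in recent pandas releases and _data in older ones, and block objects expose dtype and shape):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(3.0),
                   'y': np.arange(3.0),
                   'n': np.array([1, 2, 3])})

mgr = getattr(df, '_mgr', None) or df._data  # the internal BlockManager
for block in mgr.blocks:
    print(block.dtype, block.shape)
# Expected: one float64 block of shape (2, 3) holding 'x' and 'y' together,
# and one int64 block of shape (1, 3) for 'n'.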

BlockManager can do this without copying the memory ONLY if a single ndarray matrix (a single type) is provided. Note that BlockManager (in theory) also supports not copying mixed-type data in its construction, since it may not be necessary to copy that input into same-typed chunks. However, the DataFrame constructor skips the copy ONLY if a single matrix is the data parameter.

In short, if you have mixed types or multiple arrays as input to the constructor, or provide a dict with even a single array, you are out of luck in Pandas, and DataFrame's default BlockManager will copy your data.
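To make that constructor behavior concrete, here is a small sketch of my own (using np.shares_memory as the probe, and assuming a pre-Copy-on-Write pandas): a single 2D matrix with copy=False shares its buffer, while dict input gets copied regardless:

import numpy as np
import pandas as pd

arr = np.zeros((5, 2))

# Single matrix + copy=False: the DataFrame wraps the buffer directly.
df_single = pd.DataFrame(arr, copy=False)
print(np.shares_memory(df_single.values, arr))    # expected: True

# Dict input: consolidation copies the data, copy=False notwithstanding.
df_dict = pd.DataFrame({'a': arr[:, 0], 'b': arr[:, 1]}, copy=False)
print(np.shares_memory(df_dict['a'].values, arr))  # expected: False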

In any case, one way to work around this is to force BlockManager not to consolidate by type, but to keep each column as a separate 'block'. So, with monkey-patching magic...

import pandas as pd
from pandas.core.internals import BlockManager, make_block

class BlockManagerUnconsolidated(BlockManager):
    """BlockManager that refuses to consolidate blocks by dtype."""
    def __init__(self, *args, **kwargs):
        BlockManager.__init__(self, *args, **kwargs)
        self._is_consolidated = False
        self._known_consolidated = False

    def _consolidate_inplace(self):
        pass  # do nothing instead of merging same-dtype blocks

    def _consolidate(self):
        return self.blocks  # hand the blocks back unmerged

def df_from_arrays(arrays, columns, index):
    def gen():
        _len = None
        p = 0
        for a in arrays:
            if _len is None:
                _len = len(a)
                assert len(index) == _len
            assert _len == len(a)
            # Each array becomes its own single-row block at position p
            yield make_block(values=a.reshape((1, _len)), placement=(p,))
            p += 1

    blocks = tuple(gen())
    mgr = BlockManagerUnconsolidated(blocks=blocks, axes=[columns, index])
    return pd.DataFrame(mgr, copy=False)

It would be better if DataFrame or BlockManager had a consolidate=False option (or assumed this behavior) when copy=False is specified.

To test:

import numpy as np
import pandas as pd

filename = "arr.dat"  # backing file for the memmap

def assert_readonly(iloc):
    try:
        iloc[0] = 999  # Should be non-editable
        raise Exception("MUST BE READ ONLY (1)")
    except ValueError as e:
        assert "read-only" in str(e)

# Original ndarray
n = 1000
_arr = np.arange(0, n, dtype=float)

# Convert it to a memmap
mm = np.memmap(filename, mode='w+', shape=_arr.shape, dtype=_arr.dtype)
mm[:] = _arr[:]
del _arr
mm.flush()
mm.flags['WRITEABLE'] = False  # Make immutable!

df = df_from_arrays(
    [mm, mm, mm],
    columns=['a', 'b', 'c'],
    index=range(len(mm)))

assert_readonly(df["a"].iloc)
assert_readonly(df["b"].iloc)
assert_readonly(df["c"].iloc)
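As an extra check beyond the read-only assertions (my addition, not part of the original answer, assuming numpy's np.shares_memory), you can confirm the columns really alias the memmap:

# Each column should be a view onto the memmap, not a copy:
print(np.shares_memory(df["a"].values, mm))  # expected: True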

It seems a little questionable to me whether there are really practical benefits to BlockManager requiring similarly typed data to be kept together -- most of the operations in Pandas are label-row-wise or per-column -- this follows from a DataFrame being a structure of heterogeneous columns that are usually only associated by their index. Though feasibly they keep one index per 'block', gaining a benefit if the index keeps offsets into the block (if this were the case, then they should group by sizeof(dtype), which I don't think is the case). Ho hum...

There was some discussion about a PR to provide a non-copying constructor, which was abandoned.

It looks like there are sensible plans to phase out BlockManager, so your mileage may vary.

Also see Pandas under the hood, which helped me a lot.
