I've seen some ways to read a formatted binary file into Pandas with Python; specifically, I'm using the code below, which reads the file with NumPy's fromfile using a structure specified via dtype.
```python
import numpy as np
import pandas as pd

input_file_name = 'test.hst'
input_file = open(input_file_name, 'rb')

header = input_file.read(96)
dt_header = np.dtype([('version', 'i4'), ('copyright', 'S64'),
                      ('symbol', 'S12'), ('period', 'i4'),
                      ('digits', 'i4'), ('timesign', 'i4'),
                      ('last_sync', 'i4')])
header = np.frombuffer(header, dt_header)  # np.fromstring is deprecated; frombuffer is the modern equivalent

dt_records = np.dtype([('ctm', 'i4'), ('open', 'f8'), ('low', 'f8'),
                       ('high', 'f8'), ('close', 'f8'), ('volume', 'f8')])
records = np.fromfile(input_file, dt_records)
input_file.close()

df_records = pd.DataFrame(records)
# Now, make some changes to the individual values of df_records
# and then write it back to a binary file
```
Now, my issue is how to write this back to a new file. I can't find any function in NumPy (nor in Pandas) that lets me specify exactly the bytes to use for each field when writing.
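One way to do this directly (a sketch, not taken from the original post): convert the DataFrame back into a structured NumPy array with the same record dtype, then dump its raw bytes with `ndarray.tofile()`. The sample values and the output name `out.hst` below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Record layout matching the question's dt_records
dt_records = np.dtype([('ctm', 'i4'), ('open', 'f8'), ('low', 'f8'),
                       ('high', 'f8'), ('close', 'f8'), ('volume', 'f8')])

# Hypothetical edited DataFrame standing in for df_records
df = pd.DataFrame({'ctm': [1, 2], 'open': [1.0, 2.0], 'low': [0.5, 1.5],
                   'high': [1.5, 2.5], 'close': [1.2, 2.2],
                   'volume': [10.0, 20.0]})

# Build a structured array with the exact field dtypes, column by column,
# so each field is written with the intended byte width
rec = np.empty(len(df), dtype=dt_records)
for name in dt_records.names:
    rec[name] = df[name].to_numpy()

# Dump the raw bytes; each record is dt_records.itemsize bytes (here 4 + 5*8 = 44)
rec.tofile('out.hst')
```

Because the write goes through the structured dtype, 'ctm' is stored as a 4-byte integer and the price fields as 8-byte floats, mirroring the read side exactly (the header, if needed, can be written the same way via `dt_header`).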
Answer:
Pandas now offers a wide variety of formats that are more stable than tofile(). tofile() is best for quick file storage where you do not expect the file to be used on a different machine where the data may have a different endianness (big-/little-endian).
| Format Type | Data Description | Reader | Writer |
|---|---|---|---|
| text | CSV | read_csv | to_csv |
| text | JSON | read_json | to_json |
| text | HTML | read_html | to_html |
| text | Local clipboard | read_clipboard | to_clipboard |
| binary | MS Excel | read_excel | to_excel |
| binary | HDF5 Format | read_hdf | to_hdf |
| binary | Feather Format | read_feather | to_feather |
| binary | Parquet Format | read_parquet | to_parquet |
| binary | Msgpack | read_msgpack | to_msgpack |
| binary | Stata | read_stata | to_stata |
| binary | SAS | read_sas | |
| binary | Python Pickle Format | read_pickle | to_pickle |
| SQL | SQL | read_sql | to_sql |
| SQL | Google Big Query | read_gbq | to_gbq |
For small to medium sized files, I prefer CSV, as properly-formatted CSV can store arbitrary string data, is human readable, and is as dirt-simple as any format can be while achieving the previous two goals.
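A minimal round-trip illustrating the CSV option (the column names and values here are invented for the example; an in-memory buffer stands in for a file on disk):

```python
import io
import pandas as pd

df = pd.DataFrame({'symbol': ['EURUSD'], 'close': [1.0835]})

# Write to CSV without the index column, then read it back
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)
```

Note that CSV stores everything as text, so read_csv re-infers dtypes on the way back in; exact binary widths (e.g. i4 vs i8) are not preserved, which is the trade-off for readability.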
At one time, I used HDF5, but if I were on Amazon, I would consider using parquet.
An example using to_hdf:
```python
df.to_hdf('tmp.hdf', 'df', mode='w')
df2 = pd.read_hdf('tmp.hdf', 'df')
```
I no longer favor the HDF5 format. It is fairly complex, which poses serious risks for long-term archival: it has a 150-page specification and only a single 300,000-line C implementation.
In contrast, as long as you are working exclusively in Python, the pickle format claims long term stability:
> The pickle serialization format is guaranteed to be backwards compatible across Python releases provided a compatible pickle protocol is chosen and pickling and unpickling code deals with Python 2 to Python 3 type differences if your data is crossing that unique breaking change language boundary.
However, pickles allow arbitrary code execution so care should be exercised with pickles of unknown origin.
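For completeness, the pickle round trip through Pandas is equally short (a sketch; the DataFrame contents and file name are invented, and a temporary directory is used so nothing is left behind):

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'ctm': [1, 2], 'close': [1.1, 1.2]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'records.pkl')
    df.to_pickle(path)           # uses pickle under the hood, dtypes preserved
    df2 = pd.read_pickle(path)
```

Per the caveat above, only unpickle files whose origin you trust, since loading a pickle can execute arbitrary code.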