Python can create large dicts in memory but cannot load them from a file


My work involves creating lookup tables with Python defaultdict(set). Building all of these dicts takes about 20 minutes and uses about 2 GB of RAM. I am trying to save time by writing all of these dicts to a .py file and then loading them back in with an import.

I'm writing the file with theFile.write("idToName = {}\n".format(dict(idToName))) to remove the set part of the defaultdict class. The file is about 500 MB and the dicts all work fine. However, when I try to import the file back in, it fills my RAM and locks everything up. What would be causing this difference in RAM usage?

Accepted answer


I guess that you are bumping up against the limit of your computer's RAM. When you write a giant dictionary into a .py file, you of course get a gigantic .py file as well. If you now try to import it, the Python interpreter needs to do more than just hold the dictionary in memory. It needs to open the source file, read it, compile it, write the bytecode representation (the compile result) to a .pyc file, then execute it, which finally creates the dictionary in memory again. All of this means holding the data in more than one format in memory at the same time.

I think your approach is flawed. Data should not be stored by writing .py files. It is much better to store it using a technique called serializing (sometimes also called marshalling), which in the case of Python is also called pickling because it can be done with the standard module pickle (or cPickle in Python 2 for better performance).

You should store your values (the dictionaries) using the pickle module after creating them. Then, when you need them again, read them back from the pickle store file:

import pickle

value = create_my_huge_dictionary()
with open('my_dictionary.pickle', 'wb') as store_file:
    pickle.dump(value, store_file)

Then later, maybe in a different script:

import pickle

with open('my_dictionary.pickle', 'rb') as store_file:
    value = pickle.load(store_file)

That leaves the topic of the defaultdict you want to strip. The method above won't do that: storing a defaultdict in a pickle file and reading the value back from there will recreate a defaultdict, not a dict.
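A quick sketch of that round-trip behavior, using pickle.dumps/pickle.loads in memory instead of a file (the key and value here are made up for illustration):

```python
import pickle
from collections import defaultdict

d = defaultdict(set)
d['alice'].add(42)

restored = pickle.loads(pickle.dumps(d))
print(type(restored))      # still a defaultdict, not a plain dict
print(restored['alice'])   # {42}
```

The default_factory (set) is pickled along with the data, so the restored object behaves exactly like the original.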

My proposal would be to live with that, because having a defaultdict instead of a dict probably won't hurt. But just in case this isn't feasible, you should consider not using defaultdicts in the first place. You can achieve their behavior with normal dicts using this pattern:

d = {}
d.setdefault('a', {}).setdefault('b', 4)
# d will now be {'a': {'b': 4}}
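Applied to the question's defaultdict(set) lookup tables, the same pattern would look like this (idToName is taken from the question; the sample ID and names are made up):

```python
# A plain dict standing in for defaultdict(set):
# setdefault inserts an empty set on first access, then returns it.
idToName = {}
idToName.setdefault(17, set()).add('alice')
idToName.setdefault(17, set()).add('bob')

# idToName == {17: {'alice', 'bob'}}
```

Unlike a defaultdict, this never creates entries implicitly on a plain lookup, so a later idToName[unknown_id] raises KeyError instead of silently adding an empty set.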

Of course you could try to convert your defaultdict into a dict before or after pickling it. You can do that by simply stating d = dict(d). But that almost certainly means holding the data twice in memory for a short time, which your RAM may not tolerate, and you would be stuck again.
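A minimal sketch of that conversion: only the top-level defaultdict wrapper is replaced, while the set values are carried over by reference (the key and value are made up for illustration):

```python
from collections import defaultdict

d = defaultdict(set)
d['x'].add(1)

plain = dict(d)        # copies key/value references into a plain dict
print(type(plain))     # <class 'dict'>
print(plain)           # {'x': {1}}
```

Because dict(d) only copies references, the transient memory cost is one extra hash table, not a second copy of every set; for very large tables even that can matter.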

If you use json for storing your dictionary (maybe it is simple enough for that), then the information that it once was a defaultdict is also gone after serializing it.
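A hedged sketch of the json route (the key and value are made up): note that JSON has no set type, so the set values from the question's defaultdict(set) would first have to be converted to lists.

```python
import json
from collections import defaultdict

d = defaultdict(set)
d['alice'].add(42)

# JSON cannot serialize sets, so convert them to (sorted) lists first
serializable = {k: sorted(v) for k, v in d.items()}
restored = json.loads(json.dumps(serializable))

print(type(restored))  # plain dict: the defaultdict-ness is gone
print(restored)        # {'alice': [42]}
```

The round trip loses both the defaultdict behavior and the set type, so this only fits if lists are acceptable on the reading side.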


Published: 2023-08-04 12:51:00
Link: https://www.elefans.com/category/jswz/34/1415996.html