Memory Limit Using Regex on massive text file

Updated: 2024-10-25 02:22:39

I have a text file of the following form:

    ('1', '2')
    ('3', '4')
    .
    .
    .

and I'm trying to get it to look like this:

    1 2
    3 4
    etc...

I've been trying to do this using the re module in python, by chaining together re.sub commands like so:

    for line in file:
        s = re.sub(r"\(", "", line)
        s1 = re.sub(r",", "", s)
        s2 = re.sub(r"'", "", s1)
        s3 = re.sub(r"\)", "", s2)
        output.write(s3)
    output.close()

It seems to work great until I get near the end of my output file; then it becomes inconsistent and stops working. I am thinking it is because of the sheer SIZE of the file I am working with; 300MB or approximately 12 million lines.

Can anyone help me confirm that I'm simply running out of memory? Or if it is something else? Suitable alternatives, or ways around this?

Best answer

You could simplify your code by using a simpler regex that finds all numbers in your input:

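    import re

    # Stream the input line by line so only one line is held in memory at a time.
    with open(file_name) as infile, open(output_name, 'w') as outfile:
        for line in infile:
            # Keep only the runs of digits and join them with single spaces.
            outfile.write(' '.join(re.findall(r'\d+', line)))
            outfile.write('\n')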
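
The same idea can be wrapped in small functions so it is easy to check on a few lines before running it over the full 300MB file. This is only a sketch: the names `NUMBER`, `extract_numbers`, and `convert` are made up for illustration, and it assumes every value in the input is a run of plain digits, as in the sample above. Iterating over the file object reads one line at a time, so memory use should depend on line length rather than file size, and the `with` block flushes and closes the output file when the loop ends.

```python
import re

# Hypothetical name: the pattern is compiled once and reused for every line.
NUMBER = re.compile(r"\d+")


def extract_numbers(line):
    """Return the runs of digits found in one line, joined by single spaces."""
    return " ".join(NUMBER.findall(line))


def convert(src_path, dst_path):
    """Rewrite src_path into dst_path one line at a time.

    Returns the number of lines written. Assumes each value is made of
    plain digits (no signs, decimals, or quoted text).
    """
    count = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(extract_numbers(line) + "\n")
            count += 1
    return count
```

For example, `extract_numbers("('1', '2')")` gives `'1 2'`. If the real data contains negative numbers or decimals, the pattern would need to change accordingly.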

Published: 2023-07-23 02:38:00

Article link: https://www.elefans.com/category/jswz/34/1226555.html