How can I use R to find malformed rows and fields in a file too big to read into RAM


I have a file that is larger than the total RAM on my computer by a factor of roughly 10. I am trying to get it read into an R object that will let me look at it and extract more manageable chunks of it. I have tried various approaches to this, but have run into different problems with each. I have a copy of the file in fixed-width format, and another as a CSV. I believe the files are otherwise identical. I have been able to read the first 5000 lines, and I have a tentative field width for each column in the fixed-width file and a tentative data class for each column in both files. At this point, I am not asking how to achieve my overall objective. Instead, I would like to rule out (or prove) malformation of the data as the source of my errors. If I had the whole file read in, I would have some idea how to do this. As it is, I do not.

So here is my question: Is there a way in R to read in fixed-width or CSV data line by line without reading the whole file into memory, and:

for the CSV, check:
• if the number of fields is always the same, and return the row numbers where it is not;
• if the data in each field is consistent with the column class, and return the row number and column number or name where it is not;

for the fixed-width, check:
• if the number of characters is always the same, and return the row number if it is not;
• if the data in each field is consistent with the column class, and return the row number and either the position of the first character in the field, the column number, or the column name, if it is not;

Finally, for both cases I would like the method to tell me how many rows it has examined in all (to make sure it got to the end of the file), and I would like a way to extract copies of arbitrary rows by row number, so that I can look at them (again without reading the whole file into memory).

For both the fixed-width and the CSV cases, the check for column classes has to be robust to some fields or characters being absent or malformed, i.e., it should still tell me something sensible about the row and still go on to look at the next row.

Maybe there is a package or function that does this? It seems like a fairly standard data-cleaning task, except for the large-file problem.

Any help would be greatly appreciated.

Sincerely, andrewH

Best answer

Option 1: I have limited experience with fwf data in "real life situations", but for large CSV files I have found the count.fields function to be very helpful. Try this:

table(cnts <- count.fields(paste0(path, filename), sep = ",", quote = "", comment.char = ""))

Then you can search in cnts for the line numbers with outlier values. For instance, if you noticed that there were only 10-20 field counts of 47 while the rest were 48, you might print out those locations:

which(cnts == 47)
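The question also asks for a way to pull out arbitrary rows by row number without reading the whole file. Here is a minimal sketch using base R's scan, which reads and discards the preceding lines rather than holding them in memory; the helper name get_row and the bad variable are illustrative, not part of the original answer:

bad <- which(cnts == 47)            # row numbers with the anomalous field count
get_row <- function(file, n) {      # illustrative helper: fetch one raw line by number
  scan(file, what = character(), sep = "\n",
       skip = n - 1, nlines = 1, quiet = TRUE)
}
get_row(paste0(path, filename), bad[1])

For very large files and many lookups this rescans from the top on each call; opening a connection once with file() and calling readLines on it repeatedly avoids that.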

Option 2: I'm pretty sure I have seen solutions to this using sed and grep at the system level for counting field separators. I cobbled this together from some *nix forums, and it gives me a table of field counts for a well-structured four-line file:

fct <- table(system("awk -F ',' '{print NF}' A.csv", intern = TRUE))
fct
# 3 4

It took 6 seconds to count the fields in a 1.2 million-record dataset, and none of the data were brought into R:

system.time(fct <- table(system("awk -F ',' '{print NF}' All.csv", intern = TRUE)))
#   user  system elapsed
#  6.597   0.215   6.552
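A small extension of the same idea reports the offending line numbers directly, again without bringing the data into R. This is a sketch assuming the expected field count is 48, as in the example under Option 1:

# NR is awk's current line number, NF its field count for that line
bad_lines <- as.integer(system("awk -F ',' 'NF != 48 {print NR}' All.csv", intern = TRUE))
length(bad_lines)   # how many malformed rows awk found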

You can get the total count of lines with:

sum(fct)
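Neither option covers the other half of the question, checking each field against its expected column class. Below is a rough chunked sketch in base R, assuming you already have a character vector col_classes (e.g. "numeric" or "character" per column) from the first 5000 rows; the function name check_classes and the chunk size are made up for illustration, and the naive strsplit does not handle quoted commas:

check_classes <- function(path, col_classes, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  row_offset <- 0L
  bad <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)   # stream the file in chunks
    if (length(lines) == 0L) break            # reached end of file
    fields <- strsplit(lines, ",", fixed = TRUE)
    for (i in seq_along(fields)) {
      f <- fields[[i]]
      for (j in seq_len(min(length(f), length(col_classes)))) {
        # a numeric field is malformed if non-empty but not coercible
        if (col_classes[j] == "numeric" && f[j] != "" &&
            is.na(suppressWarnings(as.numeric(f[j])))) {
          bad[[length(bad) + 1L]] <- c(row = row_offset + i, col = j)
        }
      }
    }
    row_offset <- row_offset + length(lines)
  }
  message("rows examined: ", row_offset)      # confirms the scan reached the end
  do.call(rbind, bad)                         # one (row, col) pair per bad field
}

For the fixed-width copy, the same readLines loop works: nchar(lines) gives the character count of each line, so row_offset + which(nchar(lines) != expected_width) flags records of the wrong length, and substr can slice out the individual fields before applying the same coercion test.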
