Fast reading (by chunks?) and processing of a file with dummy lines at regular intervals in R

Question

I have a file with regular numeric output (same format) of many arrays, each separated by a single line (containing some info). For example:

library(gdata)

nx = 150                              # ncol of my arrays
ny = 130                              # nrow of my arrays
myfile = 'bigFileWithRowsToSkip.txt'
niter = 10

for (i in 1:niter) {
  write(paste(i, 'is the current iteration'), myfile, append = TRUE)
  z = matrix(runif(nx*ny), nrow = ny)   # random numbers with dim c(ny, nx)
  write.fwf(z, myfile, append = TRUE, rownames = FALSE, colnames = FALSE)  # write in fixed-width format
}

With nx=5 and ny=2, I would have a file like this:

1 is the current iteration
0.08051668 0.19546772 0.908230985 0.9920930408 0.386990316
0.57449532 0.21774728 0.273851698 0.8199024885 0.441359571
2 is the current iteration
0.655215475 0.41899060 0.84615044 0.03001664 0.47584591
0.131544592 0.93211342 0.68300161 0.70991368 0.18837031
3 is the current iteration
...

I want to read the successive arrays as fast as possible to put them in a single data.frame (in reality, I have thousands of them). What is the most efficient way to proceed?

Given the output is regular, I thought readr would be a good idea (?). The only way I can think of is to do it manually in chunks, in order to eliminate the useless info lines:

library(readr)

ztot = numeric(niter*nx*ny)  # allocate a vector with the final size
# (the arrays will be vectorized and successively appended to each other)

for (i in 1:niter) {
  nskip = (i-1)*(ny+1) + 1   # number of lines to skip, including the info lines
  z = read_table(myfile, skip = nskip, n_max = ny, col_names = FALSE)
  z = as.vector(t(z))        # transpose so the values come out in row-major order
  ifirst = (i-1)*ny*nx + 1   # appropriate index into ztot
  ztot[ifirst:(ifirst + nx*ny - 1)] = z
}

# The arrays are actually spatial rasters. Compute the coordinates
# and put everything in a data.frame for future analysis:
x = rep(rep(1:nx, ny), niter)
y = rep(rep(1:ny, each = nx), niter)
myDF = data.frame(x = x, y = y, z = ztot)

But this is not fast enough: each read_table call has to re-read the file from the top just to skip the preceding lines, so the total cost grows roughly quadratically with niter. How can I achieve this faster?

Is there a way to read everything at once and delete the useless rows afterwards?

Alternatively, is there a reading function that accepts a vector of precise line locations as its skip argument, rather than a single number of initial rows to skip?
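Since the info lines sit at known, regular positions, one way to emulate such a vectorized skip is to keep a single connection open and alternate between discarding one info line and scanning ny data lines, so the file is traversed only once. A minimal sketch of this idea (not from the original post; variable names are illustrative, and it assumes the file layout shown above):

con <- file(myfile, open = "r")
ztot <- numeric(niter * nx * ny)
for (i in 1:niter) {
  readLines(con, n = 1)                            # discard the info line
  vals <- scan(con, what = numeric(), nlines = ny, quiet = TRUE)
  ztot[((i-1)*nx*ny + 1):(i*nx*ny)] <- vals        # scan reads row by row, already row-major
}
close(con)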

PS: Note that the reading operation is repeated over many files (with the same structure) located in different directories, in case that influences the solution...

EDIT: The following solution (reading all lines with readLines, removing the undesirable ones, and then processing the rest) is a faster alternative when niter is very high:

bylines <- readLines(myfile)
dummylines = seq(1, by = (ny+1), length.out = niter)
bylines = bylines[-dummylines]                # remove the dummy, undesirable lines
asOneChar <- paste(bylines, collapse = '\n')  # then process the output from readLines

library(data.table)
ztot <- fread(asOneChar)
ztot <- c(t(ztot))
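If niter is not known in advance, the same idea works by dropping lines that match the info-line pattern instead of computing their indices. A variant sketch (assuming, as in the generating code above, that every info line contains the phrase 'is the current iteration'):

library(data.table)
bylines <- readLines(myfile)
bylines <- bylines[!grepl('is the current iteration', bylines, fixed = TRUE)]  # drop info lines by pattern
ztot <- fread(paste(bylines, collapse = '\n'))
ztot <- c(t(ztot))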

A discussion on how to process the results from readLines can be found here.

Recommended Answer

Pre-processing the file with a command line tool (i.e., not in R) is actually way faster. For example with awk:

library(data.table)

tmpfile <- 'cleanFile.txt'
mycommand <- paste("awk '!/is the current iteration/'", myfile, '>', tmpfile)
# i.e. "awk '!/is the current iteration/' bigFileWithRowsToSkip.txt > cleanFile.txt"
system(mycommand)   # call the command from R

ztot <- fread(tmpfile)
ztot <- c(t(ztot))

Lines can be removed based on a pattern or on indices, for example. This was suggested by @Roland here.
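To illustrate the index-based variant: since the info lines occupy every (ny+1)-th line starting at line 1, awk can also drop them by line number (NR) rather than by pattern. A hedged sketch under that assumption (mycommand2 is an illustrative name):

# Drop every (ny+1)-th line, starting at line 1, using awk's line counter NR:
mycommand2 <- paste0("awk 'NR % ", ny + 1, " != 1' ", myfile, " > ", tmpfile)
# With ny = 130 this builds: "awk 'NR % 131 != 1' bigFileWithRowsToSkip.txt > cleanFile.txt"
system(mycommand2)
ztot <- fread(tmpfile)
ztot <- c(t(ztot))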
