Quickly reading very large tables as dataframes

Question

I have very large tables (30 million rows) that I would like to load as data frames in R. read.table() has a lot of convenient features, but it seems like there is a lot of logic in the implementation that would slow things down. In my case, I am assuming I know the types of the columns ahead of time, the table does not contain any column headers or row names, and does not have any pathological characters that I have to worry about.

I know that reading in a table as a list using scan() can be quite fast, e.g.:

datalist <- scan('myfile', sep=' ', list(url='', popularity=0, mintime=0, maxtime=0))

But some of my attempts to convert this to a dataframe appear to decrease the performance of the above by a factor of 6:

df <- as.data.frame(scan('myfile', sep=' ', list(url='', popularity=0, mintime=0, maxtime=0)))

Is there a better way of doing this? Or quite possibly a completely different approach to the problem?

Answer

An update, several years later

This answer is old, and R has moved on. Tweaking read.table to run a bit faster has precious little benefit. Your options are:

  • Using vroom from the tidyverse package for importing data from csv/tab-delimited files directly into an R tibble. See Hector's answer. (Usage sketches for these readers follow this list.)

  • Using fread in data.table for importing data from csv/tab-delimited files directly into R. See mnel's answer.

  • Using read_table in readr (on CRAN from April 2015). This works much like fread above. The readme in the link explains the difference between the two functions (readr currently claims to be "1.5-2x slower" than data.table::fread).

  • read.csv.raw from iotools provides a third option for quickly reading CSV files.

  • Trying to store as much data as you can in databases rather than flat files. (As well as being a better permanent storage medium, data is passed to and from R in a binary format, which is faster.) read.csv.sql in the sqldf package, as described in JD Long's answer, imports data into a temporary SQLite database and then reads it into R. See also: the RODBC package, and the reverse depends section of the DBI package page. MonetDB.R gives you a data type that pretends to be a data frame but is really a MonetDB underneath, increasing performance. Import data with its monetdb.read.csv function. dplyr allows you to work directly with data stored in several types of database.

  • Storing data in binary formats can also be useful for improving performance. Use saveRDS/readRDS (see below), the h5 or rhdf5 packages for HDF5 format, or write_fst/read_fst from the fst package.
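As a rough sketch of the first three options, assuming the space-separated, header-less four-column file from the question; the column names and the compact type strings ('c' = character, 'd' = double) are choices made for this example:

library(data.table)
# fread(): supply the separator, names and classes up front so nothing is guessed
dt <- fread('myfile', sep = ' ', header = FALSE,
            col.names = c('url', 'popularity', 'mintime', 'maxtime'),
            colClasses = c('character', 'numeric', 'numeric', 'numeric'))

library(vroom)
# vroom(): indexes the file and materializes columns lazily, on demand
tb <- vroom('myfile', delim = ' ',
            col_names = c('url', 'popularity', 'mintime', 'maxtime'),
            col_types = 'cddd')

library(readr)
# read_table(): readr's reader for whitespace-separated columns
tb2 <- read_table('myfile',
                  col_names = c('url', 'popularity', 'mintime', 'maxtime'),
                  col_types = 'cddd')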
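For the database route, a minimal sketch of read.csv.sql from sqldf; the popularity threshold is invented for illustration, and with header = FALSE the imported columns get default names (V1, V2, ...), so V2 below stands for the popularity column:

library(sqldf)
# Loads the file into a temporary SQLite database, runs the query there,
# and returns only the matching rows to R as a data frame.
df <- read.csv.sql('myfile', sql = 'select * from file where V2 > 100',
                   header = FALSE, sep = ' ')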
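And a sketch of the binary round-trip with fst (the .fst file name is arbitrary): pay the parsing cost once, then reload from the binary column store:

library(fst)
write_fst(df, 'myfile.fst')    # one-time conversion to fst's binary format
df2 <- read_fst('myfile.fst')  # later sessions skip text parsing entirely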

The original answer

    There are a couple of simple things to try, whether you use read.table or scan.

  • Set nrows = the number of records in your data (nmax in scan). (The sketch at the end of this answer pulls these tweaks together.)

  • Make sure that comment.char="" to turn off interpretation of comments.

  • Explicitly define the classes of each column using colClasses in read.table.

  • Setting multi.line=FALSE may also improve performance in scan.

If none of these things works, then use one of the profiling packages to determine which lines are slowing things down. Perhaps you can write a cut-down version of read.table based on the results.

    The other alternative is filtering your data before you read it into R.

Or, if the problem is that you have to read the data in regularly, then use these methods to read it in once, then save the data frame as a binary blob with saveRDS; the next time, you can retrieve it faster with readRDS.
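Pulling those tweaks together, a minimal sketch assuming the file and column layout from the question, with the 30-million row count standing in for the real record count:

# Tuned read.table: known row count, comment scanning off, explicit classes,
# and no header or row names to detect.
df <- read.table('myfile', sep = ' ', header = FALSE,
                 nrows = 30e6,            # exact count, or a slight overestimate
                 comment.char = '',
                 colClasses = c('character', 'numeric', 'numeric', 'numeric'),
                 col.names = c('url', 'popularity', 'mintime', 'maxtime'))

# Equivalent scan()-based version, with nmax set and multi-line records off.
datalist <- scan('myfile', sep = ' ',
                 what = list(url = '', popularity = 0, mintime = 0, maxtime = 0),
                 nmax = 30e6, multi.line = FALSE)

# Read once, then cache as a binary blob for fast reloads in later sessions.
saveRDS(df, 'myfile.rds')
df <- readRDS('myfile.rds')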
