Fast text file reading in C++

Updated: 2024-10-23 05:33:04

Question

I am currently writing a program in C++ which includes reading lots of large text files. Each has ~400,000 lines, with 4000 or more characters per line in extreme cases. Just for testing, I read one of the files using ifstream and the implementation offered by cplusplus. It took around 60 seconds, which is way too long. Now I was wondering, is there a straightforward way to improve reading speed?

edit: The code I am using is more or less this:

    string tmpString;
    ifstream txtFile(path);
    if (txtFile.is_open())
    {
        while (txtFile.good())
        {
            m_numLines++;
            getline(txtFile, tmpString);
        }
        txtFile.close();
    }

edit 2: The file I read is only 82 MB big. I mainly said that it could reach 4000 because I thought it might be necessary to know in order to do buffering.

edit 3: Thank you all for your answers, but it seems like there is not much room to improve given my problem. I have to use readline, since I want to count the number of lines. Instantiating the ifstream as binary didn't make reading any faster either. I will try to parallelize it as much as I can, that should work at least.

edit 4: So apparently there are some things I can do. Big thank you to sehe for putting so much time into this, I appreciate it a lot! =)

Recommended answer

Updates: Be sure to check the (surprising) updates below the initial answer

Memory mapped files have served me well [1]:

    #include <boost/iostreams/device/mapped_file.hpp> // for mmap
    #include <cstdint>   // for uintmax_t
    #include <cstring>   // for memchr
    #include <iostream>  // for std::cout

    int main()
    {
        boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
        auto f = mmap.const_data();
        auto l = f + mmap.size();

        uintmax_t m_numLines = 0;
        // Jump from one newline to the next with memchr; stop when none is left.
        while (f && f != l)
            if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
                m_numLines++, f++;

        std::cout << "m_numLines = " << m_numLines << "\n";
    }

This should be rather quick.

In case it helps you test this approach, here's a version using mmap directly instead of using Boost:

    #include <cstdint>   // for uintmax_t
    #include <cstdio>    // for perror
    #include <cstdlib>   // for exit
    #include <cstring>   // for memchr
    #include <iostream>

    // for mmap:
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    const char* map_file(const char* fname, size_t& length);

    int main()
    {
        size_t length;
        auto f = map_file("test.cpp", length);
        auto l = f + length;

        uintmax_t m_numLines = 0;
        // Jump from one newline to the next with memchr; stop when none is left.
        while (f && f != l)
            if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
                m_numLines++, f++;

        std::cout << "m_numLines = " << m_numLines << "\n";
    }

    void handle_error(const char* msg)
    {
        perror(msg);
        exit(255);
    }

    const char* map_file(const char* fname, size_t& length)
    {
        int fd = open(fname, O_RDONLY);
        if (fd == -1)
            handle_error("open");

        // obtain file size
        struct stat sb;
        if (fstat(fd, &sb) == -1)
            handle_error("fstat");

        length = sb.st_size;

        const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
        if (addr == MAP_FAILED)
            handle_error("mmap");

        // TODO close fd at some point in time, call munmap(...)
        return addr;
    }
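The TODO in map_file leaves the descriptor open and the mapping in place. One possible way to handle the cleanup is a small RAII wrapper, sketched below. The names MappedFile and count_newlines are hypothetical and not part of the original answer, and throwing std::runtime_error instead of calling handle_error is a choice made here only to keep the sketch short. It assumes a POSIX system and skips the mapping for empty files, since mmap rejects a length of zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
// POSIX headers for open/fstat/mmap/munmap/close:
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (not from the answer): owns a read-only mapping of a file
// and releases it in the destructor.
class MappedFile {
public:
    explicit MappedFile(const char* fname) {
        int fd = open(fname, O_RDONLY);
        if (fd == -1)
            throw std::runtime_error(std::string("open failed: ") + fname);

        struct stat sb;
        if (fstat(fd, &sb) == -1) {
            close(fd);
            throw std::runtime_error("fstat failed");
        }
        size_ = static_cast<std::size_t>(sb.st_size);

        if (size_ > 0) { // mmap rejects zero-length mappings
            void* addr = mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
            if (addr == MAP_FAILED) {
                close(fd);
                throw std::runtime_error("mmap failed");
            }
            data_ = static_cast<const char*>(addr);
        }
        close(fd); // the mapping stays valid after the descriptor is closed
    }

    ~MappedFile() {
        if (data_)
            munmap(const_cast<char*>(data_), size_);
    }

    // Non-copyable: exactly one owner per mapping.
    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    const char* data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    const char* data_ = nullptr;
    std::size_t size_ = 0;
};

// The same memchr loop as in the answer, written as a function.
std::uintmax_t count_newlines(const char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    std::uintmax_t lines = 0;
    const char* p = data;
    const char* end = data + size;
    while (p != end) {
        p = static_cast<const char*>(std::memchr(p, '\n', end - p));
        if (!p)
            break; // no newline left in the remaining bytes
        ++lines;
        ++p;       // continue just past the newline
    }
    return lines;
}
```

With this in place, the body of main could shrink to constructing a MappedFile and passing its data() and size() to count_newlines. Note that a final line without a trailing newline is not counted, which matches the behavior of the loop in the answer.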

Update

I found the last bit of performance I could squeeze out of this by looking at the source of GNU coreutils wc. To my surprise, the following (greatly simplified) code adapted from wc runs in about 84% of the time taken with the memory mapped file above:

    // Needs <fcntl.h>, <unistd.h>, <cstring> and <cstdint>;
    // handle_error is the function from the mmap listing above.
    static uintmax_t wc(char const *fname)
    {
        static const auto BUFFER_SIZE = 16*1024;
        int fd = open(fname, O_RDONLY);
        if (fd == -1)
            handle_error("open");

        /* Advise the kernel of our access pattern. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

        char buf[BUFFER_SIZE + 1];
        uintmax_t lines = 0;

        while (size_t bytes_read = read(fd, buf, BUFFER_SIZE))
        {
            if (bytes_read == (size_t)-1)
                handle_error("read failed");
            if (!bytes_read)
                break;

            // Count every '\n' in the block that was just read.
            for (char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
                ++lines;
        }

        close(fd);
        return lines;
    }
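For readers without POSIX read and posix_fadvise, the same block-reading idea can be sketched with std::ifstream. This is only an illustration: count_lines_chunked is a hypothetical name, the 16 KiB default simply mirrors BUFFER_SIZE above, and no timing claim is made for it. Whether it comes close to the read()-based version would have to be measured on the actual files.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not from the answer): counts '\n' bytes by reading the
// file in fixed-size blocks instead of line by line with getline.
std::uintmax_t count_lines_chunked(const std::string& path,
                                   std::size_t buffer_size = 16 * 1024) {
    assert(buffer_size > 0);

    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);

    std::vector<char> buf(buffer_size);
    std::uintmax_t lines = 0;
    for (;;) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::streamsize got = in.gcount(); // bytes actually read; short at EOF
        if (got <= 0)
            break;
        lines += static_cast<std::uintmax_t>(
            std::count(buf.data(), buf.data() + got, '\n'));
        if (!in)
            break; // the short final block has been counted; nothing left to read
    }
    return lines;
}
```

Because only newline bytes are counted, no per-line std::string is built, which is the main difference from the getline loop in the question. As with wc, a last line that lacks a trailing newline is not included in the count.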

[1] See e.g. the benchmark here: How to parse space-separated floats in C++ quickly?

Published: 2023-11-11 06:17:01
Article link: https://www.elefans.com/category/jswz/34/1577605.html