strstr on huge mmapped file

Updated: 2024-10-28 21:15:28

I open a huge (11 GB) file, mmap it into memory, and fail to find a string in it.

My code is:

    if ( (fd = open("l", O_RDONLY)) < 0 )
        err_sys("Cant open file");

    if ( fstat(fd, &statbuf) < 0 )
        err_sys("Cant get file size");

    printf("size is %ld\n", statbuf.st_size);

    if ( (src = mmap(0, statbuf.st_size, PROT_READ, MAP_SHARED, fd, 0)) == MAP_FAILED )
        err_sys("Cant mmap");

    printf("src pointer is at %ld\n", src);

    char * index = strstr(src, "bin/bash");

    printf("needle is at %ld\n", index);

It works on small files, but on the huge file strstr returns 0. What function should I use to search huge mmapped files?

The output is:

    size is 11111745740
    src pointer is at 140357526544384
    needle is at 0

Best answer

You should not use strstr() to search for text in a memory-mapped file:

- If the file is binary, it most likely contains null bytes, and strstr stops at the first one, so the search ends too soon. This is probably what you observe.
- If the file is pure text but does not contain a match, strstr keeps scanning past the end of the file. Nothing guarantees a terminating null byte there, so it may try to read unmapped memory, which is undefined behavior.

Instead, use memmem(), which has equivalent semantics but operates on raw memory with explicit lengths rather than on C strings. It is available on Linux and BSD systems (with glibc, define _GNU_SOURCE before including <string.h>):

    void *memmem(const void *p1, size_t size1, const void *p2, size_t size2);
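To make the difference concrete, here is a minimal sketch that runs both functions over the same small buffer containing an embedded null byte. The helper names offset_strstr and offset_memmem and the sample data are made up for illustration; only strstr() and memmem() come from the C library.

```c
#define _GNU_SOURCE          /* glibc documents memmem() under this macro */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: offset of needle as found by strstr(), or -1.
   The caller must guarantee that buf is null-terminated. */
static long offset_strstr(const char *buf, const char *needle)
{
    assert(buf != NULL && needle != NULL);
    const char *hit = strstr(buf, needle);
    return hit ? (long)(hit - buf) : -1;
}

/* Hypothetical helper: offset of needle as found by memmem(), or -1.
   The search covers exactly len bytes, null bytes included. */
static long offset_memmem(const char *buf, size_t len, const char *needle)
{
    assert(buf != NULL && needle != NULL);
    const char *hit = memmem(buf, len, needle, strlen(needle));
    return hit ? (long)(hit - buf) : -1;
}

/* Stand-in for binary file content: a null byte comes before the text
   we are looking for. The literal also has an implicit trailing '\0'. */
static const char sample[] = "ELF\0/bin/bash\n";
```

With this data, offset_strstr(sample, "bin/bash") sees only the string "ELF" and reports no match, while offset_memmem(sample, sizeof sample - 1, "bin/bash") searches all the bytes and finds the needle after the null byte.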

Note that your printf formats are also wrong: use %p for src and index, and print the offset as a ptrdiff_t or an unsigned long long:

    if ((fd = open("l", O_RDONLY)) < 0)
        err_sys("Cannot open file");

    if (fstat(fd, &statbuf) < 0)
        err_sys("Cannot get file size");

    printf("size is %llu\n", (unsigned long long)statbuf.st_size);

    if ((src = mmap(0, statbuf.st_size, PROT_READ, MAP_SHARED, fd, 0)) == MAP_FAILED)
        err_sys("Cannot mmap");

    printf("src pointer is at %p\n", (void*)src);

    /* memmem() takes explicit lengths, so null bytes in the file do not end the search */
    char *index = memmem(src, statbuf.st_size, "bin/bash", strlen("bin/bash"));

    printf("needle is at %p\n", (void*)index);

    if (index != NULL)
        printf("needle is at offset %llu\n", (unsigned long long)(index - src));
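The fragment above can be wrapped into a self-contained function. This is only a sketch under some assumptions: the name find_in_file is hypothetical, errors and "not found" are both reported as -1 for brevity, and size_t is assumed to be wide enough for the file size (true on 64-bit systems; an 11 GB file cannot be mapped in one piece on a 32-bit one).

```c
#define _GNU_SOURCE          /* glibc documents memmem() under this macro */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: byte offset of the first occurrence of needle in
   the file at path, or -1 if there is no match or an error occurred. */
long long find_in_file(const char *path, const char *needle)
{
    assert(path != NULL && needle != NULL);

    long long result = -1;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    /* mmap() rejects a zero length, so skip empty files. */
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        size_t len = (size_t)st.st_size;
        char *src = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (src != MAP_FAILED) {
            /* Bounded search: null bytes inside the file are ordinary data. */
            const char *hit = memmem(src, len, needle, strlen(needle));
            if (hit != NULL)
                result = (long long)(hit - src);
            munmap(src, len);
        }
    }
    close(fd);
    return result;
}
```

A call such as find_in_file("l", "bin/bash") would then return the offset that the fragment prints, and the mapping and file descriptor are released before returning.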

If memmem is not available on your platform, here is a simple implementation:

    #include <string.h>

    void *memmem(const void *haystack, size_t n1, const void *needle, size_t n2) {
        const unsigned char *p1 = haystack;
        const unsigned char *p2 = needle;

        if (n2 == 0)
            return (void*)p1;
        if (n2 > n1)
            return NULL;

        const unsigned char *p3 = p1 + n1 - n2 + 1;
        for (const unsigned char *p = p1; (p = memchr(p, *p2, p3 - p)) != NULL; p++) {
            if (!memcmp(p, p2, n2))
                return (void*)p;
        }
        return NULL;
    }

Published: 2023-08-06 03:52:00
Source: https://www.elefans.com/category/jswz/34/1444037.html