Java: ASCII random line file access with state

Updated: 2024-10-17 11:22:17

Is there a better [pre-existing optional Java 1.6] solution than creating a streaming file reader class that will meet the following criteria?

- Given an ASCII file of arbitrarily large size where each line is terminated by a \n
- For each invocation of some method readLine(), read a random line from the file
- And for the life of the file handle, no call to readLine() should return the same line twice

Update:

All lines must eventually be read

Context: the file's contents are created by Unix shell commands that produce a directory listing of all paths contained within a given directory; there are between millions and a billion files (which yields millions to a billion lines in the target file). If there is some way to randomly distribute the paths into a file at creation time, that is an acceptable solution as well.

Best answer

If the number of files is truly arbitrary, it seems like there could be an associated issue with tracking processed files in terms of memory usage (or I/O time, if tracking in files instead of an in-memory list or set). Solutions that keep a growing list of selected lines also run into timing-related issues.
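To put rough numbers on the memory concern, here is a minimal back-of-the-envelope sketch. The class and method names are hypothetical, and the figures cover raw payload only, ignoring JVM object overhead.

```java
// Hypothetical helper: raw storage needed to track which lines were already read.
class TrackingCost {

    /** Bytes for a bit set holding one "already read" bit per line. */
    static long bitSetBytes(long lines) {
        return (lines + 7) / 8;
    }

    /** Bytes for an array holding one 8-byte file offset per line. */
    static long offsetArrayBytes(long lines) {
        return lines * 8L;
    }
}
```

For a billion lines, `bitSetBytes(1000000000L)` gives 125,000,000 bytes (about 119 MiB), while `offsetArrayBytes(1000000000L)` gives 8,000,000,000 bytes (about 7.45 GiB). A set of boxed values such as `HashSet<Long>` would need considerably more than the raw offset array.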

I'd consider something along the lines of the following:

1. Create n "bucket" files. n could be determined based on something that takes into account the number of files and system memory. (If n is large, you could generate only a subset of the n buckets at a time to keep the number of open file handles down.)
2. Each file's name is hashed and goes into the appropriate bucket file, "sharding" the directory based on arbitrary criteria.
3. Read in the bucket file contents (just filenames) and process them as-is (randomness provided by the hashing mechanism), or pick rnd(n) and remove as you go, providing a bit more randomness.

Alternatively, you could pad and use the random access idea, removing indices/offsets from a list as they're picked.
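The bucket approach could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name `BucketShuffler`, the method names, the `bucket-i.txt` file naming, and the use of `String.hashCode()` as the hash are all assumptions, and it assumes n is small enough to keep every bucket open at once and that a single bucket fits in memory. It sticks to Java 1.6 syntax.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of hash-based sharding into bucket files.
class BucketShuffler {

    /** Streams the listing once, appending each line to one of n bucket files chosen by hash. */
    static List<File> shard(File listing, File bucketDir, int n) throws IOException {
        List<File> buckets = new ArrayList<File>();
        List<BufferedWriter> writers = new ArrayList<BufferedWriter>();
        BufferedReader in = new BufferedReader(new FileReader(listing));
        try {
            for (int i = 0; i < n; i++) {
                File f = new File(bucketDir, "bucket-" + i + ".txt");
                buckets.add(f);
                writers.add(new BufferedWriter(new FileWriter(f)));
            }
            String line;
            while ((line = in.readLine()) != null) {
                // Mask the sign bit so the bucket index is never negative.
                int b = (line.hashCode() & 0x7fffffff) % n;
                BufferedWriter w = writers.get(b);
                w.write(line);
                w.write('\n');
            }
        } finally {
            in.close();
            for (BufferedWriter w : writers) {
                w.close();
            }
        }
        return buckets;
    }

    /** Loads one bucket into memory and returns its lines in random order. */
    static List<String> readShuffled(File bucket, Random rnd) throws IOException {
        List<String> lines = new ArrayList<String>();
        BufferedReader in = new BufferedReader(new FileReader(bucket));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                lines.add(line);
            }
        } finally {
            in.close();
        }
        Collections.shuffle(lines, rnd);
        return lines;
    }
}
```

A caller would run `shard(...)` once, shuffle the returned bucket list with `Collections.shuffle`, and then work through `readShuffled(...)` one bucket at a time. Every line is returned exactly once, but the order is not a uniform permutation of the whole file, because lines that hash to the same bucket are always processed together.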

Published: 2023-07-14 18:56:00
Link: https://www.elefans.com/category/jswz/34/1106678.html