Finding random pairs in a large directory

Updated: 2024-10-17 19:28:18

I have ~5M CSV files stored in ~100,000 folders. Each folder contains roughly the same number of files, and there is always an even number of files in a folder. I need to find the paths to all these files and load them into a list in a somewhat strange order for a statistical modeling project.

In particular, I need the following to be upheld:

- Uniqueness: each file must appear in the list only once.
- Pairs: each file must be next to another file from the same folder (it may be next to two if that happens by chance).
- Randomness: the probability of any two files that are not "paired" being next to each other should be the same (i.e. simply iterating over all files would not work).
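The first two requirements can be checked mechanically on any candidate list. The following is a minimal sketch, assuming paths are plain "folder/file" strings like those in the example in this question; `check_order` is a hypothetical helper name, not part of any library. It does not test the randomness requirement, which cannot be verified from a single list.

```python
import os


def check_order(paths):
    """Return True if every path is unique and every path has at least
    one direct neighbour that lives in the same folder."""
    # Uniqueness: no path may appear twice.
    if len(set(paths)) != len(paths):
        return False

    # Pairs: compare each path's folder with the folders of its neighbours.
    folders = [os.path.dirname(p) for p in paths]
    for i, folder in enumerate(folders):
        left = i > 0 and folders[i - 1] == folder
        right = i + 1 < len(folders) and folders[i + 1] == folder
        if not (left or right):
            return False
    return True
```

Running it on a shuffled result is a cheap sanity check before the list is used for modeling.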

I've created an example below.

Files

Folder_1
  - File_A
  - File_B
  - File_C
  - File_D
Folder_2
  - File_E
  - File_F
  - File_G
  - File_H

Good Result (randomized, but upholds the rule of pairs)

paths = ['Folder_1/File_A', 'Folder_1/File_D', 'Folder_2/File_G', 'Folder_2/File_F', 'Folder_2/File_E', 'Folder_2/File_H', 'Folder_1/File_C', 'Folder_1/File_B']

A simple approach might be: "Pick a random folder, pick a random file in that folder and a random partner for it from the same folder. Save these picks in a list so they are not picked again. Repeat." However, that would take far too long. Can you recommend a good strategy for creating this list? The randomness requirement can be relaxed a bit if needed.

Best answer

One way to ensure that everything is random is to use random.shuffle, which shuffles a list in place. After shuffling the files of a folder you can simply pair each item with its neighbor, safe in the knowledge that the pairing is random. To achieve a result like your example, you can then shuffle the resulting list of pairs and flatten it. Here's an example:

    from random import shuffle

    # generate some sample directories, each holding four file names
    ls = [[str(i) + chr(j) for j in range(97, 101)] for i in range(5)]

    # shuffle the files within each directory, then pair up neighbours
    pairs = []
    for files in ls:
        shuffle(files)
        pairs += list(zip(files[1::2], files[::2]))

    # shuffle the list of pairs, then flatten it into a single list
    shuffle(pairs)
    flat = [item for pair in pairs for item in pair]
    print(flat)
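The same idea can be applied to a real directory tree. The sketch below is only an illustration under a few assumptions that the question does not confirm: all folders sit directly under a single root, the files sit directly inside those folders, and every folder holds an even number of files (with an odd count, `zip` would silently drop one file). The name `paired_paths` and its `root` and `seed` parameters are hypothetical.

```python
import os
import random


def paired_paths(root, seed=None):
    """Return all file paths under root, ordered as randomly shuffled
    same-folder pairs."""
    rng = random.Random(seed)  # private generator, so a seed is repeatable

    # Sort the listings so that a fixed seed always yields the same order.
    folders = sorted(e.path for e in os.scandir(root) if e.is_dir())

    pairs = []
    for folder in folders:
        files = sorted(e.path for e in os.scandir(folder) if e.is_file())
        rng.shuffle(files)                           # random pairing inside the folder
        pairs.extend(zip(files[::2], files[1::2]))   # neighbours become pairs

    rng.shuffle(pairs)                               # random order of the pairs
    return [path for pair in pairs for path in pair]
```

Each folder is listed once and each list is shuffled once, so the work grows roughly linearly with the number of files (plus the sorting), instead of repeatedly searching a list of earlier picks.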

Published: 2023-08-07 11:52:00
Article link: https://www.elefans.com/category/jswz/34/1464487.html