Suppose I have 1 GB of memory available. How can I find the duplicates among 10 billion URLs?
I saw one solution in the book "Cracking the Coding Interview". It suggests using a hash function in a first pass to split the URLs into 4000 files x.txt, where x = hash(u) % 4000. In a second pass, we can then check for duplicates within each x.txt file separately.
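A minimal sketch of that two-pass approach, assuming the URLs live in a plain text file with one URL per line (the file names and helper names here are illustrative, not from the book):

```python
import hashlib
import os

def partition_urls(url_file, out_dir, n_buckets=4000):
    """First pass: append each URL to the bucket file chosen by its hash.

    Uses an md5 prefix rather than Python's built-in hash() so the
    bucket assignment is stable across runs.  Files are reopened per
    URL for simplicity; real code would keep the handles open.
    """
    os.makedirs(out_dir, exist_ok=True)
    with open(url_file) as f:
        for line in f:
            url = line.strip()
            if not url:
                continue
            h = int.from_bytes(hashlib.md5(url.encode()).digest()[:4], "big")
            bucket = h % n_buckets
            with open(os.path.join(out_dir, f"{bucket}.txt"), "a") as out:
                out.write(url + "\n")

def find_duplicates_in_bucket(bucket_path):
    """Second pass: one bucket is assumed to fit in memory, so a set suffices."""
    seen, dups = set(), set()
    with open(bucket_path) as f:
        for line in f:
            url = line.strip()
            if url in seen:
                dups.add(url)
            seen.add(url)
    return dups
```

Because the same URL always hashes to the same bucket, all copies of a duplicate land in one file, so checking each bucket in isolation finds every duplicate.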
But how can I guarantee that each file stores only about 1 GB of URL data? There is a chance that some files end up storing much more URL data than others.

My solution is to apply the file-splitting trick iteratively until every file is small enough for the memory available to me.
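The iterative splitting could be sketched as a recursive re-partition of any file that is still too large. The helper and parameter names below are illustrative; the key detail is salting the hash with the recursion depth, since re-hashing with the same function would map every URL in an oversized bucket back into a single child bucket:

```python
import hashlib
import os

def split_until_small(path, max_bytes, depth=0, n_buckets=16):
    """Return a list of files, each at most max_bytes, covering all URLs
    in `path`.  Oversized files are redistributed with a depth-salted
    hash and the recursion repeats on each child."""
    if os.path.getsize(path) <= max_bytes:
        return [path]
    children = [open(f"{path}.{i}", "w") for i in range(n_buckets)]
    with open(path) as f:
        for line in f:
            url = line.strip()
            if url:
                digest = hashlib.md5(f"{depth}:{url}".encode()).digest()
                b = int.from_bytes(digest[:4], "big") % n_buckets
                children[b].write(url + "\n")
    for c in children:
        c.close()
    os.remove(path)  # the oversized file has been fully redistributed
    result = []
    for i in range(n_buckets):
        result.extend(split_until_small(f"{path}.{i}", max_bytes,
                                        depth + 1, n_buckets))
    return result
```

One caveat: if a single URL is repeated so many times that one bucket can never shrink below the limit, this recursion would not terminate, so a practical version would cap the depth and handle such a bucket with a streaming count instead.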
Is there any other way to do it?
Accepted Answer
If you don't mind a solution which requires a bit more code, you can do the following:
Calculate only the hashcodes. Each hashcode is exactly 4 bytes, so you have perfect control over the amount of memory occupied by each chunk of hashcodes. You can also fit many more hashcodes in memory than URLs, so you will have fewer chunks.

Find the duplicate hashcodes. Presumably, there will be far fewer than 10 billion of them. They might even all fit in memory.

Go through the URLs again, recomputing hashcodes, seeing if a URL has one of the duplicate hashcodes, and then comparing the actual URLs to rule out false positives due to hashcode collisions. (With 10 billion URLs, and with hashcodes having only 4 billion different values, there will be plenty of collisions.)
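The three steps above can be sketched as follows. This is a simplification: it keeps the hashcode counts in an in-memory Counter, whereas at the 10-billion scale the hashcodes would be spilled to disk in chunks and merged; the factory callable is a hypothetical stand-in for re-scanning the URL source:

```python
import hashlib
from collections import Counter

def hash32(url):
    """A 4-byte hashcode, as in the answer; md5 prefix used for stability."""
    return int.from_bytes(hashlib.md5(url.encode()).digest()[:4], "big")

def find_duplicate_urls(url_iter_factory):
    """url_iter_factory returns a fresh iterator over all URLs each time
    it is called, since the URLs must be scanned more than once."""
    # Steps 1-2: count hashcodes, keep only those that occur more than once.
    counts = Counter(hash32(u) for u in url_iter_factory())
    suspect_hashes = {h for h, c in counts.items() if c > 1}
    del counts  # free the bulk of the memory before the final scan
    # Step 3: compare the actual URLs behind suspect hashcodes, ruling out
    # false positives caused by hashcode collisions.
    seen, dups = set(), set()
    for u in url_iter_factory():
        if hash32(u) in suspect_hashes:
            if u in seen:
                dups.add(u)
            seen.add(u)
    return dups
```

The final scan only stores URLs whose hashcode is suspect, which is exactly why this scheme needs less memory than keeping every URL: the set `seen` holds true duplicates plus the comparatively rare collision victims.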