I tried answering this using external sort, but the interviewer replied that the complexity was too high: n*n*log(n), i.e. n^2 * log(n). Is there a better alternative?
To simplify the question: let's suppose we have 1000 elements to sort, with space allocated for only 100 elements. What is the best algorithm, one that takes less time than external sort?
I don't know which external sort you (or the interviewer) meant, but
my suggestion is a 10-way (in your case) merge:
- Split the file into chunks of MAX_MEM size (100 elements); this is O(1)
- Sort each chunk in memory and write it back out; this is O((n/max_mem) * max_mem * log(max_mem)) = O(n log(max_mem))
- Merge the sorted chunks; this is O(n log(n/max_mem)) using a min-heap, or O(n^2/max_mem) trivially (which may be faster in practice)
Concerning computation, this is O(n * (log(max_mem) + log(n/max_mem))) = O(n log(n)).
Concerning disk I/O, if all merging is done in one pass, this is only 2*n reads and 2*n writes. More generally, it's (1 + [depth of the merge tree]) * n.
All writes are sequential. The first read (while sorting the chunks) is sequential; the second (while merging) is sequential, interleaved from 10 files.
If there were much more data, you'd need a repeated or recursive merge (100 elements per chunk, then pick N chunks repeatedly). At that point, it's worth replacing the split+sort step with replacement selection, as described in @amit's answer, especially when the data is already almost sorted (then you may be able to skip the merging step completely).
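For reference, here is a hedged sketch of replacement selection (the technique from @amit's answer mentioned above); the function name and structure are illustrative, not canonical. It keeps a min-heap of max_mem elements and grows the current sorted run as long as incoming elements are not smaller than the last one emitted; on average the runs come out about twice the memory size, and nearly sorted input collapses into a single run.

```python
import heapq

def replacement_selection(items, max_mem=100):
    """Yield sorted runs from `items`, using at most ~max_mem elements of memory."""
    it = iter(items)
    heap = []
    # Fill the heap with the first max_mem elements.
    for x in it:
        heap.append(x)
        if len(heap) == max_mem:
            break
    heapq.heapify(heap)
    run, frozen = [], []  # `frozen` holds elements too small for the current run
    for x in it:
        smallest = heapq.heappop(heap)
        run.append(smallest)
        if x >= smallest:
            heapq.heappush(heap, x)   # still fits in the current run
        else:
            frozen.append(x)          # must wait for the next run
        if not heap:                  # current run exhausted; start the next one
            yield run
            run, heap, frozen = [], frozen, []
            heapq.heapify(heap)
    # Drain whatever is left.
    while heap:
        run.append(heapq.heappop(heap))
    if run:
        yield run
    if frozen:
        yield sorted(frozen)
```

Feeding the resulting runs into the k-way merge described earlier completes the sort; on already-sorted input the generator yields exactly one run, so no merge pass is needed.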
Note that a higher N may increase computation (very slightly, if you use the right structures), but it reduces the amount of disk I/O significantly (up to a point: if you merge too many chunks at once, you may run out of memory for the read buffers, causing unnecessary reads). Disk I/O is expensive; CPU cycles are not.