高效的算法找出所有最大的子集

编程入门行业动态更新时间:2024-10-14 00:24:37

本文介绍了高效的算法找出所有最大的子集的处理方法，对大家解决问题具有一定的参考价值，需要的朋友们下面随着小编来一起学习吧！问题描述

我有独特的套（psented位掩码重新$ P $）的集合，想消除在另一个元素的适当子集的所有元素。例如：

I have a collection of unique sets (represented as bit masks) and would like to eliminate all elements that are proper subsets of another element. For example:

input = [{1, 2, 3}, {1, 2}, {2, 3}, {2, 4}, {}] output = [{1, 2, 3}, {2, 4}]

我一直没能找到一个标准的算法对于这一点，甚至一个名字对于这个问题，所以我把它称为最大的子集，由于缺乏别的。下面是一个为O（n ^ 2）算法（Python中的具体性），假设 is_subset_func 为O（1）： 1

def eliminate_subsets(a, cardinality_func, is_subset_func): out = [] for element in sorted(a, reverse=True, key=cardinality_func): for existing in out: if is_subset_func(element, existing): break else: out.append(element) return out

有没有更有效的算法，希望为O（n log n）的或更好的？

Is there a more efficient algorithm, hopefully O(n log n) or better?

1 有关大小不变的位掩码，因为在我的情况属实， is_subset_func 只是元素和放大器;现有的==元素，它运行在固定的时间。

1 For bit masks of constant size, as is true in my case, is_subset_func is just element & existing == element, which runs in constant time.

推荐答案

假设你标记所有的输入集。

Suppose you label all the input sets.

A={1, 2, 3}, B={1, 2}, C={2, 3}, D={2, 4}, E={}

现在建立中间套，每个宇宙中的元素之一，含出现在那里集合的标签：

Now build intermediate sets, one per element in the universe, containing the labels of the sets where it appears:

1={A,B} 2={A,B,C,D} 3={A,C} 4={D}

现在为每个输入设置计算所有标签的集合元素的交集：

Now for each input set compute the intersection of all the label sets of its elements:

For A, {A,B} intesect {A,B,C,D} intersect {A,C} = {A} (*)

如果该交叉口包含一些标签比其本身等，那么它的那一套山子集。这里不存在其他元素，因此答案是否定的。但是，

If the intersection contains some label other than for the set itself, then it's s a subset of that set. Here there is no other element, so the answer is no. But,

For C, {A,B,C,D} intersect {A,C} = {A,C}, which means that it's a subset of A.

此方法的成本取决于套的实施。假设位图（如你暗示）。说有最大尺寸为M×N个输入集和| U |在宇宙中的项目。然后，中间套建筑产生| U |集大小n位的，所以O（| U | n）的时间对它们进行初始化。设置位需要O（nm）的时间。计算每个路口截至（*）上面需要O（MN）; O（MN ^ 2）所有。

The cost of this method depends on the implementation of sets. Suppose bitmaps (as you hinted). Say there are n input sets of maximum size m and |U| items in the universe. Then the intermediate set construction produces |U| sets of size n bits, so there is O(|U|n) time to initialize them. Setting the bits requires O(nm) time. Computing each intersection as at (*) above requires O(mn); O(mn^2) for all.

将所有这些结合在一起，我们有O（| U | N）+ O（NM）+ O（MN ^ 2）= O（| U | N + MN ^ 2）。使用相同的约定，你的所有对算法是O（| U | ^ 2 N ^ 2）。由于M＆LT; = | U |，这种算法是渐近速度更快。它可能会更快在实践中也因为没有详细的簿记添加常数因子。

Putting all these together we have O(|U|n) + O(nm) +O(mn^2) = O(|U|n + mn^2). Using the same conventions, your "all pairs" algorithm is O(|U|^2 n^2). Since m <= |U|, this algorithm is asymptotically faster. It's likely to be faster in practice as well because there's no elaborate bookkeeping to add constant factors.

增加：在线版

的OP问，如果有一个在线版本这种算法，即，一个在那里可以保持这组最大集增量作为输入组到达一个接酮。答案似乎是肯定的。中间套告诉我们快，如果一组新是的子集的之一已经出现。但是如何分辨快，如果它是一个的超的？并且，如果是的话，其中存在的最大集？对于在这种情况下，这些最大集不再最大，必须由新的取代。

The OP asked if there is an online version of this algorithm, i.e. one where the set of maximal sets can be maintained incrementally as input sets arrive one-by-one. The answer seems to be yes. The intermediate sets tell us quickly if a new set is a subset of one already seen. But how to tell quickly if it's a superset? And, if so, of which existing maximal sets? For in this case those maximal sets are no longer maximal and must be replaced by the new one.

关键是要注意的是 A 是 B 当且仅当 A'的一个超集是 B （打勾'表示设置补码）的一个子集。

The key is to note that A is a superset of B iff A' is a subset of B' (the tick' denoting set complement).

在此之后的灵感，我们像以前一样保持中间集。当一个新的输入设置取值到达时，做同样的测试上面所描述的：让 I（E）是中间输入元素集电子。那么这个测试

Following this inspiration, we maintain the intermediate set as before. When a new input set S arrives, do the same test as described above: Let I(e) be the intermediate set for input element e. Then this test is

For X = \intersect_{e \in S} . I(e), |X| > 0

（在这种情况下，它是大于零而不是一个如上因为取值还没有在我。）如果测试成功，那么新的集合是一个现有的最大集合的一个（可能是不正确的）子集，因此它可以被丢弃。

(In this case it's greater than zero rather than one as above because S is not yet in I.) If the test succeeds, then the new set is a (possibly improper) subset of an existing maximal set, so it can be discarded.

否则，我们必须添加取值作为一种新的最大集合，但在此之前，计算：

Otherwise we must add S as a new maximal set, but before doing this, compute:

Y = \intersect_{e \in S'} . I'(e) = ( \union_{e \in S'} . I(e) )'

在这里再次勾选设置的补充。工会的形式可能有点快来计算。是包含已经被所取代取值的最大集。他们必须从最大的收集和我被删除。最后补充取值作为最大集合和更新我与取值的元素。

where again the tick' is set complement. The union form may be a bit faster to compute. Y contains the maximal sets that have been superceded by S. They must be removed from the maximal collection and from I. Finally add S as a maximal set and update I with S's elements.

让我们的工作，通过我们的榜样。当 A 到达时，我们把它添加到我并

Let's work through our example. When A arrives, we add it to I and have

1={A} 2={A} 3={A}

在 B 到达时，我们发现 X = {A}交{A} = {A} ，所以抛出 B 走，继续。同样的情况对于 C 。当 D 到达时，我们发现 X = {A}交{} = {} ，所以继续用 Y = I'（1）相交我'（3）= {}相交{} 。这个正确的告诉我们，最大的集 A 不包含在 D ，所以没有什么可删除。但必须添加为新的最大集合，而我变为

When B arrives, we find X = {A} intersect {A} = {A}, so throw B away and continue. The same happens for C. When D arrives we find X = {A} intersect {} = {}, so continue with Y = I'(1) intersect I'(3) = {} intersect {}. This correctly tells us that maximal set A is not contained in D, so there is nothing to delete. But it must be added as a new maximal set, and I becomes

1={A} 2={A,D} 3={A} 4={D}

电子的到来不会引起变化。假定有一个新的组的到来则 F = {2，3，4，5} 。我们发现

The arrival of E causes no change. Posit the arrival then of a new set F={2, 3, 4, 5}. We find

X = {A} isect {A,D} isect {A} isect {D} isect {}

，所以我们不能把 F 离开。继续

Y = \intersect_{e in {1}} I'(e) = I'(1) = {D}

这告诉我们 D 是 F 的一个子集，因此应该被丢弃，而 ˚F添加，留下

This tells us D is a subset of F, so should be discarded while F is added, leaving

1={A} 2={A,F} 3={A,F} 4={F} 5={F}

该补的计算既棘手和漂亮的，由于该算法的在线本质。宇宙输入补充，只需要包括迄今为止看到输入元素。宇宙中间套仅由套在目前最大的集标签。对于许多输入流这一组的大小将稳定或随时间降低。

The computation of the complements is both tricky and nice due to the algorithm's online nature. The universe for input complements need only include input elements seen so far. The universe for intermediate sets consists only of tags of sets in the current maximal collection. For many input streams the size of this set will stabilize or decrease over time.

我希望这是有帮助的。

摘要

在这里工作的总体原则，是一个强大的想法，往往在算法设计的作物。它的反向映射。无论何时你发现自己做线性搜索，找到一个给定的属性的项目，考虑建设从属性映射回项目。往往是便宜建设这张图，它大大地减少搜索时间。在premier例子是置换地图 P [I] ，告诉你什么位置我'th元素将占据阵列置换后。如果你需要搜索出该结束了在给定位置的项目 A ，您必须搜索 P 的 A ，线性时间的操作。在另一方面，逆地图 PI ，使得 PI [P [I]] ==我需要不再计算比确实 P （因此它的成本是隐藏），但 PI [A] 生成所需的结果在一定的时间。

The general principle at work here is a powerful idea that crops of often in algorithm design. It's the reverse map. Whenever you find yourself doing a linear search to find an item with a given attribute, consider building a map from the attribute back to item. Often it is cheap to construct this map, and it strongly reduces search time. The premier example is a permutation map p[i] that tells you what position the i'th element will occupy after an array is permuted. If you need to search out the item that ends up in a given location a, you must search p for a, a linear time operation. On the other hand, an inverse map pi such that pi[p[i]] == i takes no longer to compute than does p (so its cost is "hidden"), but pi[a] produces the desired result in constant time.

实施原装海报

import collections import operator def is_power_of_two(n): """Returns True iff n is a power of two. Assumes n > 0.""" return (n & (n - 1)) == 0 def eliminate_subsets(sequence_of_sets): """Return a list of the elements of `sequence_of_sets`, removing all elements that are subsets of other elements. Assumes that each element is a set or frozenset and that no element is repeated.""" # The code below does not handle the case of a sequence containing # only the empty set, so let's just handle all easy cases now. if len(sequence_of_sets) <= 1: return list(sequence_of_sets) # We need an indexable sequence so that we can use a bitmap to # represent each set. if not isinstance(sequence_of_sets, collections.Sequence): sequence_of_sets = list(sequence_of_sets) # For each element, construct the list of all sets containing that # element. sets_containing_element = {} for i, s in enumerate(sequence_of_sets): for element in s: try: sets_containing_element[element] |= 1 << i except KeyError: sets_containing_element[element] = 1 << i # For each set, if the intersection of all of the lists in which it is # contained has length != 1, this set can be eliminated. out = [s for s in sequence_of_sets if s and is_power_of_two(reduce( operator.and_, (sets_containing_element[x] for x in s)))] return out

更多推荐

高效的算法找出所有最大的子集

本文发布于:2023-11-29 21:19:49，感谢您对本站的认可！

本文链接:https://www.elefans.com/category/jswz/34/1647660.html