如果我有可变数量的集合(我们叫数字 n ),每个集合最多具有 m 个元素,那么计算成对的最有效方法是什么对所有集合的交集?请注意,这不同于所有 n 集的交集.
If I have a variable number of sets (let's call the number n), which have at most m elements each, what's the most efficient way to calculate the pairwise intersections for all pairs of sets? Note that this is different from the intersection of all n sets.
例如,如果我有以下几组:
For example, if I have the following sets:
A={"a","b","c"} B={"c","d","e"} C={"a","c","e"}我希望能够找到:
intersect_AB={"c"} intersect_BC={"c", "e"} intersect_AC={"a", "c"}另一种可接受的格式(如果使事情变得更容易)将是给定集合中的项目到包含相同项目的集合的映射.例如:
Another acceptable format (if it makes things easier) would be a map of items in a given set to the sets that contain that same item. For example:
intersections_C={"a": {"A", "C"}, "c": {"A", "B", "C"} "e": {"B", "C"}}我知道这样做的一种方法是创建一个字典,将所有 n 个集合的并集中的每个值映射到出现它的集合的列表,然后遍历所有这些值创建上面的intersections_C之类的列表,但是我不确定随着 n 的增加和集合的大小变得太大而如何缩放.
I know that one way to do so would be to create a dictionary mapping each value in the union of all n sets to a list of the sets in which it occurs and then iterate through all of those values to create lists such as intersections_C above, but I'm not sure how that scales as n increases and the sizes of the set become too large.
一些其他背景信息:
推荐答案
这应该做您想要的
import random as RND import string import itertools as IT模拟一些数据
fnx = lambda: set(RND.sample(string.ascii_uppercase, 7)) S = [fnx() for c in range(5)]生成S中集合的索引列表,以便可以在下面更简洁地引用这些集合
idx = range(len(S))获取S中所有可能的项目对.但是,由于集合交集是可交换的,我们希望使用组合而不是置换
get all possible unique pairs of the items in S; however, since set intersection is commutative, we want the combinations rather than permutations
pairs = ITbinations(idx, 2)编写函数以执行设置交点
nt = lambda a, b: S[a].intersection(S[b])将此功能折叠到对上&将每个函数调用的结果键入其参数
res = dict([ (t, nt(*t)) for t in pairs ])下面的结果是按照OP中引用的第一个选项格式化的,是一个字典,其中 values 是两个序列的设定交集;每个值键到一个由这些序列的两个索引组成的元组
the result below, formatted per the first option recited in the OP, is a dictionary in which the values are the set intersections of two sequences; each values keyed to a tuple comprised of the two indices of those sequences
这个解决方案实际上只是两行代码:(i)计算排列; (ii)然后对每个排列应用一些函数,将返回值存储在结构化容器(键值)容器中
this solution, is really just two lines of code: (i) calculate the permutations; (ii) then apply some function over each permutation, storing the returned value in a structured container (key-value) container
此解决方案的内存占用量很小,但是您可以通过在最后一步返回生成器表达式来实现更好的效果,即
the memory footprint of this solution is minimal, but you can do even better by returning a generator expression in the last step, ie
res = ( (t, nt(*t)) for t in pairs )请注意,使用这种方法,对的序列和相应的交集都没有写到内存中,即,对和 res 都是迭代器.
notice that with this approach, neither the sequence of pairs nor the corresponding intersections have been written out in memory--ie, both pairs and res are iterators.
更多推荐
Python中的成对设置相交
发布评论