最长重复(k次)子字符串

编程入门 行业动态 更新时间:2024-10-15 12:29:05
本文介绍了最长重复(k次)子字符串的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧! 问题描述

我知道这是一个比较棘手的话题,但是我已经达到了可以从已经回答的问题中获得帮助的极限。

I know this is a somewhat beaten topic, but I have reached the limit of help I can get from what's already been answered.

这是针对 Rosalind项目问题LREP 。我试图在字符串中找到最长的k字串子字符串,并且提供了后缀树,这很好。我知道我需要用每个节点的后代叶子数注释后缀表,然后找到具有> = k 后代的节点,最后找到其中最深的节点节点。从理论上讲,我已经定了。

This is for the Rosalind project problem LREP. I'm trying to find the longest k-peated substring in a string and I've been provided the suffix tree, which is nice. I know that I need to annotate the suffix table with the number of descendant leaves from each node, then find nodes with >=k descendants, and finally find the deepest of those nodes. Theory-wise I'm set.

我从以下资源中获得了很多帮助(糟糕,我只能发布2条信息):

I've gotten a lot of help from the following resources (oops, I can only post 2):

  • 查找字符串中最长的重复序列
  • 深度优先搜索(Python)
  • Find longest repetitive sequence in a string
  • Depth-first search (Python)

我可以获得从根到每个叶子的路径,但是我可以没有弄清楚如何对树进行预处理,这样我就可以从每个节点中获取后代的数量。我有一个单独的算法,可以处理较小的序列,但是它的计算复杂度是指数级的,因此对于较大的算法,它花费的时间太长。我知道使用DFS,我应该能够以线性复杂度执行整个任务。为了使该算法起作用,我需要能够在不到5分钟的时间内获得〜40,000长度的字符串中最长的k-peat。

I can get the paths from the root to each leaf, but I can't figure out how to pre-process the tree in such a way that I can get the number of descendants from each node. I have a separate algorithm that works on small sequences but it's in exponential complexity, so for larger stuff it takes way too long. I know with a DFS I should be able to perform the whole task in linear complexity. For this algorithm to work I need to be able to get the longest k-peat of an ~40,000 length string in less than 5 minutes.

以下是一些示例数据(第一个行:序列,第二行: k ,后缀表格式:父子位置长度):

Here's some sample data (first line: sequence, second line: k, suffix table format: parent child location length):

CATACATAC$ 2 1 2 1 1 1 7 2 1 1 14 3 3 1 17 10 1 2 3 2 4 2 6 10 1 3 4 6 5 3 5 10 1 7 8 3 3 7 11 5 1 8 9 6 5 8 10 10 1 11 12 6 5 11 13 10 1 14 15 6 5 14 16 10 1

此输出应该是 CATAC 。

使用以下代码(从 LiteratePrograms )我已经能够获取路径,但是在较长的序列上解析路径仍然需要很长时间

With the following code (modified from LiteratePrograms) I've been able to get the paths, but it still takes a long time on longer sequences to parse out a path for each node.

#authors listed at #en.literateprograms/Depth-first_search_(Python)?action=history&offset=20081013235803 class Vertex: def __init__(self, data): self.data = data self.successors = [] def depthFirstSearch(start, isGoal, result): if start in result: return False result.append(start) if isGoal(start): return True for v in start.successors: if depthFirstSearch(v, isGoal, result): return True # No path was found result.pop() return False def lrep(seq,reps,tree): n = 2 * len(seq) - 1 v = [Vertex(i) for i in xrange(n)] edges = [(int(x[0]),int(x[1])) for x in tree] for a, b in edges: v[a].successors.append(v[b]) paths = {} for x in v: result = [] paths[x.data] = [] if depthFirstSearch(v[1], (lambda v: v.data == x.data), result): path = [u.data for u in result] paths[x.data] = path

我想做的是对树进行预处理,以找到满足后代>的节点。 = k 的要求,然后再找到深度。我什至还不了解如何计算深度。尽管我想我会拥有一些字典来跟踪路径中每个节点的深度,然后求和。

What I'd like to do is pre-process the tree to find nodes which satisfy the descendants >= k requirement prior to finding the depth. I haven't even gotten to how I'm going to calculate depth yet. Though I imagine I'll have some dictionary to keeps track of the depths of each node in the path then sums.

因此,我最重要的问题是: 如何预处理带有后代叶子的树?

So, my first-most-important question is: "How do I preprocess the tree with descendant leaves?"

我的次要的重要问题是:如何快速计算深度?

My second-less-important question is: "After that, how can I quickly compute depth?"

PS我应该说这不是作业或任何类似的事情。我只是一个生物化学家,试图通过一些计算挑战来拓展我的视野。

P.S. I should state that this isn't homework or anything of that sort. I'm just a biochemist trying to expand my horizons with some computational challenges.

推荐答案

关于基本字符串练习的好问题-操作。我不再记得后缀树了;)但是正如您所说的:从理论上讲,您已经设置好了。

Nice question for an excercise in basic string-operations. I didnt remember the suffix-tree anymore ;) But as you have stated: theory-wise, you are set.

该主题的 wikipedia-stub 有点混乱。您只需要知道自己是否是最外面的非叶节点,其子节点为 n> = k 。如果您在整个字符串中找到了从根节点到该节点的子字符串,则后缀树会告诉您,有 n 个可能的延续。因此必须在 n 个出现此字符串的位置。

The wikipedia-stub onto this topic is a bit confusing. You only need to know, if you are the outermost non-leaf-node with n >= k childs. If you found the substring from the root-node to this one in the whole string, the suffix-tree tells you, that there are n possible continuitations. So there must be n places, where this string occurs.

对此和许多类似问题的简单关键概念是进行深度优先搜索:在每个Node中,询问子元素的值并返回最大的给父母。根节点将获得最终结果。

A simple key-concept of this and many similar problems is to do a depth-first-search: In every Node, ask the child-elements for their value and return the maximum of it to the parent. The root-node will get the final result.

这些问题之间值的计算方式有所不同。在这里,每个节点都有三种可能性:

How the values are calculated differs between the problems. Here you have three possiblilitys for every node:

  • 该节点没有子节点。它是一个叶节点,结果无效。
  • 每个孩子都返回一个无效结果。其最后一个非叶子节点,结果为零(此节点之后没有更多字符)。如果此节点有 n 个子节点,则从根到此节点的每个边的缩进字符串在该节点中出现 n 次。整个字符串。如果我们至少需要 k 个节点和 k> n ,结果也无效。
  • 一个或多个叶子返回有效值。结果是返回值的最大值加上附加有边缘的字符串的长度。
  • The node have no childs. Its a leaf-node, the result is invalid.
  • Every child returns an invalid result. Its the last non-leaf-node, the result is zero (no more characters after this node). If this node have n childs, the concated string of every edge from the root to this node appears n times in the whole string. If we need at least k nodes and k > n, the result is also invalid.
  • One or more leafs return something valid. The result is the maximum of the returned value plus the length of the string attached the edge to it.
  • 当然,您还必须返回对应的结点。否则,您将知道最长的重复子字符串多长时间,但不知道它在哪里。

    Of course, you also have to return the correspondending node. Otherwise you will know, how long the longest repeated substring is but not where it is.

    您应该首先尝试自己编写代码。如果您想收集所有必要的信息,则构造树很简单,但并不简单。不过,这是一个简单的示例。请注意:如果输入某种程度上无效,则将放弃所有的完整性检查,并且一切都会可怕地失败。例如。不要尝试使用除根索引之外的任何其他根索引,不要将节点引用为父节点,而以前未将其引用为子节点,等等。还有很大的改进空间* 提示;)*

    You should try to code this by yourself first. Constructing the tree is simple but not trivial if you want to gather all necessary informations. Nevertheless here is a simple example. Please note: every sanity-checking is dropped out and everything will fail horribly, if the input is somehow invalid. E.g. do not try to use any other root-index than one, do not refere to nodes as a parent, which weren't referenced as a childs before, etc. Much room for improvement *hint;)*.

    class Node(object): def __init__(self, idx): self.idx = idx # not needed but nice for prints self.parent = None # edge to parent or None self.childs = [] # list of edges def get_deepest(self, k = 2): max_value = -1 max_node = None for edge in self.childs: r = edge.n2.get_deepest() if r is None: continue # leaf value, node = r value += len(edge.s) if value > max_value: # new best result max_value = value max_node = node if max_node is None: # we are either a leaf (no edge connected) or # the last non-leaf. # The number of childs have to be k to be valid. return (0, self) if len(self.childs) == k else None else: return (max_value, max_node) def get_string_to_root(self): if self.parent is None: return "" return self.parent.n1.get_string_to_root() + self.parent.s class Edge(object): # creating the edge also sets the correspondending # values in the nodes def __init__(self, n1, n2, s): #print "Edge %d -> %d [ %s]" % (n1.idx, n2.idx, s) self.n1, self.n2, self.s = n1, n2, s n1.childs.append(self) n2.parent = self nodes = {1 : Node(1)} # root-node string = sys.stdin.readline() k = int(sys.stdin.readline()) for line in sys.stdin: parent_idx, child_idx, start, length = [int(x) for x in line.split()] s = string[start-1:start-1+length] # every edge constructs a Node nodes[child_idx] = Node(child_idx) Edge(nodes[parent_idx], nodes[child_idx], s) (depth, node) = nodes[1].get_deepest(k) print node.get_string_to_root()

    更多推荐

    最长重复(k次)子字符串

    本文发布于:2023-11-29 19:58:56,感谢您对本站的认可!
    本文链接:https://www.elefans.com/category/jswz/34/1647475.html
    版权声明:本站内容均来自互联网,仅供演示用,请勿用于商业和其他非法用途。如果侵犯了您的权益请与我们联系,我们将在24小时内删除。
    本文标签:字符串   最长

    发布评论

    评论列表 (有 0 条评论)
    草根站长

    >www.elefans.com

    编程频道|电子爱好者 - 技术资讯及电子产品介绍!