Regular expression to match the longest repeated substring

Updated: 2024-10-26 08:28:40

Problem description

I'm writing a regular expression to check whether a string contains a substring made of at least two adjacent repeats of some pattern. I match the regex result against the original string: if they are equal, such a pattern exists. An example makes this clearer: 1010 contains the pattern 10, which occurs twice in a row. On the other hand, 10210 has no such pattern, because the two occurrences of 10 are not adjacent.

What's more, I need to find the longest possible pattern, and its length must be at least 1. The expression I wrote for this is ^.*?(.+)(\1).*?$. To find the longest pattern, I use a non-greedy match for whatever precedes the pattern, capture the pattern in group 1, and then match the same text again with a backreference to group 1. The rest of the string is then matched, so the overall match equals the original string. The problem is that the regex is eager to return as soon as it finds the first pattern, and does not take into account that I want the parts before and after to be as short as possible (leaving the repeated part as long as possible). So for the string 01011010 I correctly get a match, but the pattern stored in group 1 is just 01, although I'd expect 101.
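The behaviour described above can be reproduced outside the original setting. The following is a minimal sketch in Python (an assumption on my part, since the question does not name a language; Python's `re` module is a backtracking engine that handles this pattern the same way). The helper name `first_found_repeat` is hypothetical.

```python
import re

# The pattern from the question, unchanged.
QUESTION_PATTERN = re.compile(r"^.*?(.+)(\1).*?$")


def first_found_repeat(s):
    """Return group 1 of the question's pattern, or None if nothing matches."""
    m = QUESTION_PATTERN.match(s)
    return m.group(1) if m else None


if __name__ == "__main__":
    for s in ["56712453289", "22010110100", "01011010", "2323191919191919"]:
        # The lazy prefix .*? starts out empty, so the engine accepts the
        # first adjacent pair it can find at the leftmost position.
        print(s, "->", first_found_repeat(s))
```

The leftmost start position wins before pattern length is ever considered, which is why 01011010 yields 01 rather than 101.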

Since I believe I can't make the pattern "more greedy" or the junk before and after it even "more non-greedy", the only idea I can come up with is to make the regex less eager, but I'm not sure whether this is possible.

More examples:

56712453289 - no pattern - no match with former string
22010110100 - pattern 101 - match with former string (regex resulted in 22010110100 with 101 in group 1)
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 191919 - match

What I would get using the current expression (same strings used):

56712453289 - no pattern - no match
22010110100 - pattern 2 - match
5555555 - pattern 555 - match
1919191919 - pattern 1919 - match
191919191919 - pattern 191919 - match
2323191919191919 - pattern 23 - match
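For cross-checking the expected results listed in the question, here is a small reference sketch in Python. It is not a single-regex solution and not the method from the answer: it simply tries every candidate length from longest to shortest with a fixed-length pattern. The function name `longest_adjacent_repeat` is hypothetical.

```python
import re


def longest_adjacent_repeat(s):
    """Return the longest substring that is immediately followed by a copy
    of itself, or None if the string contains no such pair."""
    # A repeated pair of length n needs 2*n characters, so n <= len(s) // 2.
    for n in range(len(s) // 2, 0, -1):
        # Exactly n characters, then the same n characters again.
        m = re.search(r"(.{%d})\1" % n, s, re.DOTALL)
        if m:
            return m.group(1)
    return None


if __name__ == "__main__":
    for s in ["56712453289", "22010110100", "5555555",
              "1919191919", "191919191919", "2323191919191919"]:
        print(s, "->", longest_adjacent_repeat(s))
```

Because the lengths are tried in descending order, the first hit is the longest pair, regardless of where it starts in the string.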

Recommended answer

In Perl you can do it with a single expression with the help of (??{ code }):

use v5.10;    # enables say
$_ = '01011010';
# In list context the match returns its capture groups, here group 1.
say /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;

Output:

101

What happens here is that after matching a consecutive pair of substrings, a negative lookahead makes sure that no longer pair follows it.

To build the expression for the longer pair, the postponed subexpression construct (??{ code }) is used. It evaluates the code inside every time it is reached and uses the returned string as a pattern.

The subexpression it constructs has the form .+?(..{N,})\1, where N is the current length of the first capturing group (length($^N); $^N holds the current value of the most recently closed capturing group).

Thus the full expression would have the form:

(?=(.+)\1)(?!.+?(..{N,})\2)

with a magical N (and with the second capturing group not being a "real" capturing group of the original expression).
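To make the mechanism above more concrete, here is a rough emulation in Python, which has no (??{ code }) construct. It is only a sketch under two assumptions: that the engine tries start positions from left to right, and that it does not backtrack into the first lookahead once it has matched, so only the greedy pair at each position is considered. The names `PAIR_AT_START` and `longest_rep_emulated` are hypothetical.

```python
import re

# Plays the role of the body of the first lookahead: (?=(.+)\1)
PAIR_AT_START = re.compile(r"(.+)\1")


def longest_rep_emulated(s):
    """Return the first adjacent pair that is not followed by a longer pair."""
    for i in range(len(s)):
        m = PAIR_AT_START.match(s, i)
        if m is None:
            continue  # no adjacent pair starts at position i
        n = len(m.group(1))  # corresponds to length($^N)
        # Build the "longer pair" pattern with N filled in, as the postponed
        # subexpression does: at least one skipped character, then a pair
        # whose halves are at least N + 1 characters long.
        longer_pair = re.compile(r".+?(..{%d,})\1" % n)
        if longer_pair.match(s, i) is None:
            # The negative lookahead succeeds: no longer pair follows.
            return m.group(1)
    return None


if __name__ == "__main__":
    for s in ["01011010", "010110101000110001",
              "2323191919191919", "22010110100"]:
        print(s, "->", longest_rep_emulated(s))
```

The difference from the Perl version is that N is substituted by ordinary string formatting in a loop, whereas (??{ code }) lets the regex engine do the substitution during the match.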

Usage example:

use v5.10;

sub longest_rep {
    $_[0] =~ /(?=(.+)\1)(?!(??{ '.+?(..{' . length($^N) . ',})\1' }))/;
}

say longest_rep '01011010';
say longest_rep '010110101000110001';
say longest_rep '2323191919191919';
say longest_rep '22010110100';

Output:

101
10001
191919
101


Published: 2023-11-30 01:07:48

Article link: https://www.elefans.com/category/jswz/34/1648209.html