I have a string:

    string = u'11a2ee22b333c44d5e66e777e8888'

I want to find all chunks of k consecutive digits, where n <= k <= m.

Using regular expressions only: for example, with n=2 and m=3, using (?:\D|^)(\d{2,3})(?:\D|$):

    re.findall(u'(?:\D|^)(\d{2,3})(?:\D|$)',u'11a2ee22b333c44d5e66e777e8888')

gives this output:

    ['11', '333', '66']

Desired output:

    ['11', '22', '333', '44', '66', '777']

I know there are alternative solutions, such as:

    filter(lambda x: re.match('^\d{2,3}$', x), re.split(u'\D',r'11a2ee22b333c44d5e66e777e8888'))

which gives the desired output, but I want to know what is wrong with the first approach.
It seems re.findall scans the string in sequence and skips the part it has already matched, so what can be done?
Best answer
Note: The result you show in your question is not what I'm getting:

    >>> import re
    >>> re.findall(u'(?:\D|^)(\d{2,3})(?:\D|$)',u'11a2ee22b333c44d5e66e777e8888')
    [u'11', u'22', u'44', u'66']

It's still missing some of the matches you want, but not the same ones.
The problem is that even though non-capturing groups like (?:\D|^) and (?:\D|$) don't capture what they match, they still consume it.
This means that the match which yields '22' has actually consumed:

    e, with (?:\D|^): not captured (but still consumed)
    22, with (\d{2,3}): captured
    b, with (?:\D|$): not captured (but still consumed)

... so that b is no longer available to be matched before 333.
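
One way to observe this consumption is to look at the span of each whole match instead of only the captured group. The following is a minimal sketch, assuming Python 3 (hence raw strings and no u prefix); the names text, pattern, and spans are chosen here for illustration only.

```python
import re

text = '11a2ee22b333c44d5e66e777e8888'
pattern = re.compile(r'(?:\D|^)(\d{2,3})(?:\D|$)')

# For each match, keep the whole matched text (group 0), its (start, end)
# span in the string, and the digits captured by group 1.
spans = [(m.group(0), m.span(), m.group(1)) for m in pattern.finditer(text)]

for whole, span, digits in spans:
    print(whole, span, digits)
```

The match that captures '22' covers 'e22b' at span (5, 9), so the next search starts at index 9, directly on 333, with no non-digit left in front of it. The same thing happens to 777, whose preceding e was consumed by the match that captured '66'.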
You can get the result you want with lookbehind and lookahead syntax:

    >>> re.findall(u'(?<!\d)\d{2,3}(?!\d)',u'11a2ee22b333c44d5e66e777e8888')
    [u'11', u'22', u'333', u'44', u'66', u'777']

Here, (?<!\d) is a negative lookbehind, checking that the match is not preceded by a digit, and (?!\d) is a negative lookahead, checking that the match is not followed by a digit. Crucially, these constructions do not consume any of the string.
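
Since the question is phrased for general n and m, the lookaround pattern can be built from those bounds. This is only a sketch, assuming Python 3; find_digit_runs is a hypothetical helper name, not something from the re module.

```python
import re

def find_digit_runs(text, n, m):
    """Return every maximal run of digits whose length k satisfies n <= k <= m."""
    # The lookarounds assert that no digit is adjacent to the run,
    # without consuming any characters around it.
    pattern = r'(?<!\d)\d{%d,%d}(?!\d)' % (n, m)
    return re.findall(pattern, text)

runs = find_digit_runs('11a2ee22b333c44d5e66e777e8888', 2, 3)
print(runs)
```

With n=2 and m=3 this reproduces the list shown above. Runs that are too short (2, 5) or too long (8888) are rejected because a lookaround fails at every position where a 2- or 3-digit slice could start or end.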
The various lookahead and lookbehind constructions are described in the Regular Expression Syntax section of Python's re documentation.
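
As an illustration of those other constructions, the delimiters from the original pattern can also be kept as positive assertions. This variant is a sketch and not part of the original answer. Python's re requires a lookbehind to match strings of a fixed length, so the start-of-string case is written as a separate alternative instead of being placed inside the lookbehind.

```python
import re

text = '11a2ee22b333c44d5e66e777e8888'

# (?:(?<=\D)|^) : preceded by a non-digit, or at the start of the string.
# (?=\D|$)      : followed by a non-digit, or at the end of the string.
# Neither assertion consumes the neighbouring character.
positive = re.findall(r'(?:(?<=\D)|^)\d{2,3}(?=\D|$)', text)
print(positive)
```

For this input the positive form yields the same list as the negative lookarounds. The negative form is shorter because it needs no special case for the start or end of the string.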