I am struggling with nailing down a fairly complex regular expression to parse song titles with optional artist attribution from loosely-typed English. The user input comes from a single text field and the regex matches will be used to query a song database to get unique track IDs. I need to be able to get these matches:
    \1 = song title
    \2 = artist

while being fairly liberal in allowed formats.
Examples
The word "by" should split the string into song title and artist (but only on word boundaries), as should a comma with or without trailing whitespace:

    baby one more time by britney spears
    baby one more time, britney spears
    baby one more time,britney spears
        \1 = baby one more time
        \2 = britney spears
False positives like these are acceptable:
    down by the bay
        \1 = down
        \2 = the bay

    whatever people say i am, that's what i'm not
        \1 = whatever people say i am
        \2 = that's what i'm not
…assuming quotes can be used to mark a run of text as a song title explicitly:
    "down by the bay"
        \1 = down by the bay
        \2 not matched

    "whatever people say i am, that's what i'm not" by arctic monkeys
        \1 = whatever people say i am, that's what i'm not
        \2 = arctic monkeys
Single quotes should work too, but obviously not if they appear within the title:
    'whatever people say i am, that's what i'm not'
        \1 = whatever people say i am, that
        \2 = s what i'm not'
Additionally, if quotes are in use, the word "by" or a comma is optional:

    "down by the bay" raffi
        \1 = down by the bay
        \2 = raffi
However, if there are no quotes, and more than one "by", then only the last "by" should be used as a delimiter:
    down by the bay by raffi
        \1 = down by the bay
        \2 = raffi
Is this even possible with a single regex? Or would the more sane way be to split it up into multiple expressions? Either way, what might this look like?
Best answer
Here is an example, using C#:

    using System;
    using System.Text.RegularExpressions;

    var regex = @"^((""(?<title>[^""]+)""|'(?<title>[^']+)')(\s*,\s*|\s+by\s+)?|(?<title>.*)(\s*,\s*|\s+by\s+))\s*(?<artist>.*)$";

    var items = new []{
        "baby one more time by britney spears",
        "baby one more time, britney spears",
        "baby one more time,britney spears",
        "down by the bay",
        "whatever people say i am, that's what i'm not",
        "\"down by the bay\"",
        "\"whatever people say i am, that's what i'm not\" by arctic monkeys",
        "'whatever people say i am, that's what i'm not'",
        "\"down by the bay\" raffi",
        "down by the bay by raffi",
    };

    foreach (var item in items)
    {
        // ExplicitCapture keeps only the named groups (title, artist).
        var match = Regex.Match(item, regex, RegexOptions.ExplicitCapture);
        Console.WriteLine(match.Groups["title"] + " - " + match.Groups["artist"]);
    }

Output matches your specification, as far as I can tell:

    baby one more time - britney spears
    baby one more time - britney spears
    baby one more time - britney spears
    down - the bay
    whatever people say i am - that's what i'm not
    down by the bay -
    whatever people say i am, that's what i'm not - arctic monkeys
    whatever people say i am, that - s what i'm not'
    down by the bay - raffi
    down by the bay - raffi

You can actually make it better for the single-quote case by allowing apostrophes inside words:
    var regex = @"^((""(?<title>[^""]+)""|'(?<title>([^']|(?<=\w)'(?=\w))+)')(\s*,\s*|\s+by\s+)?|(?<title>.*)(\s*,\s*|\s+by\s+))\s*(?<artist>.*)$";

Which fixes this case:

    whatever people say i am, that's what i'm not -

Here's a commented version of the regex, which explains what each part does (it should be used with RegexOptions.ExplicitCapture|RegexOptions.IgnorePatternWhitespace):
    var regex = @"
        ^
        (
            (
                ""(?<title>[^""]+)""                    (?# matches a double-quote string )
                |
                '(?<title>([^']|(?<=\w)'(?=\w))+)'      (?# matches a single-quote string, allowing quotes in words )
            )
            (\s*,\s*|\s+by\s+)?                         (?# optionally follow these by ',' or 'by' )
            |
            (?<title>.*)(\s*,\s*|\s+by\s+)              (?# otherwise, everything up to ',' or 'by' )
        )
        \s*(?<artist>.*)                                (?# everything after this is the artist name )
        $";

Edit:

I've played around a bit with the PHP code, but I can't get it to use named capturing groups properly. Here is a version using unnamed capturing groups:

    $regex = "/^(?:(?:\"([^\"]+)\"|'((?:[^']|(?<=\\w)'(?=\\w))+)')(?:\\s*,\\s*|\\s+by\\s+)?|(.*)(?:\\s*,\\s*|\\s+by\\s+))\\s*(.*)\$/";
    preg_match($regex, '"down by the river"', $matches);
    print_r($matches);

The title will be in group 1, 2, or 3, depending on which alternative matched (double-quoted, single-quoted, or unquoted), and the artist in group 4.
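For readers outside C# and PHP, here is a rough Python port of the same idea. It is a sketch rather than part of the original answer. Python's `re` module does not let a pattern reuse one group name in several alternatives, so the three title alternatives get their own names (`title_dq`, `title_sq`, `title_bare`, all made up for this example) and a small helper, `parse`, picks whichever one matched. This is the same coalescing step the PHP version needs for groups 1 to 3.

```python
import re

# Verbose-mode port of the commented regex above.
# Group names title_dq / title_sq / title_bare are illustrative assumptions.
PATTERN = re.compile(r"""
    ^
    (?:
        (?:
            "(?P<title_dq>[^"]+)"                       # double-quoted title
          | '(?P<title_sq>(?:[^']|(?<=\w)'(?=\w))+)'    # single-quoted title, apostrophes inside words allowed
        )
        (?:\s*,\s*|\s+by\s+)?                           # separator is optional after a quoted title
      |
        (?P<title_bare>.*)(?:\s*,\s*|\s+by\s+)          # unquoted: greedy, so the last separator wins
    )
    \s*(?P<artist>.*)                                   # everything after that is the artist
    $
""", re.VERBOSE)


def parse(text):
    """Return (title, artist), or None when the input has no quotes and no separator."""
    m = PATTERN.match(text)
    if m is None:
        return None
    # Exactly one of the three title groups participates in a successful match.
    title = m.group("title_dq") or m.group("title_sq") or m.group("title_bare")
    return title, m.group("artist")


if __name__ == "__main__":
    for sample in [
        "baby one more time by britney spears",
        '"down by the bay" raffi',
        "down by the bay by raffi",
    ]:
        print(sample, "->", parse(sample))
```

Note that an unquoted input with neither a comma nor a standalone "by" does not match at all, so `parse` returns None there; a caller would probably want to fall back to treating the whole string as the title.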