Parsing Natural Language Music Citations Using Regex

Updated: 2024-10-10 08:17:49

I am struggling with nailing down a fairly complex regular expression to parse song titles with optional artist attribution from loosely-typed English. The user input comes from a single text field and the regex matches will be used to query a song database to get unique track IDs. I need to be able to get these matches:

\1 = song title \2 = artist

while being fairly liberal in allowed formats.

Examples

The word "by" should split the string into song title and artist (but only on word boundaries), as should a comma with or without trailing whitespace:

baby one more time by britney spears

baby one more time, britney spears

baby one more time,britney spears

\1 = baby one more time \2 = britney spears

False positives like these are acceptable:

down by the bay

\1 = down \2 = the bay

whatever people say i am, that's what i'm not

\1 = whatever people say i am \2 = that's what i'm not

…assuming quotes can be used to mark a run of text as a song title explicitly:

"down by the bay"

\1 = down by the bay \2 not matched

"whatever people say i am, that's what i'm not" by arctic monkeys

\1 = whatever people say i am, that's what i'm not \2 = arctic monkeys

Single quotes should work too, but obviously not if they appear within the title:

'whatever people say i am, that's what i'm not'

\1 = whatever people say i am, that \2 = s what i'm not'

Additionally, if quotes are in use, the word "by" or a comma is optional:

"down by the bay" raffi

\1 = down by the bay \2 = raffi

However, if there are no quotes, and more than one "by", then only the last "by" should be used as a delimiter:

down by the bay by raffi

\1 = down by the bay \2 = raffi

Is this even possible with a single regex? Or would the more sane way be to split it up into multiple expressions? Either way, what might this look like?

Accepted answer

Here is an example, using C#:

    using System;
    using System.Text.RegularExpressions;

    // Quoted titles (double or single) with an optional delimiter,
    // or an unquoted title followed by a mandatory ',' or 'by'.
    var regex = @"^((""(?<title>[^""]+)""|'(?<title>[^']+)')(\s*,\s*|\s+by\s+)?|(?<title>.*)(\s*,\s*|\s+by\s+))\s*(?<artist>.*)$";

    var items = new[]
    {
        "baby one more time by britney spears",
        "baby one more time, britney spears",
        "baby one more time,britney spears",
        "down by the bay",
        "whatever people say i am, that's what i'm not",
        "\"down by the bay\"",
        "\"whatever people say i am, that's what i'm not\" by arctic monkeys",
        "'whatever people say i am, that's what i'm not'",
        "\"down by the bay\" raffi",
        "down by the bay by raffi",
    };

    foreach (var item in items)
    {
        // ExplicitCapture keeps only the named groups (title, artist).
        var match = Regex.Match(item, regex, RegexOptions.ExplicitCapture);
        Console.WriteLine(match.Groups["title"] + " - " + match.Groups["artist"]);
    }

Output matches your specification, as far as I can tell:

    baby one more time - britney spears
    baby one more time - britney spears
    baby one more time - britney spears
    down - the bay
    whatever people say i am - that's what i'm not
    down by the bay - 
    whatever people say i am, that's what i'm not - arctic monkeys
    whatever people say i am, that - s what i'm not'
    down by the bay - raffi
    down by the bay - raffi

You can actually make it better for the single-quote case by allowing apostrophes inside words:

var regex = @"^((""(?<title>[^""]+)""|'(?<title>([^']|(?<=\w)'(?=\w))+)')(\s*,\s*|\s+by\s+)?|(?<title>.*)(\s*,\s*|\s+by\s+))\s*(?<artist>.*)$";

Which fixes this case:

whatever people say i am, that's what i'm not -
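To see the apostrophe rule in isolation, here is a minimal sketch in Python rather than C#. It keeps only the single-quoted branch of the pattern above; the names `QUOTED` and `quoted_title` are made up for this illustration. The lookbehind and lookahead accept an apostrophe only when it sits between two word characters, so the closing quote is never swallowed.

```python
import re

# Single-quoted title. An apostrophe inside the title is accepted only when
# it has a word character on both sides (as in "that's" or "i'm").
QUOTED = re.compile(r"'((?:[^']|(?<=\w)'(?=\w))+)'")


def quoted_title(text):
    """Return the single-quoted title at the start of text, or None."""
    m = QUOTED.match(text)
    return m.group(1) if m else None


print(quoted_title("'whatever people say i am, that's what i'm not'"))
# whatever people say i am, that's what i'm not
```

An apostrophe that touches a space, as in 'rock 'n' roll', is still treated as a closing quote, so this is a heuristic rather than a complete fix.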

Here's a commented version of the regex, which explains what each part does (it should be used with RegexOptions.ExplicitCapture|RegexOptions.IgnorePatternWhitespace):

    var regex = @"
        ^
        (
            (
                ""(?<title>[^""]+)""                    (?# matches a double-quote string )
              |
                '(?<title>([^']|(?<=\w)'(?=\w))+)'      (?# matches a single-quote string, allowing quotes in words )
            )
            (\s*,\s*|\s+by\s+)?                         (?# optionally follow these by ',' or 'by' )
          |
            (?<title>.*)(\s*,\s*|\s+by\s+)              (?# otherwise, everything up to ',' or 'by' )
        )
        \s*(?<artist>.*)                                (?# everything after this is the artist name )
        $";
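The "only the last by counts" requirement is not handled by any special construct; it falls out of the greedy `.*` in the unquoted branch. A small Python sketch of that branch on its own may make this clearer (`UNQUOTED` and `split_unquoted` are illustrative names, not part of the answer's code):

```python
import re

# Unquoted branch only. The greedy (.*) first swallows the whole string and
# then backtracks just far enough for a delimiter to match, so the LAST
# ", " or " by " in the input is the one that splits title from artist.
UNQUOTED = re.compile(r"^(.*)(?:\s*,\s*|\s+by\s+)(.*)$")


def split_unquoted(text):
    """Return (title, artist) for unquoted input, or None without a delimiter."""
    m = UNQUOTED.match(text)
    return (m.group(1), m.group(2)) if m else None


print(split_unquoted("down by the bay by raffi"))
# ('down by the bay', 'raffi')
```

Because `\s+by\s+` needs whitespace on both sides, a "by" embedded in a longer word does not act as a delimiter.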

Edit:

I've played around a bit with the PHP code, but I can't get it to use named capturing groups properly. Here is a version using unnamed capturing groups:

    // Backslashes are doubled throughout because the pattern is a double-quoted PHP string.
    $regex = "/^(?:(?:\"([^\"]+)\"|'((?:[^']|(?<=\\w)'(?=\\w))+)')(?:\\s*,\\s*|\\s+by\\s+)?|(.*)(?:\\s*,\\s*|\\s+by\\s+))\\s*(.*)\$/";
    preg_match($regex, '"down by the river"', $matches);
    print_r($matches);

The title will be in group 1, 2, or 3, and the artist in group 4.
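The same numbered-group layout carries over to other engines that do not allow one group name to be used twice. As a rough sketch, here is a Python port: Python's `re` module rejects duplicate group names, so the title again lands in group 1, 2, or 3 and has to be coalesced by hand. The helper name `parse_citation` and the choice to return `None` for a missing artist are assumptions made for this example.

```python
import re

PATTERN = re.compile(r"""
    ^
    (?:
        (?:
            "([^"]+)"                        # group 1: double-quoted title
          |
            '((?:[^']|(?<=\w)'(?=\w))+)'     # group 2: single-quoted title
        )
        (?:\s*,\s*|\s+by\s+)?                # optional ',' or 'by' after quotes
      |
        (.*)(?:\s*,\s*|\s+by\s+)             # group 3: unquoted title + delimiter
    )
    \s*(.*)                                  # group 4: artist
    $
""", re.VERBOSE)


def parse_citation(text):
    """Return (title, artist_or_None), or None if the input does not match."""
    m = PATTERN.match(text)
    if m is None:
        return None
    # Exactly one of groups 1-3 participates in any successful match.
    title = next(g for g in m.group(1, 2, 3) if g is not None)
    artist = m.group(4) or None  # empty artist means "not given"
    return title, artist


for line in ['"down by the bay" raffi', "down by the bay by raffi", '"down by the bay"']:
    print(parse_citation(line))
```

Unquoted input with no comma and no "by" does not match at all in this sketch, which mirrors the original pattern; a caller would probably treat the whole string as the title in that case.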

Published: 2023-08-07 16:13:00
Article link: https://www.elefans.com/category/jswz/34/1465306.html