Antlr3 matching tokens without whitespace

Updated: 2024-10-18 00:30:41

Given the input "term >1", the number (1) and the comparison operator (>) should generate separate nodes in an AST. How can this be achieved?

In my tests, matching only occurred if the operator and "1" were separated by a space, like so: "term < 1".

Current grammar:

startExpression : orEx;

expressionLevel4 : LPARENTHESIS! orEx RPARENTHESIS! | atomicExpression;
expressionLevel3 : (fieldExpression) | expressionLevel4 ;
expressionLevel2 : (nearExpression) | expressionLevel3 ;
expressionLevel1 : (countExpression) | expressionLevel2 ;

notEx : (NOT^)? expressionLevel1;
andEx : (notEx -> notEx) (AND? a=notEx -> ^(ANDNODE $andEx $a))*;
orEx : andEx (OR^ andEx)*;

countExpression
  : COUNT LPARENTHESIS WORD RPARENTHESIS RELATION NUMBERS
    -> ^(COUNT WORD RELATION NUMBERS);
nearExpression
  : NEAR LPARENTHESIS (WORD|PHRASE) MULTIPLESEPERATOR (WORD|PHRASE) MULTIPLESEPERATOR NUMBERS RPARENTHESIS
    -> ^(NEAR WORD* PHRASE* ^(NEARDISTANCE NUMBERS));
fieldExpression
  : WORD PROPERTYSEPERATOR WORD
    -> ^(FIELDSEARCH ^(TARGETFIELD WORD) WORD );
atomicExpression : WORD | PHRASE ;

fragment NUMBER : ('0'..'9');
fragment CHARACTER : ('a'..'z'|'A'..'Z'|'0'..'9'|'*'|'?');
fragment QUOTE : ('"');
fragment LESSTHEN : '<';
fragment MORETHEN: '>';
fragment EQUAL: '=';
fragment SPACE : ('\u0009'|'\u0020'|'\u000C'|'\u00A0');
fragment UNICODENOSPACES: ('\u0021'..'\u0027'|'\u0030'..'\u0039'|'\u003B'..'\u007E'|'\u00A1'..'\uFFFF');
//fragment UNICODENOSPACES : ('\u0021'..'\u0039'|'\u003B'..'\u007E'|'\u00A1'..'\uFFFF');

LPARENTHESIS : '(';
RPARENTHESIS : ')';
AND : ('A'|'a')('N'|'n')('D'|'d');
OR : ('O'|'o')('R'|'r');
ANDNOT : ('A'|'a')('N'|'n')('D'|'d')('N'|'n')('O'|'o')('T'|'t');
NOT : ('N'|'n')('O'|'o')('T'|'t');
COUNT:('C'|'c')('O'|'o')('U'|'u')('N'|'n')('T'|'t');
NEAR:('N'|'n')('E'|'e')('A'|'a')('R'|'r');
PROPERTYSEPERATOR : ':';
MULTIPLESEPERATOR : ',';
WS : (SPACE) { $channel=HIDDEN; };
RELATION : LESSTHEN? MORETHEN? EQUAL?;
NUMBERS : (NUMBER)+;
PHRASE : (QUOTE)(CHARACTER)+((SPACE)+(CHARACTER)+)+(QUOTE);
WORD : (UNICODENOSPACES)+;

Best answer

That is because your WORD rule matches too much: it also matches ">", so when ">1" is written without a space, these two characters are tokenized as a single WORD token.

Whenever I'm unsure what my lexer is doing, I simply let the parser match zero or more tokens of any type, and print the type and text of every token:

parse : (t=. {System.out.printf("\%-15s '\%s'\n", tokenNames[$t.type], $t.text);})* EOF ;

When you let the rule above match your input "term > 1", the following gets printed:

WORD            'term'
RELATION        '>'
WORD            '1'

and for the input "term >1":

WORD            'term'
WORD            '>1'

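There's no way around this: when the lexer can match two (or more) characters with one rule (the WORD rule), it will choose that path over a rule that matches only a single character (the RELATION rule), even if the shorter rule is defined first. To get separate tokens, WORD must therefore not be able to match '<', '>' or '='.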
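The longest-match behaviour described above can be imitated outside ANTLR with a small Java sketch. This is a simplified model, not ANTLR 3's actual DFA-based lexer: the class name `LongestMatchDemo`, the regex transcriptions of the lexer rules, and the "first listed rule wins a tie" policy are all assumptions made for illustration. `NARROW_WORD` is a hypothetical variant of UNICODENOSPACES that leaves out '<', '=' and '>' (U+003C to U+003E), and RELATION is written so that it cannot match the empty string.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongestMatchDemo {

    // Regex transcription of the question's UNICODENOSPACES range (includes '<', '=', '>').
    static final String BROAD_WORD =
            "[\\u0021-\\u0027\\u0030-\\u0039\\u003B-\\u007E\\u00A1-\\uFFFF]+";

    // Hypothetical narrowed range: same as above, minus U+003C..U+003E.
    static final String NARROW_WORD =
            "[\\u0021-\\u0027\\u0030-\\u0039\\u003B\\u003F-\\u007E\\u00A1-\\uFFFF]+";

    // Builds an ordered rule table; only the WORD pattern varies between runs.
    static Map<String, Pattern> rules(String wordRegex) {
        Map<String, Pattern> rules = new LinkedHashMap<>();
        rules.put("WS", Pattern.compile("[\\t \\f\\u00A0]"));
        rules.put("RELATION", Pattern.compile("[<>]?=|[<>]"));
        rules.put("NUMBERS", Pattern.compile("[0-9]+"));
        rules.put("WORD", Pattern.compile(wordRegex));
        return rules;
    }

    // Tries every rule at the current position and keeps the longest match.
    // Assumption of this sketch: on equal length, the rule listed first wins.
    static List<String> tokenize(Map<String, Pattern> rules, String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : rules.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("No rule matches at index " + pos);
            }
            if (!bestName.equals("WS")) { // whitespace is dropped, like the HIDDEN channel
                tokens.add(bestName + " '" + input.substring(pos, bestEnd) + "'");
            }
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [WORD 'term', WORD '>1']
        System.out.println(tokenize(rules(BROAD_WORD), "term >1"));
        // prints [WORD 'term', RELATION '>', NUMBERS '1']
        System.out.println(tokenize(rules(NARROW_WORD), "term >1"));
    }
}
```

With the broad character range, ">1" is swallowed by WORD because that match is longer than the one-character RELATION match. Once the operator characters are excluded from WORD, the same input splits into separate tokens. How a real ANTLR 3 lexer resolves the remaining overlap between NUMBERS and WORD on digits may differ from this sketch, so the generated lexer should be checked with the token-dumping rule shown earlier.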

Also note that your RELATION rule:

RELATION : LESSTHEN? MORETHEN? EQUAL?;

potentially matches the empty string. Make sure every lexer rule matches at least 1 character, otherwise your lexer might get into an infinite loop.

It is better to do something like this:

RELATION
  : (LESSTHEN | MORETHEN)? EQUAL   // '<=', '>=', or '='
  | (LESSTHEN | MORETHEN)          // '<' or '>'
  ;
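To see the difference between the two RELATION rules without generating a lexer, both can be transcribed into Java regular expressions and tried against a few strings. The class name `RelationRuleCheck` and the regex transcriptions are assumptions made for this sketch; it only checks which strings each rule can match in full, not how ANTLR would tokenize them.

```java
import java.util.regex.Pattern;

public class RelationRuleCheck {

    // Transcription of: RELATION : LESSTHEN? MORETHEN? EQUAL?;
    static final Pattern ORIGINAL = Pattern.compile("<?>?=?");

    // Transcription of the rewritten rule with two alternatives.
    static final Pattern CORRECTED = Pattern.compile("[<>]?=|[<>]");

    // True when the whole string is matched by the given rule.
    static boolean accepts(Pattern rule, String text) {
        return rule.matcher(text).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"", "<", ">", "=", "<=", ">=", "<>"};
        for (String s : samples) {
            System.out.printf("%-5s original=%-6b corrected=%b%n",
                    "'" + s + "'", accepts(ORIGINAL, s), accepts(CORRECTED, s));
        }
    }
}
```

The original rule accepts the empty string because all three parts are optional, while the rewritten rule always consumes at least one character. A side effect worth noticing is that the rewritten rule no longer accepts "<>", which the original did.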

Published: 2023-08-04 13:57:00
Link: https://www.elefans.com/category/jswz/34/1416036.html