
How do I match Unicode characters in ANTLR?

I am trying to pick out all tokens in a text and need to match all ASCII and Unicode characters, so here is how I have laid them out:

fragment CHAR     :  ('A'..'Z') | ('a'..'z');
fragment DIGIT    :  ('0'..'9');
fragment UNICODE  :  '\u0000'..'\u00FF';

Now if I write my token rule as:

TOKEN  :  (CHAR|DIGIT|UNICODE)+;

I get these two warnings:

  Decision can match input such as "'A'..'Z'" using multiple alternatives: 1, 3
  As a result, alternative(s) 3 were disabled for that input

  Decision can match input such as "'0'..'9'" using multiple alternatives: 2, 3
  As a result, alternative(s) 3 were disabled for that input

And nothing gets matched. The same happens if I write it as:

TOKEN  :  (UNICODE)+;

Nothing gets matched.

Is there a way of doing this?

One other thing to consider if you are planning on using Unicode: you should set the charVocabulary option to say that you want to allow any character in the Unicode range 0 through FFFE:

options
{
charVocabulary='\u0000'..'\uFFFE';
}

The default you'll usually see in the examples is:

options
{
charVocabulary = '\3'..'\377';
}

To cover the point made above: generally, if you needed both the ASCII character range 'A'..'Z' and the Unicode range, you'd make a Unicode lexer rule like '\u0080'..'\ufffe', which starts above the ASCII range so the two do not overlap.
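The warnings in the question arise because UNICODE covers '\u0000'..'\u00FF', which already contains 'A'..'Z', 'a'..'z' and '0'..'9'. Below is a minimal sketch of the non-overlapping layout, assuming ANTLR 3 syntax as used in the question. NON_ASCII is a hypothetical rule name chosen for illustration.

```antlr
// Sketch only: ANTLR 3 lexer rules with disjoint character ranges.
fragment CHAR      : 'A'..'Z' | 'a'..'z' ;
fragment DIGIT     : '0'..'9' ;
// Hypothetical name. Starts at \u0080, so it cannot match anything
// that CHAR or DIGIT already match.
fragment NON_ASCII : '\u0080'..'\uFFFE' ;

TOKEN : (CHAR | DIGIT | NON_ASCII)+ ;
```

Because no character belongs to more than one alternative, the "multiple alternatives" warning should no longer apply to this rule. Note that whitespace and ASCII punctuation match none of the three fragments, so they would need rules of their own.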

Practically speaking, TOKEN: (UNICODE)+ is completely useless.

Since every character is a token character, a rule like this applied to, say, a Java program will simply match the whole program and return it to you as one big token.

You really do need to break your characters down into different groups if you want to split your input into meaningful fragments.
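As a rough illustration of grouping, here is a sketch of a lexer that separates words, numbers, whitespace, and ASCII punctuation. It assumes ANTLR 3 with the Java target. The grammar name and all rule names are hypothetical, and the choice of groups is only one of many possible.

```antlr
lexer grammar Groups;   // hypothetical grammar name

// Each token rule draws on its own set of characters, so input is
// split wherever the character group changes.
WORD   : (LETTER | NON_ASCII)+ ;
NUMBER : DIGIT+ ;
// ASCII punctuation, written as escapes to avoid quoting issues.
PUNCT  : '\u0021'..'\u002F' | '\u003A'..'\u0040'
       | '\u005B'..'\u0060' | '\u007B'..'\u007E' ;
// Whitespace is sent to the hidden channel (ANTLR 3, Java target).
WS     : (' ' | '\t' | '\r' | '\n')+ { $channel = HIDDEN; } ;

fragment LETTER    : 'A'..'Z' | 'a'..'z' ;
fragment DIGIT     : '0'..'9' ;
fragment NON_ASCII : '\u0080'..'\uFFFE' ;
```

With rules like these, an input such as "foo 42" would be expected to yield a WORD and a NUMBER rather than one large token.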

It might help you to take a look at how the "pros" have done it. Here is a BNF grammar for Java, and here is the BNF for an identifier, which shows how they went to the trouble of grouping the characters:

identifier 
  ::= "a..z,$,_" { "a..z,$,_,0..9,unicode character over 00C0" } 
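The BNF above could be transcribed into an ANTLR lexer rule roughly as follows. This is a sketch assuming ANTLR 3 syntax. ID_START and ID_PART are hypothetical fragment names, including 'A'..'Z' is an assumption (the BNF lists only a..z), "over 00C0" is read here as starting at '\u00C0', and the upper bound '\uFFFE' is borrowed from the charVocabulary range mentioned earlier.

```antlr
// Sketch only: an identifier rule modelled on the BNF above.
fragment ID_START : 'a'..'z' | 'A'..'Z' | '$' | '_' ;       // 'A'..'Z' is an assumption
fragment ID_PART  : ID_START | '0'..'9' | '\u00C0'..'\uFFFE' ;

IDENTIFIER : ID_START ID_PART* ;
```

The first character comes from a narrower group than the rest, which is what keeps an identifier from starting with a digit.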
