
ANTLR: Unicode Character Scanning

Problem: I can't get a Unicode character to print correctly.

Here is my grammar:

options {
    k=1;
    filter=true;
    // Allow any char but \uFFFF (16 bit -1)
    charVocabulary='\u0000'..'\uFFFE';
}

ANYCHAR
    :   '$'
    |   '_' { System.out.println("Found underscore: "+getText()); }
    |   'a'..'z' { System.out.println("Found alpha: "+getText()); }
    |   '\u0080'..'\ufffe' { System.out.println("Found unicode: "+getText()); }
    ;

Code snippet of the main method invoking the lexer:

public static void main(String[] args) {
    SimpleLexer simpleLexer = new SimpleLexer(System.in);
    while(true) {
        try {
            Token t = simpleLexer.nextToken();
            System.out.println("Token : "+t);
        } catch(Exception e) {}
    }
}

For the input "ठ", I get the following output:

Found unicode: 
Token : ["à",<5>,line=1,col=7]
Found unicode: 
Token : ["¤",<5>,line=1,col=8]
Found unicode:  
Token : [" ",<5>,line=1,col=9]

It appears that the lexer is treating the Unicode character "ठ" as three separate characters. My aim is to scan and print "ठ".

Your problem is not in the ANTLR-generated lexer, but in the Java stream you pass to it. The stream reads bytes only (it doesn't interpret them in any encoding), so what you see are the three bytes of the character's UTF-8 sequence, each treated as a separate character.

If it's ANTLR 3, you can use the ANTLRInputStream constructor that takes an encoding as a parameter:

ANTLRInputStream (InputStream input, String encoding) throws IOException
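The byte-versus-character effect described above can be reproduced without ANTLR. The following is a minimal sketch, not the lexer's actual code: the class name Utf8DecodeDemo and its two helper methods are made up for illustration. It reads the UTF-8 bytes of "ठ" (U+0920) once byte by byte and once through an InputStreamReader that decodes UTF-8. For a lexer like the one in the question, wrapping System.in in such a reader before handing it over would be the analogous fix, assuming the generated lexer has a constructor that accepts a java.io.Reader.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; none of these names come from ANTLR.
public class Utf8DecodeDemo {

    // Widens each raw byte to a char, which mimics what happens when
    // an undecoded InputStream is consumed one "character" at a time.
    static String readAsRawBytes(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Decodes the stream as UTF-8 first, so a multi-byte sequence
    // arrives as a single char.
    static String readAsUtf8(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // U+0920 is encoded in UTF-8 as the three bytes E0 A4 A0.
        byte[] input = "\u0920".getBytes(StandardCharsets.UTF_8);

        String raw = readAsRawBytes(new ByteArrayInputStream(input));
        String decoded = readAsUtf8(new ByteArrayInputStream(input));

        System.out.println(raw.length());     // prints 3
        System.out.println(decoded.length()); // prints 1
    }
}
```

The three chars produced by the raw read are U+00E0, U+00A4 and U+00A0, which matches the "à", "¤" and blank tokens in the output shown in the question.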
