ANTLR: Unicode character scanning

Question

Problem: I can't get a Unicode character to print correctly.

Here is my grammar:

options {
    k=1;
    filter=true;
    // Allow any char but \uFFFF (16 bit -1)
    charVocabulary='\u0000'..'\uFFFE';
}

ANYCHAR
    :   '$'
    |   '_' { System.out.println("Found underscore: "+getText()); }
    |   'a'..'z' { System.out.println("Found alpha: "+getText()); }
    |   '\u0080'..'\ufffe' { System.out.println("Found unicode: "+getText()); }
    ;

Code snippet of the main method invoking the lexer:

public static void main(String[] args) {
    SimpleLexer simpleLexer = new SimpleLexer(System.in);
    while(true) {
        try {
            Token t = simpleLexer.nextToken();
            System.out.println("Token : "+t);
        } catch(Exception e) {}
    }
}

For input "ठ", I'm getting the following output:

Found unicode: 
Token : ["à",<5>,line=1,col=7]
Found unicode: 
Token : ["¤",<5>,line=1,col=8]
Found unicode:  
Token : [" ",<5>,line=1,col=9]

It appears that the lexer is treating the Unicode character "ठ" as three separate characters. My aim is to scan and print "ठ".

Answer 1

Your problem is not in the ANTLR-generated lexer, but in the Java stream you pass to it. The stream reads bytes only (it doesn't interpret them in any encoding), and what you see is the UTF-8 byte sequence for the character, with each byte reported as a separate character.

If it's ANTLR 3, you can use the ANTLRInputStream constructor that takes an encoding as a parameter:

ANTLRInputStream (InputStream input, String encoding) throws IOException
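
To illustrate the byte-versus-character point, here is a minimal sketch that uses only the JDK and no ANTLR classes. The class name Utf8BytesDemo and its two helper methods are hypothetical names made up for this illustration. It reads the UTF-8 encoding of "ठ" (U+0920) once byte by byte, the way a raw InputStream delivers it, and once through an InputStreamReader with an explicit UTF-8 charset.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of ANTLR or of the original question.
public class Utf8BytesDemo {

    // Reads one byte at a time and widens each byte to a char.
    // No decoding happens, so a multi-byte UTF-8 sequence becomes several chars.
    static String readAsBytes(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            sb.append((char) b);
        }
        return sb.toString();
    }

    // Wraps the stream in a Reader that decodes the bytes as UTF-8,
    // so each read() returns a decoded char instead of a raw byte.
    static String readAsUtf8(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            sb.append((char) c);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // U+0920 is the Devanagari letter from the question; UTF-8 encodes it as E0 A4 A0.
        byte[] utf8 = "\u0920".getBytes(StandardCharsets.UTF_8);

        String raw = readAsBytes(new ByteArrayInputStream(utf8));
        String decoded = readAsUtf8(new ByteArrayInputStream(utf8));

        System.out.println("Undecoded length: " + raw.length());   // 3
        System.out.println("Decoded length: " + decoded.length()); // 1
    }
}
```

The three undecoded chars are U+00E0, U+00A4 and U+00A0, which matches the "à", "¤" and blank tokens in the question's output. If the generated lexer offers a constructor that accepts a java.io.Reader (an assumption to check against your ANTLR version), passing it an InputStreamReader built like the one in readAsUtf8 should have the same effect as the encoding-aware ANTLR 3 constructor above.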

Answer 1: 6 · accepted · 2010-09-02 22:20:54
