How do I get the right encoding with ANTLR?

Question

I'm working on a school project: we are building a static code analyzer. One requirement is to analyse C# code from Java, which has gone well so far using ANTLR.

I wrote some example C# code in Visual Studio to scan with ANTLR, and I analyse every C# file in the solution. But it does not work. I seem to be getting a memory leak, with this error message:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.antlr.runtime.Lexer.emit(Lexer.java:151)
    at org.antlr.runtime.Lexer.nextToken(Lexer.java:86)
    at org.antlr.runtime.CommonTokenStream.fillBuffer(CommonTokenStream.java:119)
    at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:238)

After a while I suspected an encoding issue, because all the files are in UTF-8, and I think ANTLR can't read the encoded stream. So I opened Notepad++ and changed the encoding of every file to ANSI, and then it worked. I don't really understand what ANSI means: is it a character set or some kind of organisation?

I want to convert from whatever encoding a file has (probably UTF-8) to this ANSI encoding so that I no longer get memory leaks.

This is the code that creates the lexer and parser:

InputStream inputStream = new FileInputStream(new File(filePath));
CharStream charStream = new ANTLRInputStream(inputStream);
CSharpLexer cSharpLexer = new CSharpLexer(charStream);
CommonTokenStream commonTokenStream = new CommonTokenStream(cSharpLexer);
CSharpParser cSharpParser = new CSharpParser(commonTokenStream);

Does anyone know how to change the encoding of the InputStream to the right one?
And what does Notepad++ do when I change the encoding to ANSI?

Answer 1

When reading text files you should set the encoding explicitly. Try your example with the following change:

CharStream charStream = new ANTLRInputStream(inputStream, "UTF-8");
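To see why naming the charset matters, here is a minimal, ANTLR-free sketch that decodes the same UTF-8 bytes twice: once as UTF-8 and once as ISO-8859-1, used here only as a stand-in for a single-byte "ANSI" code page (that choice is an assumption). The class name `ExplicitCharsetDemo` and the helper `readAll` are hypothetical and not part of ANTLR.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {

    /** Reads the whole stream as text, decoding its bytes with the given charset. */
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "caf\u00e9" ends in a non-ASCII character, which takes two bytes in UTF-8.
        byte[] utf8Bytes = "caf\u00e9".getBytes(StandardCharsets.UTF_8);

        String decodedAsUtf8 = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8);
        String decodedAsLatin1 = readAll(new ByteArrayInputStream(utf8Bytes), StandardCharsets.ISO_8859_1);

        // The wrong charset turns each byte into its own character.
        System.out.println("UTF-8 decode length:      " + decodedAsUtf8.length());
        System.out.println("ISO-8859-1 decode length: " + decodedAsLatin1.length());
    }
}
```

Passing "UTF-8" to the ANTLRInputStream constructor, as in the line above, makes the same kind of explicit choice instead of relying on the platform default.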

Answer 2

I solved this issue by wrapping the InputStream in a buffered stream and then removing the byte order mark (BOM).

I guess my parser didn't like that encoding, because I had also tried setting the encoding explicitly.
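The answer does not show its code, so the following is only a sketch of one way to do what it describes, assuming the files start with the UTF-8 byte order mark (bytes EF BB BF). Java's UTF-8 decoder does not drop that mark by itself, so it would otherwise reach the lexer as a U+FEFF character. The names `BomSkipper`, `skipUtf8Bom`, and `decodeUtf8WithoutBom` are hypothetical.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class BomSkipper {

    /** Returns a stream positioned after a leading UTF-8 BOM (EF BB BF), if there is one. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        buffered.mark(3); // remember the start so we can rewind
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) { // a single read() may return fewer bytes than requested
            int r = buffered.read(head, n, 3 - n);
            if (r == -1) {
                break;
            }
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom) {
            buffered.reset(); // no BOM: rewind to the first byte
        }
        return buffered;
    }

    /** Demo helper: decodes the bytes as UTF-8 after dropping any leading BOM. */
    static String decodeUtf8WithoutBom(byte[] fileBytes) {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                skipUtf8Bom(new ByteArrayInputStream(fileBytes)), StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                text.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'u', 's', 'i', 'n', 'g'};
        byte[] withoutBom = {'u', 's', 'i', 'n', 'g'};
        System.out.println(decodeUtf8WithoutBom(withBom));
        System.out.println(decodeUtf8WithoutBom(withoutBom));
    }
}
```

In the code from the question, the stream returned by `skipUtf8Bom(new FileInputStream(new File(filePath)))` could be handed to `new ANTLRInputStream(stream, "UTF-8")` in place of the raw FileInputStream.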

2 answers

Answer 1: 1, 2012-05-03 14:19:37
Answer 2: -1, accepted, 2012-05-09 01:26:00
