
How do I get this encoding right with ANTLR?

I'm working on a project for school: we are building a static code analyzer. One requirement is to analyse C# code from Java, which has been going well so far with ANTLR.

I wrote some example C# code in Visual Studio to scan with ANTLR, and I analyse every C# file in the solution. But it does not work. I am getting what looks like a memory leak, with this error message:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.antlr.runtime.Lexer.emit(Lexer.java:151)
    at org.antlr.runtime.Lexer.nextToken(Lexer.java:86)
    at org.antlr.runtime.CommonTokenStream.fillBuffer(CommonTokenStream.java:119)
    at org.antlr.runtime.CommonTokenStream.LT(CommonTokenStream.java:238)

After a while I suspected an encoding issue, because all the files are in UTF-8, and I think the lexer can't read the encoded stream. So I opened Notepad++ and changed the encoding of every file to ANSI, and then it worked. I don't really understand what ANSI means: is it a single character set, or some kind of organisation?

I want to convert the input from whatever encoding it has (probably UTF-8) to this ANSI encoding, so that I no longer run out of memory.

This is the code that creates the lexer and parser:

InputStream inputStream = new FileInputStream(new File(filePath));
CharStream charStream = new ANTLRInputStream(inputStream);
CSharpLexer cSharpLexer = new CSharpLexer(charStream);
CommonTokenStream commonTokenStream = new CommonTokenStream(cSharpLexer);
CSharpParser cSharpParser = new CSharpParser(commonTokenStream);

  • Does anyone know how to change the encoding of the InputStream to the right encoding?
  • And what does Notepad++ do when I change the encoding to ANSI?

When reading text files you should set the encoding explicitly. Try your example with the following change:

CharStream charStream = new ANTLRInputStream(inputStream, "UTF-8");
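To see what an explicit encoding changes, here is a minimal sketch using only java.io, with no ANTLR involved. The class name ExplicitCharsetDemo and its decode helper are made up for illustration. It decodes the same bytes with a named charset instead of the platform default. Note that Java's UTF-8 decoder does not drop a leading byte order mark: the mark comes through as the character U+FEFF, so a file saved as "UTF-8 with BOM" still starts with one extra character even when the encoding is given explicitly.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;

// Hypothetical helper for illustration; not part of ANTLR.
final class ExplicitCharsetDemo {

    /** Reads the whole stream and decodes it with the given charset. */
    static String decode(InputStream in, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        // Naming the charset avoids depending on the platform default encoding.
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }
}
```

Decoding the two bytes C3 A9 as UTF-8 gives the single character "é", while decoding them as ISO-8859-1 gives two characters, which is the kind of mismatch an explicit encoding prevents.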

I solved this issue by wrapping the InputStream in a BufferedInputStream and then removing the byte order mark (BOM).

I guess my parser didn't like that encoding, because I had also tried setting the encoding explicitly.
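The answer does not show its code, so the following is only a sketch of how the BOM removal might look, assuming the files are UTF-8 and the mark is the three bytes EF BB BF. The class BomSkipper and the method skipUtf8Bom are hypothetical names, not part of ANTLR or the JDK. It uses the mark/reset support of BufferedInputStream to peek at the first three bytes and rewinds if they are not a BOM.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not the original poster's code.
final class BomSkipper {

    /** Returns a stream positioned just after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        BufferedInputStream buffered = new BufferedInputStream(in);
        buffered.mark(3); // remember the start so we can rewind if there is no BOM

        byte[] head = new byte[3];
        int total = 0;
        while (total < 3) { // read() may return fewer bytes than requested
            int count = buffered.read(head, total, 3 - total);
            if (count < 0) {
                break; // stream is shorter than three bytes
            }
            total += count;
        }

        boolean hasBom = total == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom) {
            buffered.reset(); // not a BOM: hand back the stream from its first byte
        }
        return buffered;
    }
}
```

With the code from the question, the cleaned stream could then be passed on as new ANTLRInputStream(BomSkipper.skipUtf8Bom(inputStream), "UTF-8").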
