

What is a suitable lexer generator that I can use to strip identifiers from source files in many languages?

Original post: 2010-01-22 19:49:10. Tags: java, parsing, lexer

I'm working on a group project for my university which will be used for plagiarism detection in Computer Science.

My group is primarily working from the hashing/fingerprinting techniques described in this journal article: Winnowing: Local Algorithms for Document Fingerprinting. This is very similar to how the MOSS plagiarism detection system works.

We are basically taking k-gram hashes of fellow students' source code and looking them up in a database for relevant matches (along with a lot of optimization in how we determine which hashes to select as a document's fingerprints).
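As a rough illustration of the fingerprinting step (not the project's actual code), the sketch below hashes every k-gram and then applies the window-minimum selection described in the Winnowing paper: in each window of w consecutive hashes, keep the smallest one, preferring the rightmost on ties. The class and method names are made up for this example, and `String.hashCode()` is only a stand-in for the rolling hash a real implementation would more likely use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper class, for illustration only.
class WinnowingSketch {

    /** Hashes every k-character substring (k-gram) of the text, in order. */
    static List<Integer> kgramHashes(String text, int k) {
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + k <= text.length(); i++) {
            // Assumption: String.hashCode() stands in for a rolling hash.
            hashes.add(text.substring(i, i + k).hashCode());
        }
        return hashes;
    }

    /**
     * Returns the positions of the selected hashes. Each window of w
     * consecutive hashes contributes its minimum (rightmost on ties);
     * a position chosen by several windows is recorded only once.
     */
    static List<Integer> winnow(List<Integer> hashes, int w) {
        Set<Integer> selected = new LinkedHashSet<>();
        for (int start = 0; start + w <= hashes.size(); start++) {
            int minPos = start;
            for (int j = start + 1; j < start + w; j++) {
                if (hashes.get(j) <= hashes.get(minPos)) { // <= keeps the rightmost minimum
                    minPos = j;
                }
            }
            selected.add(minPos);
        }
        return new ArrayList<>(selected);
    }

    public static void main(String[] args) {
        // Sample hash sequence with window size 4.
        List<Integer> hashes = Arrays.asList(
            77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98);
        System.out.println(winnow(hashes, 4)); // prints [3, 6, 8, 11, 15]
    }
}
```

The selected positions, together with the hashes at those positions, would be what gets stored in and looked up from the database.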

The first aspect of our project is the "front-end" portion, which will hold some semantic knowledge about each file format our detection system can process. This will allow us to strip details from a document that are irrelevant to plagiarism detection. Basically, we want to be able to rename all variables in various programming languages to a constant string or letter.

What is a lightweight solution (a lexer generator or something similar) that we can use to help rename all variables in source files of different languages to constants?

Our project is being written in Java.

Ideally, I'd simply like to be able to define a grammar for each language, and then our front end would be able to rename all identifiers in that language's source files to some constant. We would then do this for each file format we want to support (Java, C++, Python, etc.).
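To make the goal concrete, here is a minimal sketch of the renaming step using only `java.util.regex`, with no lexer generator involved. It assumes C-style comments and quoted literals, takes the keyword list as a per-language parameter, and replaces every other word with a placeholder. `IdentifierStripper`, `normalize`, and the placeholder "V" are names invented for this example; a generated lexer would replace the hand-written pattern.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class IdentifierStripper {

    // Literals, comments and numbers are matched first so that words
    // inside them are never mistaken for identifiers.
    private static final Pattern TOKEN = Pattern.compile(
          "\"(?:\\\\.|[^\"\\\\])*\""            // double-quoted string literal
        + "|'(?:\\\\.|[^'\\\\])*'"              // single-quoted literal
        + "|//[^\\n]*"                          // line comment
        + "|/\\*.*?\\*/"                        // block comment
        + "|[0-9][A-Za-z0-9_.]*"                // number, including suffixes such as 10L or 0x1F
        + "|(?<word>[A-Za-z_][A-Za-z0-9_]*)",   // keyword or identifier
        Pattern.DOTALL);

    /** Replaces every non-keyword word in the source with the placeholder. */
    static String normalize(String source, Set<String> keywords, String placeholder) {
        Matcher m = TOKEN.matcher(source);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String word = m.group("word");
            String replacement;
            if (word == null) {
                replacement = m.group();      // literal, comment or number: copied unchanged
            } else if (keywords.contains(word)) {
                replacement = word;           // keywords are kept
            } else {
                replacement = placeholder;    // identifiers collapse to one constant
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Set<String> keywords = new HashSet<>(Arrays.asList("int", "return"));
        System.out.println(normalize("int total = count + 1;", keywords, "V"));
        // prints: int V = V + 1;
    }
}
```

With this approach, two submissions that differ only in their identifier names normalize to the same text, so their k-gram hashes match. Supporting another language would mean supplying its keyword set and, where its comment or literal syntax differs, a different token pattern.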

4 answers

For a lexer/parser generator, you should look at ANTLR. TXL, which is a textual transformation interpreter, is also worth a look. Ready-made grammars should be available for both.

In addition to ANTLR, which has already been suggested, you could also take a look at JFlex.

The acacia-lex lexer has a replace method.

In the lexer's token definitions, define what identifiers look like, for example, "ident1" -> "[a..d]", "ident2" -> "[e..h]".

In the input map passed to the replace method, specify which identifier type should be replaced with which constant (object), for example, "ident1" -> "ident1", "ident2" -> "ident2".

Be aware that there are some languages where it's not really possible to do what you're trying to do: specifically, those where the grammar alone cannot tell you what is or isn't a variable. Tcl is one example, but a number of other dynamic languages have the same issue (Lisp?).