
What is a suitable lexer generator that I can use to strip identifiers from many language source files?

I'm working on a group project for my university that will be used for plagiarism detection in Computer Science.

My group is primarily going off the hashing/fingerprinting techniques described in the article "Winnowing: Local Algorithms for Document Fingerprinting". This is very similar to how the MOSS plagiarism detection system works.

We are basically taking k-gram hashes of fellow students' source code and looking them up in a database for relevant matches (along with a lot of optimization in how we determine which hashes to select as a document's fingerprints).
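For readers unfamiliar with the technique, the k-gram hashing and fingerprint selection described above can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: the class name `WinnowSketch`, the parameters `k` (k-gram length) and `w` (window size), and the use of `String.hashCode()` in place of a rolling hash are all assumptions made for the example. It also keeps only the selected hash values, whereas a real system would record their positions as well.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class WinnowSketch {

    /** Hashes every k-character substring (k-gram) of the text, in order. */
    static List<Integer> kgramHashes(String text, int k) {
        List<Integer> hashes = new ArrayList<>();
        for (int i = 0; i + k <= text.length(); i++) {
            // Assumption: String.hashCode() stands in for a rolling hash.
            hashes.add(text.substring(i, i + k).hashCode());
        }
        return hashes;
    }

    /**
     * Slides a window over w consecutive hashes and keeps the minimum hash
     * of each window. Returns an empty set if there are fewer than w hashes.
     */
    static Set<Integer> winnow(List<Integer> hashes, int w) {
        Set<Integer> fingerprints = new LinkedHashSet<>();
        for (int start = 0; start + w <= hashes.size(); start++) {
            int min = hashes.get(start);
            for (int j = start + 1; j < start + w; j++) {
                min = Math.min(min, hashes.get(j));
            }
            fingerprints.add(min); // duplicates from overlapping windows collapse
        }
        return fingerprints;
    }

    public static void main(String[] args) {
        String doc = "int V = V + V; return V;";
        Set<Integer> fingerprints = winnow(kgramHashes(doc, 5), 4);
        System.out.println(fingerprints.size() + " fingerprints selected");
    }
}
```

With this selection rule, any passage of at least w + k - 1 characters shared by two documents covers one full window in both, so the two fingerprint sets have at least one hash in common.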

The first aspect of our project is the "front-end" portion, which will hold some semantic knowledge about each file format our detection system can process. This will allow us to strip details from a document that we don't want for the purpose of plagiarism detection. Basically, we want to be able to rename all variables in various programming languages to a constant string or letter.

What is a lightweight solution (a lexer generator or something similar) that we can use to help rename all variables in source files written in different languages to constants?

Our project is being written in Java.

Ideally, I'd simply like to be able to define a grammar for each language, and then our front end will be able to rename all identifiers in that language's source file to some constant. We would then do this for each file format we want to support (Java, C++, Python, etc.).
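To make the goal concrete, here is a rough sketch of the kind of front end described above, using only `java.util.regex` rather than a generated lexer. Everything in it is an assumption for illustration: the class name `IdentifierStripper`, the constant `V`, and the token pattern, which only covers C-like comments, quoted literals, and numbers. A language such as Python would need its own pattern (for `#` comments, for example). Note also that a lexer alone cannot tell variables from type or method names, so this replaces every non-keyword identifier. It relies on `Matcher.appendReplacement(StringBuilder, String)`, which requires Java 9 or later.

```java
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IdentifierStripper {

    // Alternatives are tried left to right at each position. Comments,
    // string/char literals, and numbers are matched first (group "skip") so
    // that words inside them are never mistaken for identifiers.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<skip>//[^\\n]*"                       // line comment
        + "|/\\*.*?\\*/"                          // block comment
        + "|\"(?:\\\\.|[^\"\\\\])*\""             // double-quoted literal
        + "|'(?:\\\\.|[^'\\\\])*'"                // single-quoted literal
        + "|[0-9][A-Za-z0-9_.]*)"                 // number such as 10L or 0x1F
        + "|(?<word>[A-Za-z_][A-Za-z0-9_]*)",     // keyword or identifier
        Pattern.DOTALL);

    private final Set<String> keywords; // per-language reserved words

    IdentifierStripper(Set<String> keywords) {
        this.keywords = keywords;
    }

    /** Replaces every non-keyword identifier with the constant "V". */
    String strip(String source) {
        Matcher m = TOKEN.matcher(source);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String word = m.group("word"); // null when "skip" matched instead
            boolean isIdentifier = word != null && !keywords.contains(word);
            String replacement = isIdentifier ? "V" : m.group();
            // quoteReplacement: copied text may contain '$' or '\'.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        IdentifierStripper javaLike = new IdentifierStripper(Set.of("int", "return"));
        System.out.println(javaLike.strip("int total = count + 1; // add count"));
    }
}
```

Supporting another language in this sketch would mean supplying a different keyword set and, where the comment or literal syntax differs, a different token pattern.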

For a lexer/parser generator, you should look at ANTLR. TXL, which is a textual transformation interpreter, is also worth a look. Ready-made grammars should be available for both.

In addition to ANTLR, which has already been suggested, you could also take a look at JFlex.

The acacia-lex lexer has a replace method.

In the lexer's token definitions, define what identifiers look like, for example "ident1" -> "[a..d]", "ident2" -> "[e..h]".

In the input map of the replace method, specify which identifier type to replace with which constant (object), for example "ident1" -> "ident1", "ident2" -> "ident2".

Be aware that there are some languages where it's not really possible to do what you're trying to do: specifically, those where you can't determine from the grammar alone what is or isn't a variable. Tcl is one example, but a number of other dynamic languages have the same issue (Lisp?).

