Has anyone been successful with Antlr v4 generating JavaScript for Hive?

My aim is to parse SQL (specifically Hive) statements with JavaScript, preferably Node.js. I started out with node-sql-parser, which looked promising. However, I found quite a few cases where the parser did not recognize valid SQL, such as several nested functions on a column in a select clause, and multiple AND clauses in SQL that had lots of joins, unions, etc. (I've logged this as an issue, but it will take some time to be resolved.)

I decided to look at Antlr v4. I followed the getting-started steps with the Hive SQL grammar ( https://github.com/apache/hive/blob/master/hplsql/src/main/antlr4/org/apache/hive/hplsql/Hplsql.g4 ). I generated the parser, lexer, and listener using Antlr's JavaScript target; all good so far. Then I tried a simple test as below:

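const antlr4 = require('antlr4'); // runtime that provides InputStream and CommonTokenStream
const HplsqlLexer = require('./HplsqlLexer');
const HplsqlParser = require('./HplsqlParser');

// Lex and parse a single statement with the generated classes
const input = "select * from table_a";
const chars = new antlr4.InputStream(input);
const lexer = new HplsqlLexer.HplsqlLexer(chars);
const tokens = new antlr4.CommonTokenStream(lexer);
const parser = new HplsqlParser.HplsqlParser(tokens);
parser.buildParseTrees = true;
const tree = parser.program(); // the ReferenceError is thrown here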

I believe program() is the entry point into the parser, but I could be wrong. This gave me "ReferenceError: _input is not defined" at the parser.program() line. I wondered whether Hplsql.g4 could be missing something but ruled that out. Then I looked at the generated code in HplsqlParser.js: I added var _input = "" at the top and reran; then it complained that LT is not defined. Feels like a rabbit hole.

Next steps include running the Java version of the Antlr parser, then Calcite (hplsql.org is not what I am looking for).
node --version: v15.2.1. Any suggestions or pointers would be helpful.

As mentioned in the comments by kaby76: the grammar contains target-specific (Java) code. You need to replace all Java code between { and }? with JavaScript code.

For example, this Java code:

{!_input.LT(2).getText().equalsIgnoreCase("TRANSACTION")}?

can be rewritten into this:

{this._input.LT(2).text.toUpperCase() !== 'TRANSACTION'}?

(not tested!)
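To see why the copied Java fails and the rewritten form works, here is a small stand-alone sketch. `FakeParser` and its stub token stream are hypothetical stand-ins, not the ANTLR runtime; the assumptions carried over from the real parser are that the token stream is stored on the instance as `_input` and that tokens expose a `text` property.

```javascript
// Hypothetical stand-in for a generated parser; not the ANTLR runtime.
class FakeParser {
  constructor(words) {
    // LT(k) returns the k-th lookahead token (1-based), like a token stream.
    this._input = { LT: (k) => ({ text: words[k - 1] }) };
  }

  // Java-style body copied verbatim: `_input` is a bare identifier in JavaScript.
  javaStylePredicate() {
    return !_input.LT(2).getText().equalsIgnoreCase('TRANSACTION');
  }

  // Rewritten body: the field is reached through `this`.
  jsStylePredicate() {
    return this._input.LT(2).text.toUpperCase() !== 'TRANSACTION';
  }
}

const parser = new FakeParser(['BEGIN', 'transaction']);

let javaStyleError = null;
try {
  parser.javaStylePredicate();
} catch (err) {
  javaStyleError = err; // same kind of error as in the question
}

const followedByOther = new FakeParser(['BEGIN', 'SELECT']).jsStylePredicate();
const followedByTransaction = parser.jsStylePredicate();
console.log(javaStyleError.name, followedByOther, followedByTransaction);
```

Java resolves a bare `_input` to the instance field, while JavaScript looks for a variable of that name and throws when none is declared. That is also why declaring `var _input = ""` only moves the failure to the next call.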

EDIT

I quickly did a global search and replace for the pattern _input\.LT\((\d+)\).getText\(\)\.equalsIgnoreCase\("(\w+)"\) with the replacement string (this._input.LT(\1).text.toUpperCase() === '\2'), which resulted in the following grammar: https://gist.github.com/bkiers/bb68b25ed03cf6c8ffae2709606d27a5
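The same search and replace can be scripted in Node. This is a minimal sketch, not the exact procedure used for the gist: the function name `convertPredicates` is made up for illustration, and uppercasing the string literal is an added safeguard so the comparison stays case-insensitive even if a literal in the grammar is not already uppercase.

```javascript
// Same pattern as above, written as a JavaScript regex with the global flag.
const JAVA_PREDICATE =
  /_input\.LT\((\d+)\)\.getText\(\)\.equalsIgnoreCase\("(\w+)"\)/g;

// Rewrites every Java lookahead comparison in the grammar text into JavaScript.
function convertPredicates(grammarText) {
  return grammarText.replace(
    JAVA_PREDICATE,
    (_match, k, word) =>
      `(this._input.LT(${k}).text.toUpperCase() === '${word.toUpperCase()}')`
  );
}

const sample = '{!_input.LT(2).getText().equalsIgnoreCase("TRANSACTION")}?';
console.log(convertPredicates(sample));
// prints {!(this._input.LT(2).text.toUpperCase() === 'TRANSACTION')}?
```

The parentheses around the replacement matter: a leading `!` in the original predicate then negates the whole comparison rather than only its left operand. To convert a grammar file, read it with `fs.readFileSync`, pass the text through the function, and write the result back with `fs.writeFileSync` before regenerating the parser.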

EDIT 2

I am surprised that Antlr even has a flag for -Dlanguage=JavaScript for parser generation. What's the point if it still is essentially Java?

The -Dlanguage=JavaScript option makes sure the lexer and parser classes are generated in JavaScript. What it does not do is rewrite semantic predicates, which are just copied as-is. For that reason, it is generally recommended not to use semantic predicates, and instead to move such target-specific code into visitor or listener classes.
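As a rough illustration of that recommendation, a check can live in a listener method instead of in the grammar. The sketch below is plain JavaScript with a stubbed context object so it runs on its own. The rule name `begin_end_block` (and so the callback name `enterBegin_end_block`) is an assumption about the grammar, and the class name is hypothetical; in a real project the class would extend the generated listener.

```javascript
// Hypothetical listener; in a real project it would extend the generated listener class.
class TransactionBlockListener {
  constructor() {
    this.problems = [];
  }

  // Assumed rule name `begin_end_block`; ANTLR names the callback enter<Rule>.
  enterBegin_end_block(ctx) {
    // getText() joins the token texts of the subtree; skipped whitespace is not included.
    const text = ctx.getText().toUpperCase();
    if (text.startsWith('BEGINTRANSACTION')) {
      this.problems.push('BEGIN TRANSACTION matched as a BEGIN ... END block');
    }
  }
}

// Stub context so the sketch runs without the ANTLR runtime.
const stubCtx = (text) => ({ getText: () => text });

const listener = new TransactionBlockListener();
listener.enterBegin_end_block(stubCtx('BEGINTRANSACTION'));
listener.enterBegin_end_block(stubCtx('BEGINSELECT1;END'));
console.log(listener.problems);
```

With the real runtime, a listener like this is typically driven by `antlr4.tree.ParseTreeWalker.DEFAULT.walk(listener, tree)`. A listener runs after parsing, so it can validate or annotate the tree but cannot steer which alternative the parser picks the way a predicate does. Predicates that disambiguate the grammar still need to be rewritten or designed out.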

