简体   繁体   中英

How to improve performance of Nearley-based parser

I'm trying to parse SQL INSERT statements like:

INSERT INTO `Album` (`Title`, `ArtistId`) VALUES ('Blue Moods', 89);

using the following grammar written for Nearley :

main -> statement:*

statement -> insert_clause values_clause ";"

insert_clause -> "INSERT INTO" identifier parenthesis

values_clause -> "VALUES" parenthesis

parenthesis -> "(" list ")"

list -> value ("," value):*

value -> identifier | literal

identifier -> %IDENTIFIER | %QUOTED_IDENTIFIER

literal -> %STRING | %NUMBER

I have removed all the post-processor code for clarity as I think it shouldn't have much of a performance impact. The most expensive thing I'm doing there is calling Array.flat() in post-processors of main and list rules.

The generated parser takes over 20 seconds to parse 1.6 MB of SQL. In contrast, a recursive-descent parser which I wrote manually to do the same thing, takes under a second. The parser also gets slower and slower as I give it more input, so it's definitely not linear time.

I found an issue from Nearley Github which mentions 3 possible culprits for performance problems with Nearley:

  1. Not using a Tokenizer - I do use a tokenizer, so not a problem.
  2. Having an ambiguous grammar - originally I had problems with ambiguity, but I've since fixed them and the grammar in here shouldn't IMHO have any issues.
  3. Using right-recursive rules.

The last one I'm not sure of. The list rule seems right-recursive to me, so I tried rewriting it as:

list -> (value ","):* value

But this had zero effect on performance.

I'm out of ideas now.

I'm pretty inexperienced with parser generators, so hopefully this is some sort of silly mistake that I'm making here.

PS. Ignore the "INSERT INTO" token. I know it should really be two separate tokens.

Discovered that the problems was with "expensive" post processor functions. Basically I had this:

main -> statement:* {% flatten %}

Turns out that instead of the post-processor function being executed when all statements are matched it gets executed at every increment:

  1. with 0 statements
  2. with 1 statement
  3. with 2 statements
  4. ...

As I was calling flatten() function which loops though the whole array, this resulted in O(n 2 ) complexity.

There's a nearley issue , which describes a similar problem, stating that this behavior is by design, as Nearley is a streaming parser.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM