Simple parsing questions using PEG.js

I'm trying to wrap my head around PEG by entering simple grammars into the PEG.js playground.

Example 1:

  • Input: "abcdef1234567ghijklmn8901opqrs"
  • Desired output: ["abcdef", "1234567", "ghijklmn", "8901", "opqrs"]
  • Actual output: ["abcdef", ["1234567", ["ghijklmn", ["8901", ["opqrs", ""]]]]]

This example pretty much works, but can I get PEG.js to not nest the resulting array to a million levels? I assume the trick is to use concat() instead of join() somewhere, but I can't find the spot.

start
  = Text

Text
  = Numbers Text
  / Characters Text
  / EOF

Numbers
  = numbers: [0-9]+ {return numbers.join("")}

Characters
  = text: [a-z]+ {return text.join("")}

EOF
  = !.

Example 2:

Same problem and code as Example 1, but with the Characters rule changed to the following, which I expected to produce the same result.

Characters
  = text: (!Numbers .)+ {return text.join("")}

The resulting output is:

[",a,b,c,d,e,f", ["1234567", [",g,h,i,j,k,l,m,n", ["8901", [",o,p,q,r,s", ""]]]]]

Why do I get all these empty matches?

Example 3:

Last question. This doesn't work at all. How can I make it work? And for bonus points, any pointers on efficiency? For example, should I avoid recursion if possible?

I'd also appreciate a link to a good PEG tutorial. I've read ( http://www.codeproject.com/KB/recipes/grammar_support_1.aspx ), but as you can see I need more help ...

  • Input: 'abcdefghijklmnop"qrstuvwxyz"abcdefg'
  • Desired output: ["abcdefghijklmnop", "qrstuvwxyz", "abcdefg"]
  • Actual output: "abcdefghijklmnop\"qrstuvwxyz\"abcdefg"

start
  = Words

Words
  = Quote
  / Text
  / EOF

Quote
  = quote: ('"' .* '"') Words {return quote.join("")}

Text
  = text: (!Quote . Words) {return text.join("")}

EOF
  = !.

I received a reply in the PEG.js Google Group that put me on the right track. I'm posting answers to all three problems in the hope that they can serve as a rudimentary tutorial for other PEG beginners like myself. Notice that no recursion is needed.

Example 1:

This is straightforward once you understand basic PEG idioms: repeat a rule with + instead of having it call itself, and the matches come back as one flat array.

start
  = Text+

Text
  = Numbers
  / Characters

Numbers
  = numbers: [0-9]+ {return numbers.join("")}

Characters
  = text: [a-z]+ {return text.join("")}
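To make the behaviour of Text+ concrete, here is a hand-written plain JavaScript stand-in for the grammar above. It is only a sketch, not the code PEG.js generates, and the function name tokenize is made up for this illustration. It tries the alternatives in order at the current position and pushes each match onto a single array, which is why nothing ends up nested.

```javascript
// Hand-written stand-in for `start = Text+` (illustrative; not PEG.js output).
// `tokenize` is a hypothetical name chosen for this sketch.
function tokenize(input) {
  // Sticky regexes only match at lastIndex, like a PEG rule tried at one position.
  const rules = [/[0-9]+/y, /[a-z]+/y]; // Numbers / Characters, in grammar order
  const out = [];
  let pos = 0;
  while (pos < input.length) {
    let matched = null;
    for (const re of rules) {
      re.lastIndex = pos;
      const m = re.exec(input);
      if (m) { matched = m[0]; break; } // ordered choice: first match wins
    }
    if (matched === null) {
      throw new SyntaxError(`Unexpected "${input[pos]}" at offset ${pos}`);
    }
    out.push(matched); // one flat array, no [head, rest] pairs
    pos += matched.length;
  }
  if (out.length === 0) throw new SyntaxError("Expected at least one match"); // the + in Text+
  return out;
}

console.log(tokenize("abcdef1234567ghijklmn8901opqrs"));
// → [ 'abcdef', '1234567', 'ghijklmn', '8901', 'opqrs' ]
```

The recursive version in the question nests because each Text match returns a two-element sequence whose second element is the next Text match.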

Example 2:

The problem here is a peculiar design choice in the PEG.js parser generator for peek expressions (&expr and !expr). Both look ahead in the input stream without consuming any characters, so I incorrectly assumed that they didn't return anything. However, they both return an empty string. That makes each match of (!Numbers .) a two-element array, and calling join("") on an array of arrays stringifies every inner pair with a comma. I hope the author of PEG.js changes this behavior, because (as far as I can tell) this is just unnecessary cruft that pollutes the output stream. Please correct me if I'm wrong about this!

Anyway, here is a workaround:

start
  = Text+

Text
  = Numbers
  / Words

Numbers
  = numbers: [0-9]+ {return numbers.join("")}

Words
  = text: Letter+ {return text.join("")}

Letter
  = !Numbers text: . {return text}
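The stray commas in the question's output can be reproduced in plain JavaScript, without PEG.js. The sketch below assumes the match for (!Numbers .)+ has the shape described above, one [predicateResult, character] pair per character; the variable names are only illustrative.

```javascript
// Assumed shape of the match for (!Numbers .)+ : one pair per character,
// where the first slot is whatever the ! predicate returned.
const pairs = [["", "a"], ["", "b"], ["", "c"]];

// join("") on the outer array calls toString() on each inner pair,
// and ["", "a"].toString() is ",a".
const joined = pairs.join("");
console.log(joined); // → ,a,b,c

// The same thing happens if the placeholder is undefined instead of "",
// because Array.prototype.join renders undefined as an empty string.
const withUndefined = [[undefined, "a"], [undefined, "b"], [undefined, "c"]].join("");
console.log(withUndefined); // → ,a,b,c

// Picking out the character before joining is what the Letter rule does.
const fixed = pairs.map(([, ch]) => ch).join("");
console.log(fixed); // → abc
```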

Example 3:

The problem is that an expression like ('"' .* '"') can never succeed. PEG repetition is always greedy and never backtracks, so .* consumes the rest of the input stream, closing quote included, and the final '"' has nothing left to match. Here is a solution (that incidentally needs the same peek workaround as in Example 2).

start
  = Words+

Words
  = QuotedString
  / Text

QuotedString
  = '"' quote: NotQuote* '"' {return quote.join("")}

NotQuote
  = !'"' char: . {return char}

Text
  = text: NotQuote+ {return text.join("")}
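The difference from regular expressions is easy to see in plain JavaScript. The two functions below are hand-written stand-ins (with made-up names, not PEG.js output) for the failing and the working quote expressions; a regex with the same shape succeeds only because regex engines backtrack.

```javascript
// Stand-in for the PEG expression '"' .* '"' (hypothetical helper).
function greedyQuote(input) {
  if (input[0] !== '"') return null;
  let pos = 1;
  while (pos < input.length) pos++; // .* takes every remaining character
  // No backtracking: the closing '"' is tested at end of input and fails.
  return input[pos] === '"' ? input.slice(1, pos) : null;
}

// Stand-in for '"' (!'"' .)* '"' (hypothetical helper).
function quotedString(input) {
  if (input[0] !== '"') return null;
  let pos = 1;
  while (pos < input.length && input[pos] !== '"') pos++; // stop before a quote
  return input[pos] === '"' ? input.slice(1, pos) : null;
}

console.log(greedyQuote('"qrstuvwxyz"'));      // → null
console.log(quotedString('"qrstuvwxyz"'));     // → qrstuvwxyz
console.log(/^".*"$/.test('"qrstuvwxyz"'));    // → true (the regex backtracks)
```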

For current versions of pegjs, you might try:

Example One

Input: "abcdef1234567ghijklmn8901opqrs"

Desired output: ["abcdef", "1234567", "ghijklmn", "8901", "opqrs"]

{
  /**
   * Deeply flatten an array.
   * @param  {Array} arr - array to flatten
   * @return {Array} - flattened array
   */
  const flatten = (arr) =>  Array.isArray(arr) ? arr.reduce((flat, elt) => flat.concat(Array.isArray(elt) ? flatten(elt) : elt), []) : arr
}

start = result:string {
  console.log(JSON.stringify(result))
  return result
}

string = head:chars tail:( digits chars? )* {
  return flatten([head,tail])
}

chars = [a-z]+ {
  return text()
}

digits = $[0-9]+ {
  return text()
}
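As a side note on the flatten helper, the sketch below runs it in plain JavaScript on the value the string rule would hand it for the sample input (one [digits, chars] pair per repetition of the group) and compares it with the built-in Array.prototype.flat. The null in the last case is an assumption about what an unmatched chars? yields; check it against your PEG.js version.

```javascript
// The helper from the grammar initializer, unchanged.
const flatten = (arr) => Array.isArray(arr)
  ? arr.reduce((flat, elt) => flat.concat(Array.isArray(elt) ? flatten(elt) : elt), [])
  : arr;

// [head, tail] for "abcdef1234567ghijklmn8901opqrs".
const parsed = ["abcdef", [["1234567", "ghijklmn"], ["8901", "opqrs"]]];

const viaHelper = flatten(parsed);
const viaBuiltin = parsed.flat(Infinity); // ES2019 built-in, same result here
console.log(viaHelper);
// → [ 'abcdef', '1234567', 'ghijklmn', '8901', 'opqrs' ]

// Assumption: input ending in digits leaves a null where chars? did not match.
const trailingDigits = ["abc", [["123", null]]];
const cleaned = trailingDigits.flat(Infinity).filter((part) => part !== null);
console.log(cleaned); // → [ 'abc', '123' ]
```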

Example 2

Should be easy to deduce from the answer above.

Example 3

Input: 'abcdefghijklmnop"qrstuvwxyz"abcdefg'

Desired output: ["abcdefghijklmnop", "qrstuvwxyz", "abcdefg"]

{
  /**
   * Deeply flatten an array.
   * @param  {Array} arr - array to flatten
   * @return {Array} - flattened array
   */
  const flatten = (arr) =>  Array.isArray(arr) ? arr.reduce((flat, elt) => flat.concat(Array.isArray(elt) ? flatten(elt) : elt), []) : arr
}

start = result:string {
  console.log(JSON.stringify(result))
  return result
}

string = head:chars tail:quote_chars* {
  return flatten([head,tail])
}

quote_chars = DQUOTE chars:chars {
  return chars
}

chars = [a-z]+ {
  return text()
}

DQUOTE = '"'

Example 1

start
  = alnums

alnums
  // + repeats the choice so the whole input is consumed and a flat array comes back
  = alnums:(alphas / numbers)+ {
    return alnums;
  }

alphas
  = alphas:$(alpha+)

numbers
  = numbers:$(number+)

number
  = [0-9]

alpha
  = [a-zA-Z]

Example 2

Ignore.

Example 3

> 'abcdefghijklmnop"qrstuvwxyz"abcdefg'.split('"')
[ 'abcdefghijklmnop',
  'qrstuvwxyz',
  'abcdefg' ]
