
Context-free grammar to recognize end-of-line whitespace

I am trying to write a context-free grammar to do something very simple: parse a string into a list of alternating sections of (1) end-of-line whitespace and (2) everything else. For example:

This.first.line...\n..and.this....second.line\n.\n..and.final.line

(showing " " as "." and newlines as "\n" for readability) is parsed as

"This.first.line", "...\n..", "and.this....second.line", "\n.\n..", "and.final.line"

I wrote this grammar:

string = raw_start | newline_start
raw_start = raw_section [newline_start]
newline_start = newline_section [raw_start]
raw_section = {any_character_except_newline}
newline_section = {whitespace_except_newline} new_line {any_whitespace_character}

But this is not correct, because {any_character_except_newline} will consume the spaces leading up to a newline, and I want those spaces included in the newline_section.

Is it possible to say "Consume spaces, unless they are right before a newline" without losing the context-free property of the grammar?

Sure, context-free is not a problem. Both "end-of-line whitespace" and "everything else" are regular languages, and every regular language is also context-free.

For reference, here are the regular expressions (formally regular, not "recognizable with some 'regex' package"). We suppose that A is the alphabet, and define:

NOTSPACE = { ∀x | x ∈ A ∧ x ≠ NL ∧ x ≠ SPACE }
NOTEOL   = { ∀x | x ∈ A ∧ x ≠ NL }
EVERYTHING_ELSE = { xωy | x,y ∈ NOTSPACE ∧ ω ∈ NOTEOL* } ⋃ NOTSPACE
EOL_WHITESPACE = { ωNLγ | ω,γ ∈ {SPACE, NL}* }
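As a rough illustration only, the two sets above can be written as patterns for Python's `re` module and used to split the example from the question. This is a sketch under assumptions: SPACE is taken to be the single character " ", NL is "\n", and the names `TOKEN` and `split_sections` are made up for this example. Whitespace that is not attached to a newline at the very start or end of the text is in neither set, so the sketch rejects it.

```python
import re

# Assumed translations of the set definitions: SPACE is " " and NL is "\n".
EVERYTHING_ELSE = r"[^ \n](?:[^\n]*[^ \n])?"   # starts and ends with a non-space
EOL_WHITESPACE = r"[ \n]*\n[ \n]*"             # spaces/newlines with at least one NL

TOKEN = re.compile(f"(?P<eol>{EOL_WHITESPACE})|(?P<other>{EVERYTHING_ELSE})")


def split_sections(text):
    """Split text into alternating EVERYTHING_ELSE / EOL_WHITESPACE sections."""
    sections = []
    pos = 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            # For example, trailing spaces with no newline after them.
            raise ValueError(f"no section matches at offset {pos}")
        sections.append(m.group())
        pos = m.end()
    return sections


if __name__ == "__main__":
    sample = "This first line   \n  and this    second line\n \n  and final line"
    for section in split_sections(sample):
        print(repr(section))
```

Because both patterns are greedy, each section comes out maximally long, which is the split the question asks for.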

That can easily be transformed into a CFG. (It's possible that the text ends with whitespace which doesn't include a newline. The following ignores that possibility, but it could easily be added):

S → Spaces
S → S Other
S → S EOL_WS
Spaces → ε
Spaces → Spaces [ ]
Other → [^ \n] Line [^ \n]
Other → [^ \n]
Line → ε
Line → Line [^\n]
EOL_WS → Spaces NL_Spaces
NL_Spaces → NL_Space
NL_Spaces → NL_Spaces NL_Space
NL_Space → [\n] Spaces
 

As written, the above is ambiguous because it does not insist that Other and EOL_WS be maximally long. That's easy to fix but tedious, and since the OP only asks for a CFG and not an unambiguous or LR(1) CFG, I'll leave it at that.
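To make the ambiguity concrete, here is a small Python sketch that lists every way a string can be cut into consecutive Other and EOL_WS pieces. It is only an illustration: the two patterns are my own regex renderings of the Other and EOL_WS productions, the name `segmentations` is hypothetical, and the sketch ignores the leading Spaces allowed by `S → Spaces`. It counts segmentations rather than parse trees, and it is exponential, so it is only meant for tiny inputs.

```python
import re

# Assumed regex renderings of two nonterminals from the grammar above.
OTHER = re.compile(r"[^ \n](?:[^\n]*[^ \n])?")   # Other
EOL_WS = re.compile(r" *(?:\n *)+")              # EOL_WS -> Spaces NL_Spaces


def segmentations(text):
    """Yield every split of text into pieces that are each Other or EOL_WS."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if OTHER.fullmatch(piece) or EOL_WS.fullmatch(piece):
            for rest in segmentations(text[end:]):
                yield [piece] + rest


if __name__ == "__main__":
    for split in segmentations("ab"):
        print(split)
```

For the input "ab" this yields both `['a', 'b']` and `['ab']`. Requiring each piece to be maximally long would keep only the second.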

This is a translation of rici's great answer into the EBNF format I used in my question:

string = raw_start | newline_start
raw_start = raw_section [newline_start]
newline_start = newline_section [raw_start]
raw_section = any_nonwhite_character [{any_character_except_newline} any_nonwhite_character]
newline_section = {whitespace_except_newline} new_line {any_whitespace_character}

The key was changing the definition of raw_section to require that it end with a nonwhite character. This simple grammar will not match empty strings or strings that end with a space, but that is easy to fix.
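For what it's worth, here is one possible way to handle those two cases, sketched as a hand-written Python scanner rather than as a grammar. The choices are assumptions of mine, not part of the grammar above: an empty string gives an empty list, and a run of whitespace at the very start or end of the text is reported as a whitespace section even if it contains no newline. "Whitespace" here means whatever `str.isspace` accepts, and the name `split_lenient` and the "raw"/"ws" labels are made up for the example.

```python
def split_lenient(text):
    """Split text into ("raw", s) and ("ws", s) sections, in order."""
    sections = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            # Whitespace section: take the whole run, newlines included.
            end = pos
            while end < len(text) and text[end].isspace():
                end += 1
            sections.append(("ws", text[pos:end]))
        else:
            # Raw section: runs to the last non-white character on this line.
            line_end = text.find("\n", pos)
            if line_end == -1:
                line_end = len(text)
            end = pos + len(text[pos:line_end].rstrip())
            sections.append(("raw", text[pos:end]))
        pos = end
    return sections


if __name__ == "__main__":
    print(split_lenient("first line  \n second line  "))
```

A raw section always starts on a non-white character, so stripping the rest of the line never leaves it empty and the loop always advances.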
