Tcl regexp expression to catch words outside brackets

I want to parse: ([(A touch B) over C] touch {D touch E}) is good

Using: ( P1 touch P2) is good

I want to replace P1 and P2 with regular expressions to get P1 = [(A touch B) over C] and P2 = {D touch E}. My first idea is: ( (.*) touch (.*)) is good

But I got the wrong match: P1 = [(A touch B) over C] touch {D and P2 = E}
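
The mismatch is easy to reproduce. A minimal sketch, assuming the literal parentheses in the pattern are escaped as \( and \) (the pattern above does not show how they were written):

```tcl
set str {([(A touch B) over C] touch {D touch E}) is good}

# Both (.*) groups are greedy, so the first one runs up to the LAST
# " touch " in the string, which sits inside the braces.
if {[regexp {^\((.*) touch (.*)\) is good$} $str -> P1 P2]} {
    puts "P1 = $P1"
    puts "P2 = $P2"
}
```

This prints P1 = [(A touch B) over C] touch {D and P2 = E}, because the regular expression has no notion of being inside or outside a bracket pair.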

I want to break on the "touch" outside the brackets. Note: A, B, C... are examples, so we should use .*

Regular expression matching can't handle arbitrary expressions of this kind, but most parsers can. In the Tcllib standard library for Tcl there is a PEG (parsing expression grammar) parser generator in the pt (parser tools) module. The following defines a grammar (as a simple string) that can parse your example text, and also text where brackets of the same kind are nested:

set grammar {
PEG Reader (Datum)
      Datum       <- '(' <space>* P1 'touch' <space>* P2 ') is good' ;
      Word        <- <alpha>+ ;
      BrackExpr   <- '[' Expression+ ']' ;
      BraceExpr   <- '{' Expression+ '}' ;
      ParenExpr   <- '(' Expression+ ')' ;
void: Expression  <- (Word / BrackExpr / BraceExpr / ParenExpr) <space>* ;
      P1          <- Expression ;
      P2          <- Expression ;
END;
}

To use this, you need to follow some steps. First save the generated parser in a file:

package require fileutil
package require pt::pgen

fileutil::writeFile ./reader.tcl [pt::pgen peg $grammar oo -class Reader]

This code creates a file, reader.tcl, that contains a TclOO class that performs the parsing specified in the grammar. You source that file to make the class available:

source reader.tcl

If you are going to do that multiple times from the console, you need to destroy the class in between:

catch {Reader destroy} ; source reader.tcl

Then you make an instance of the class and put it to work:

Reader create reader
set str {([(A touch B) over C] touch {D touch E}) is good}
reader parset $str

(If you use the parse method instead, you can parse an open channel.)

The result of the parsing is the AST (abstract syntax tree):

Datum 0 47 {P1 1 21} {P2 28 38}

As you can see, it has found text for P1 at string indices 1 to 21, and for P2 at 28 to 38 (note that trailing whitespace is captured). You can use string range $str 1 21 to get the text for P1, or automate it:

proc Datum {from to args} {foreach arg $args {puts [uplevel 1 $arg]}}
proc P1 {from to} {string range $::str $from $to}
proc P2 {from to} {string range $::str $from $to}

% Datum 0 47 {P1 1 21} {P2 28 38}
[(A touch B) over C]
{D touch E}
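
If you would rather not define one proc per nonterminal, a generic walker over the AST is another option. A minimal sketch: astFields is a hypothetical helper name, and the AST literal below is copied from the result shown above rather than produced by running the parser. It trims the trailing whitespace that the grammar captures.

```tcl
# Hypothetical helper: map each direct child node of an AST to its text.
# A node is a list: name, start index, end index, then any child nodes.
proc astFields {ast str} {
    set fields [dict create]
    foreach child [lrange $ast 3 end] {
        lassign $child name from to
        dict set fields $name [string trim [string range $str $from $to]]
    }
    return $fields
}

set str {([(A touch B) over C] touch {D touch E}) is good}
set ast {Datum 0 47 {P1 1 21} {P2 28 38}}
set fields [astFields $ast $str]
puts [dict get $fields P1]
puts [dict get $fields P2]
```

In real use the second argument to astFields would be the same string that was passed to reader parset, since the indices in the AST refer to that string.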

If you use this, you might want to experiment with the definition of the nonterminal Word. Currently it only allows alphabetic characters. A definition like

      Word        <- (<alnum> / [;:/&!?*+.\'\"#])+ ;

allows digits and some punctuation too. There are more inclusive character sets, but <punct>, <graph>, and <print> all contain bracket characters. Even if the definition of Expression is changed

void: Expression  <- (BrackExpr / BraceExpr / ParenExpr / Word) <space>* ;

so that Word comes last, which lets the parser choose one of the *Expr nonterminals on encountering an opening bracket, the closing bracket will still be consumed by Word and not by the correct nonterminal expression. AFAICT this is a limitation of PEG parsers: a repetition is greedy, and the parser does not backtrack into it once it has matched. The grammar can be modified to deal with this, but it will very quickly become too complicated.

Documentation: pt (package)

You may use a regex pattern that matches either a [...] or a {...} substring before touch, and then the same pattern after it.

It will look like

\((\[[^][]*]|{[^{}]*}) touch (\[[^][]*]|{[^{}]*})\) is good

See the regex demo at regex101. The (\[[^][]*]|{[^{}]*}) construct is a capturing group that matches either a [, followed by 0+ chars other than [ and ], and then a ], or a {, followed by 0+ chars other than { and }, and then a }.

One remark: there is no way to match nested balanced bracketed substrings with Tcl regex.
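
A quick way to see that limit is to feed the pattern an input where brackets of the same kind are nested. This is only an illustrative sketch: the second input string is made up for the test, and the braces in the pattern are escaped so it can be written as one brace-quoted Tcl word.

```tcl
# The single-level pattern from above, with literal braces escaped.
set rx {\((\[[^][]*\]|\{[^{}]*\}) touch (\[[^][]*\]|\{[^{}]*\})\) is good}

set flat   {([(A touch B) over C] touch {D touch E}) is good}
set nested {([[A touch B] over C] touch {D touch E}) is good}

puts [regexp $rx $flat]     ;# 1: no [ or ] inside the outer [...]
puts [regexp $rx $nested]   ;# 0: the inner [ stops the [^][]* run
```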

When using Tcl code, you may build the regex dynamically:

set txt {([(A touch B) over C] touch {D touch E}) is good}
set squares {\[[^][]*\]}
set braces {\{[^{}]*\}}
set cap "($squares|$braces)"
set rx "\\($cap touch $cap\\) is good"
lassign [lrange [regexp -all -inline $rx $txt] 1 end] P1 P2
puts "$P1 ::: $P2"

Output: [(A touch B) over C] ::: {D touch E}

See the Tcl online demo.
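
If nesting has to be supported and Tcllib is not available, a small hand-written scanner is a third option. This is not taken from either answer above; it is a minimal sketch under stated assumptions. splitOutside is a hypothetical helper name. It only counts nesting depth, so it does not check that each closer matches the kind of its opener, and it does not require the word to stand at a word boundary.

```tcl
# Hypothetical helper: split $text at the first occurrence of $word that
# lies outside all parentheses, square brackets and braces.
# Returns a two-element list, or an empty list if there is no such place.
proc splitOutside {text word} {
    set openers "(\[\{"
    set closers ")\]\}"
    set depth 0
    set wlen [string length $word]
    set len  [string length $text]
    for {set i 0} {$i < $len} {incr i} {
        set ch [string index $text $i]
        if {[string first $ch $openers] >= 0} {
            incr depth
        } elseif {[string first $ch $closers] >= 0} {
            incr depth -1
        } elseif {$depth == 0
                  && [string range $text $i [expr {$i + $wlen - 1}]] eq $word} {
            set left  [string trim [string range $text 0 [expr {$i - 1}]]]
            set right [string trim [string range $text [expr {$i + $wlen}] end]]
            return [list $left $right]
        }
    }
    return {}
}

set str {([(A touch B) over C] touch {D touch E}) is good}
# Peel off the fixed frame with a regexp, then split what is left.
if {[regexp {^\((.*)\) is good$} $str -> inner]} {
    lassign [splitOutside $inner touch] P1 P2
    puts "P1 = $P1"
    puts "P2 = $P2"
}
```

This prints P1 = [(A touch B) over C] and P2 = {D touch E}. Because only the depth is tracked, inputs with same-kind nesting such as [[A touch B] over C] are split at the outer touch as well.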
