Tcl regexp expression to capture the words in brackets

Question

I want to parse: ([(A touch B) over C] touch {D touch E}) is good

Using: ( P1 touch P2) is good

I want to replace P1 and P2 with regular expression groups so that I get P1 = [(A touch B) over C] and P2 = {D touch E}. My first idea was: ( (.*) touch (.*)) is good

But I got the wrong match: P1 = [(A touch B) over C] touch {D and P2 = E}

I want to break on the "touch" outside the brackets. Note: A, B, C... are only examples, so we should use .*

Answer 1

Regular expression matching can't handle arbitrary expressions of this kind, but most parsers can. In Tcllib, the standard library for Tcl, there is a PEG (parsing expression grammar) parser generator in the pt (parser tools) module. The following defines a grammar (as a simple string) that can parse your example text, and also text where brackets of the same kind are nested:

set grammar {
PEG Reader (Datum)
      Datum       <- '(' <space>* P1 'touch' <space>* P2 ') is good' ;
      Word        <- <alpha>+ ;
      BrackExpr   <- '[' Expression+ ']' ;
      BraceExpr   <- '{' Expression+ '}' ;
      ParenExpr   <- '(' Expression+ ')' ;
void: Expression  <- (Word / BrackExpr / BraceExpr / ParenExpr) <space>* ;
      P1          <- Expression ;
      P2          <- Expression ;
END;
}

To use this, you need to follow a few steps. First generate the parser and save it in a file:

package require fileutil
package require pt::pgen

fileutil::writeFile ./reader.tcl [pt::pgen peg $grammar oo -class Reader]

This code creates a file, reader.tcl, that contains a TclOO class that performs the parsing specified in the grammar. Source that file to make the class available:

source reader.tcl

If you are going to do that multiple times from the console, you need to destroy the class in between:

catch {Reader destroy} ; source reader.tcl

Then you make an instance of the class and put it to work:

Reader create reader
set str {([(A touch B) over C] touch {D touch E}) is good}
reader parset $str

(If you use the parse method instead, you can parse an open channel.)

The result of the parsing is the AST (abstract syntax tree):

Datum 0 47 {P1 1 21} {P2 28 38}

As you can see, it has found text for P1 at string indices 1 to 21, and for P2 at 28 to 38 (note that the trailing whitespace after P1 is captured). You can use string range $str 1 21 to get the text for P1, or automate it:

proc Datum {from to args} {foreach arg $args {puts [uplevel 1 $arg]}}
proc P1 {from to} {string range $::str $from $to}
proc P2 {from to} {string range $::str $from $to}

% Datum 0 47 {P1 1 21} {P2 28 38}
[(A touch B) over C]
{D touch E}
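As a variation on the procs above, here is a minimal sketch that reads the child nodes of the AST directly and trims the captured trailing whitespace. The helper name astChildren is hypothetical (it is not part of Tcllib), and the sketch assumes the AST has the flat shape shown above: a name, a start index, an end index, then child nodes.

```tcl
# Hypothetical helper (not part of Tcllib): map each direct child node of
# an AST to the text it spans in the original string.
proc astChildren {ast str} {
    set result [dict create]
    # A node is a list: name, start index, end index, then child nodes.
    foreach child [lrange $ast 3 end] {
        lassign $child name from to
        # Trim because the grammar lets P1 swallow trailing whitespace.
        dict set result $name [string trim [string range $str $from $to]]
    }
    return $result
}

set str {([(A touch B) over C] touch {D touch E}) is good}
set ast {Datum 0 47 {P1 1 21} {P2 28 38}}
set parts [astChildren $ast $str]
puts [dict get $parts P1]   ;# [(A touch B) over C]
puts [dict get $parts P2]   ;# {D touch E}
```

Returning a dict keeps the result usable by the caller instead of printing it, and avoids the global variable ::str.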

If you use this, you might want to experiment with the definition of the nonterminal Word. Currently it only allows alphabetic characters. A definition like

      Word        <- (<alnum> / [;:/&!?*+.\'\"#])+ ;

allows digits and some punctuation too. There are more inclusive character sets, but <punct>, <graph>, and <print> all contain bracket characters. Even if the definition of Expression is changed

void: Expression  <- (BrackExpr / BraceExpr / ParenExpr / Word) <space>* ;

so that Word comes last, which lets the parser choose one of the *Expr nonterminals on encountering an opening bracket, the closing bracket will still be consumed by Word and not by the correct nonterminal expression. AFAICT this is a limitation of PEG parsers, which do not backtrack into a repetition once it has matched. The grammar can be modified to deal with this, but it will very quickly become too complicated.

Documentation: pt (package)

Answer 2

You may use a regex pattern that matches either a [...] or a {...} substring before touch, and then the same pattern after it.

It will look like

\((\[[^][]*]|{[^{}]*}) touch (\[[^][]*]|{[^{}]*})\) is good

See the regex demo at regex101. The (\[[^][]*]|{[^{}]*}) construct is a capturing group that matches either a [ followed by zero or more characters other than [ and ] and then a ], or a { followed by zero or more characters other than { and } and then a }.

One remark: there is no way to match nested balanced bracketed substrings with Tcl regex.
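To make that remark concrete, the following sketch applies the pattern above (with the closing brackets and braces escaped) to two inputs. The variable names rx, flat and nested are chosen here for illustration only.

```tcl
# The pattern from above, brace-quoted so Tcl performs no substitution on it.
set rx {\((\[[^][]*\]|\{[^{}]*\}) touch (\[[^][]*\]|\{[^{}]*\})\) is good}

# No square brackets inside the square brackets: the pattern matches.
set flat {([(A touch B) over C] touch {D touch E}) is good}
puts [regexp $rx $flat]     ;# 1

# Square brackets nested inside square brackets: no match, because
# [^][]* cannot step over the inner "[".
set nested {([[A touch B] over C] touch {D touch E}) is good}
puts [regexp $rx $nested]   ;# 0
```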

When using Tcl code, you may build the regex dynamically:

set txt {([(A touch B) over C] touch {D touch E}) is good}
set squares {\[[^][]*\]}
set braces {\{[^{}]*\}}
set cap "($squares|$braces)"
set rx "\\($cap touch $cap\\) is good"
lassign [lrange [regexp -all -inline $rx $txt] 1 end] P1 P2
puts "$P1 ::: $P2"

Output: [(A touch B) over C] ::: {D touch E}

See the Tcl online demo.
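If brackets of the same kind may be nested and a parser generator is more than you need, one alternative (not taken from either answer) is to scan the string by hand and track bracket depth. The sketch below is only that: the proc name topLevelIndex is hypothetical, and it assumes the brackets are balanced and that the separator is exactly " touch " with single spaces.

```tcl
# Hypothetical helper: index of the first occurrence of $word in $s that
# lies outside every (), [] and {} pair, or -1 if there is none.
proc topLevelIndex {s word} {
    set depth 0
    set n [string length $s]
    set wlen [string length $word]
    for {set i 0} {$i < $n} {incr i} {
        set c [string index $s $i]
        if {[string first $c "(\[\{"] >= 0} {
            incr depth
        } elseif {[string first $c ")\]\}"] >= 0} {
            incr depth -1
        } elseif {$depth == 0
                  && [string range $s $i [expr {$i + $wlen - 1}]] eq $word} {
            return $i
        }
    }
    return -1
}

set txt {([(A touch B) over C] touch {D touch E}) is good}
# A regexp is still fine for the fixed prefix "(" and suffix ") is good".
if {[regexp {^\((.*)\) is good$} $txt -> inner]} {
    set sep " touch "
    set i [topLevelIndex $inner $sep]
    if {$i >= 0} {
        set P1 [string range $inner 0 [expr {$i - 1}]]
        set P2 [string range $inner [expr {$i + [string length $sep]}] end]
        puts "$P1 ::: $P2"   ;# [(A touch B) over C] ::: {D touch E}
    }
}
```

The scanner only counts depth and does not check that an opening bracket is closed by the matching kind, so malformed input is not detected.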

2 answers

Answer 1
2 2017-12-07 09:13:03

Answer 2
0 2017-12-06 17:18:04
