Tcl regexp expression to catch words outside brackets
I want to parse:
([(A touch B) over C] touch {D touch E}) is good
using the template:
( P1 touch P2) is good
I want to extract P1 and P2 with a regular expression, to get:
P1 = [(A touch B) over C]
P2 = {D touch E}
My first idea was:
( (.*) touch (.*)) is good
But I got the wrong match:
P1 = [(A touch B) over C] touch {D
P2 = E}
I want to split on the "touch" that is outside the brackets. Note: A, B, C... are only examples, so the pattern has to use .*
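The mismatch can be reproduced with a short sketch. It assumes the literal parentheses in the pattern are escaped (the pattern in the question is shown unescaped) and uses -> as a throwaway variable name for the full match:

```tcl
set txt {([(A touch B) over C] touch {D touch E}) is good}

# Greedy .* : the first group grabs as much as it can, so the split
# happens at the LAST " touch " instead of the one outside the brackets.
regexp {\((.*) touch (.*)\) is good} $txt -> P1 P2

puts "P1 = $P1"
puts "P2 = $P2"
```

The first group runs up to the last " touch " in the string, which is why P1 swallows the outer touch and P2 is left with only the tail.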
A regular expression match can't handle arbitrary expressions of this kind, but most parsers can. In the Tcllib standard library for Tcl there is a PEG (parsing expression grammar) parser generator in the pt (parser tools) module. The following defines a grammar (as a simple string) that can parse your example text, and also text where brackets of the same kind are nested:
set grammar {
PEG Reader (Datum)
Datum <- '(' <space>* P1 'touch' <space>* P2 ') is good' ;
Word <- <alpha>+ ;
BrackExpr <- '[' Expression+ ']' ;
BraceExpr <- '{' Expression+ '}' ;
ParenExpr <- '(' Expression+ ')' ;
void: Expression <- (Word / BrackExpr / BraceExpr / ParenExpr) <space>* ;
P1 <- Expression ;
P2 <- Expression ;
END;
}
To use this, you need to follow a few steps. First, generate the parser and save it in a file:
package require fileutil
package require pt::pgen
fileutil::writeFile ./reader.tcl [pt::pgen peg $grammar oo -class Reader]
This code creates a file, reader.tcl, that contains a TclOO class that performs the parsing specified in the grammar. You source that file to make the class available:
source reader.tcl
If you are going to do that multiple times from the console, you need to destroy the class in between:
catch {Reader destroy} ; source reader.tcl
Then you make an instance of the class and put it to work:
Reader create reader
set str {([(A touch B) over C] touch {D touch E}) is good}
reader parset $str
(If you use the parse method instead, you can parse an open channel.)
The result of the parsing is the AST (abstract syntax tree):
Datum 0 47 {P1 1 21} {P2 28 38}
As you can see, it has found text for P1 at string indices 1 to 21, and for P2 at 28 to 38 (note that the range for P1 includes the trailing space, because Expression consumes trailing whitespace). You can use string range $str 1 21 to get the text for P1, or automate it:
proc Datum {from to args} {foreach arg $args {puts [uplevel 1 $arg]}}
proc P1 {from to} {string range $::str $from $to}
proc P2 {from to} {string range $::str $from $to}
% Datum 0 47 {P1 1 21} {P2 28 38}
[(A touch B) over C]
{D touch E}
If you use this, you might want to experiment with the definition of the nonterminal Word. Currently it only allows alphabetic characters. A definition like
Word <- (<alnum> / [;:/&!?*+.\'\"#])+ ;
allows digits and some punctuation too. There are more inclusive character sets, but <punct>, <graph>, and <print> all contain bracket characters. Even if the definition of Expression is changed to
void: Expression <- (BrackExpr / BraceExpr / ParenExpr / Word) <space>* ;
so that Word comes last, which lets the parser choose one of the *Expr nonterminals on encountering an opening bracket, the closing bracket will still be consumed by Word and not by the correct nonterminal expression. AFAICT this is a limitation of PEG parsers: repetition is greedy, and the parser does not backtrack into it to give characters back. The grammar can be modified to deal with this, but it will very quickly become too complicated.
Documentation: pt (package)
You may use a regex pattern that matches either a [...] or a {...} substring before touch, and then the same pattern after it.
It will look like
\((\[[^][]*]|{[^{}]*}) touch (\[[^][]*]|{[^{}]*})\) is good
See the regex demo at regex101. The (\[[^][]*]|{[^{}]*}) construct is a capturing group that matches either a [, then 0 or more characters other than [ and ], and then a ]; or a {, followed by 0 or more characters other than { and }, and then a }.
One remark: there is no way to match nested balanced bracketed substrings with Tcl regex, so this pattern only works when brackets of the same kind are not nested inside each other.
When using Tcl code, you may build the regex dynamically:
set txt {([(A touch B) over C] touch {D touch E}) is good}
set squares {\[[^][]*\]}
set braces {\{[^{}]*\}}
set cap "($squares|$braces)"
set rx "\\($cap touch $cap\\) is good";
lassign [lrange [regexp -all -inline $rx $txt] 1 end] P1 P2
puts "$P1 ::: $P2"
Output: [(A touch B) over C] ::: {D touch E}
See the Tcl online demo.
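Where brackets of the same kind may be nested, which a Tcl regex cannot match, a small hand-written scanner is one alternative to a full parser. The following is only a sketch: splitTopLevel is a hypothetical helper name, not part of Tcl or Tcllib, and it assumes the brackets are well balanced (it counts depth but does not check that each closer matches its opener).

```tcl
# Hypothetical helper: split $text at the first occurrence of $sep that
# lies outside any (), [] or {} pair. Returns a two-element list, or an
# empty list when no top-level separator is found.
proc splitTopLevel {text sep} {
    set openers "(\[\{"
    set closers ")\]\}"
    set depth 0
    set seplen [string length $sep]
    set n [string length $text]
    for {set i 0} {$i < $n} {incr i} {
        set ch [string index $text $i]
        if {[string first $ch $openers] >= 0} {
            incr depth
        } elseif {[string first $ch $closers] >= 0} {
            incr depth -1
        } elseif {$depth == 0
                  && [string range $text $i [expr {$i + $seplen - 1}]] eq $sep} {
            # Separator found at nesting depth 0: cut the text here.
            set left  [string range $text 0 [expr {$i - 1}]]
            set right [string range $text [expr {$i + $seplen}] end]
            return [list $left $right]
        }
    }
    return {}
}

set txt {([(A touch B) over C] touch {D touch E}) is good}

# Strip the fixed outer "(" ... ") is good" frame, then split the inside.
if {[regexp {^\((.*)\) is good$} $txt -> inner]} {
    lassign [splitTopLevel $inner " touch "] P1 P2
    puts "P1 = $P1"
    puts "P2 = $P2"
}
```

The regex is used only for the fixed frame around the expression; the decision about which touch to split on is made by the depth counter, so any touch inside a bracketed group is skipped regardless of how deeply it is nested.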