
awk: fatal: Invalid regular expression when setting multiple field separators

I was trying to solve the question "Grep regex to select only 10 character" using awk. It involves a string XXXXXX[YYYYY--ZZZZZ, and the OP wants to print the text between the unique [ and -- strings in it.

If it were just one -, I would say use [-[] as the field separator (FS). This sets FS to be either - or [:

$ echo "XXXXXXX[YYYYY-ZZZZ" | awk -F'[-[]' '{print $2}'
YYYYY

The tricky point is that [ also has a special meaning, since it opens a character class, so for it to be interpreted correctly as one of the possible separators I did not write it in the first position. That is what [-[] does, so we are done matching either - or [.

However, in this case there is not one hyphen but two: I want to say either -- or [. I cannot say [--[] because inside brackets the hyphen also defines a range.
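
To make the single-character behaviour concrete, here is a small sketch (the variable name `fields` is only for illustration) that lists every field produced by the one-character separator [-[] on the two-hyphen input. It assumes a POSIX-style awk; the point is that each hyphen is its own separator, so an empty field appears between them.

```shell
# Split on any single "-" or "[" and list every field with its index.
fields=$(printf '%s\n' "XXXXXXX[YYYYY--ZZZZ" |
  awk -F'[-[]' '{ for (i = 1; i <= NF; i++) printf "%d=<%s>\n", i, $i }')
printf '%s\n' "$fields"
```

With that separator the second hyphen starts a new, empty third field rather than being part of one -- delimiter.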

What I can do is use -F"one pattern|another pattern", like:

$ echo "XXXXXXXaaYYYYYbbZZZZ" | awk -F"aa|bb" '{print $2}'
YYYYY

But if I try this with -- and [, I cannot get a proper result:

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F"--|[" '{print $2}'
awk: fatal: Invalid regular expression: /--|[/

In fact, it fails even when [ is just one of the alternatives next to a plain string:

$ echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"bb|[" '{print $2}'
awk: fatal: Invalid regular expression: /bb|[/

$ echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"bb|\[" '{print $2}'
awk: warning: escape sequence `\[' treated as plain `['
awk: fatal: Invalid regular expression: /bb|[/

$ echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"(bb|\[)" '{print $2}'
awk: warning: escape sequence `\[' treated as plain `['
awk: fatal: Unmatched [ or [^: /(bb|[)/

As you can see, I tried escaping [ and enclosing the pattern in parentheses, and nothing worked.

So: what can I do to set the field separator to either -- or [? Is it possible at all?

IMHO this is best explained by first looking at a regexp used by the split() function, since that shows explicitly what happens when a string is split into fields using a literal vs. a dynamic regexp; we can then relate that to field separators.

This uses a literal regexp (delimited by /s):

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,/\[|--/); print f[2]}'
YYYYY

and so requires the [ to be escaped so that it is taken literally, since [ is a regexp metacharacter.

These use a dynamic regexp (one stored as a string):

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,"\\[|--"); print f[2]}'
YYYYY

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk 'BEGIN{re="\\[|--"} {split($0,f,re); print f[2]}'
YYYYY

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -v re='\\[|--' '{split($0,f,re); print f[2]}'
YYYYY

and so require the [ to be escaped twice, since awk has to convert the string holding the regexp (a variable named re in the last 2 examples) to a regexp (which uses up one backslash) before it's used as the separator in the split() call (which uses up the second backslash).
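
One way to watch the first backslash being used up is to print the string once awk has stored it. This is only a sketch; the shell variable names `from_source` and `from_v` are made up for the illustration.

```shell
# The string awk holds after parsing a string constant in its own source.
from_source=$(awk 'BEGIN { re = "\\[|--"; print re }')
# The string awk holds after processing a -v assignment.
from_v=$(awk -v re='\\[|--' 'BEGIN { print re }')
printf '%s\n' "$from_source" "$from_v"
```

Both should show a single backslash left in front of the [, which is the one the regexp engine then consumes.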

This:

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -v re="\\\[|--" '{split($0,f,re); print f[2]}'
YYYYY

exposes the variable contents to the shell for its evaluation and so requires the [ to be escaped 3 times, since the shell parses the string first to try to expand shell variables etc. (which uses up one backslash), and then awk has to convert the string holding the regexp to a regexp (which uses up a second backslash) before it's used as the separator in the split() call (which uses up the third backslash).
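
The shell's share of that can be checked without awk at all. A minimal sketch, assuming a POSIX shell (the names `dq` and `sq` are only illustrative): the double-quoted form with three backslashes and the single-quoted form with two end up as the same string.

```shell
# Double quotes: the shell turns \\ into \ and leaves \[ untouched,
# so three typed backslashes are handed on as two.
dq="\\\[|--"
# Single quotes: the shell passes every character through unchanged.
sq='\\[|--'
printf '%s\n' "$dq" "$sq"
```

So by the time awk sees either argument, it is the two-backslash string from the earlier single-quoted example.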

A field separator is just a regexp stored in a variable named FS (like re above) with some extra semantics, so all of the above applies to it too, hence:

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F '\\[|--' '{print $2}'
YYYYY

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F "\\\[|--" '{print $2}'
YYYYY
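
The same reasoning should carry over to assigning FS inside the script. As a sketch (the shell variable `out` is only for illustration), FS set in a BEGIN block is a string constant in awk source, so it takes the two-backslash form:

```shell
# FS assigned as a string constant: one backslash is used up by awk's
# string parsing, the other by the regexp.
out=$(printf '%s\n' "XXXXXXX[YYYYY--ZZZZ" |
  awk 'BEGIN { FS = "\\[|--" } { print $2 }')
printf '%s\n' "$out"
```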

Note that we could have used a bracket expression instead of escaping to have the [ treated literally:

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk '{split($0,f,/[[]|--/); print f[2]}'
YYYYY

and then we don't have to worry about escaping the escapes as we add layers of parsing:

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F "[[]|--" '{print $2}'
YYYYY

$ echo "XXXXXXX[YYYYY--ZZZZ" | awk -F '[[]|--' '{print $2}'
YYYYY
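
To illustrate that point, the sketch below feeds the identical text [[]|--, with no backslashes anywhere, through a literal regexp, a string constant, and -v with both kinds of shell quotes. The variable names `input`, `a`, `b`, `c` and `d` are only for the example.

```shell
input="XXXXXXX[YYYYY--ZZZZ"
# Literal regexp.
a=$(printf '%s\n' "$input" | awk '{ split($0, f, /[[]|--/); print f[2] }')
# Dynamic regexp held in a string constant.
b=$(printf '%s\n' "$input" | awk '{ split($0, f, "[[]|--"); print f[2] }')
# Dynamic regexp passed with -v, shell single quotes.
c=$(printf '%s\n' "$input" | awk -v re='[[]|--' '{ split($0, f, re); print f[2] }')
# Dynamic regexp passed with -v, shell double quotes.
d=$(printf '%s\n' "$input" | awk -v re="[[]|--" '{ split($0, f, re); print f[2] }')
printf '%s\n' "$a" "$b" "$c" "$d"
```

Each variant should extract the same second field, since there is no backslash for any parsing layer to consume.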

Inside a double-quoted awk string you need a double backslash to escape a regex metacharacter so that it is treated as a literal character; with a single backslash, awk treats it as a string escape sequence. When that string is also wrapped in shell double quotes, the shell consumes one level of backslashes first, so each backslash has to be doubled again:

$ echo 'XXXXXXX[YYYYYbbZZZZ' | awk -v FS="bb|\\\\[" '{print $2}'
YYYYY

This works with GNU Awk 3.1.7:

echo "XXXXXXX[YYYYY--ZZZZ" | awk -F"--|[[]" '{print $2}'    
echo "XXXXXXX[YYYYYbbZZZZ" | awk -F"bb|[[]" '{print $2}'
