
RS in awk language

I'm learning the awk programming language and I'm stuck on a problem.

I have a file (awk.dat) with the following content:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Maecenas pellentesque erat vel tortor consectetur condimentum.
Nunc enim orci, euismod id nisi eget, interdum cursus ex.
Curabitur a dapibus tellus.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Aliquam interdum mauris volutpat nisl placerat, et facilisis.

I'm using the command below:

awk 'BEGIN{RS="*, *";ORS="<<<---\n"} {print $0}' awk.dat

It returns this error:

awk: run time error: regular expression compile failed (missing operand)
*, *
    FILENAME="" FNR=0 NR=0

However, if I use the command awk 'BEGIN{RS=" *, *";ORS="<<<---\n"} {print $0}' awk.dat , it gives me the required result.

I need to understand this part: RS=" *, *" . What does the space between the opening double quote and the * before the comma mean, and why does leaving it out cause the error?

Expected output:

Lorem ipsum dolor sit amet<<<---
consectetur adipiscing elit.
Maecenas pellentesque erat vel tortor consectetur condimentum.
Nunc enim orci<<<---
euismod id nisi eget<<<---
interdum cursus ex.
Curabitur a dapibus tellus.
Lorem ipsum dolor sit amet<<<---
consectetur adipiscing elit.
Aliquam interdum mauris volutpat nisl placerat<<<---
et facilisis.
<<<---

Thanks.

"[space1]*,[space2]*"

is a regex. It matches a string consisting of:

zero or more spaces (space1), followed by a comma, followed by zero or more spaces (space2)

The first one, "*,[space]*", was wrong because * has a special meaning in regex: it repeats the preceding group or character zero or more times. You cannot put it at the very beginning, where there is nothing for it to repeat.
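To see what " *, *" actually matches, here is a minimal sketch using match(), which works the same way in any POSIX awk. The sample string "a , b" is made up for illustration.

```shell
# Sample input is an assumption for illustration, not taken from awk.dat.
# match() sets RSTART (start position) and RLENGTH (length) of the leftmost-longest match.
out=$(printf 'a , b\n' | awk '{
    if (match($0, / *, */))
        print RSTART, RLENGTH, "[" substr($0, RSTART, RLENGTH) "]"
}')
printf '%s\n' "$out"
```

The match starts at the space before the comma and also takes in the space after it, so the whole " , " run is treated as one separator.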

Be aware that, according to POSIX, RS is defined as a single character and not a regular expression.

The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.

source: Awk Posix standard

This implies that RS=" *, *" leads to unspecified behaviour under POSIX.
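If portability matters, one option is to keep RS to a single character and trim the surrounding spaces yourself. This is only a sketch under that assumption; the sample input is made up and deliberately has no trailing newline.

```shell
# Single-character RS is the only form POSIX specifies.
# The two sub() calls strip the spaces that " *, *" would have absorbed.
out=$(printf 'a , b,c' | awk 'BEGIN { RS = "," } {
    sub(/^ +/, "")   # drop leading spaces of the record
    sub(/ +$/, "")   # drop trailing spaces of the record
    print
}')
printf '%s\n' "$out"
```

With real multi-line input, the newlines stay inside the records, so the last record would still carry its trailing newline.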

Other versions of awk that implement extensions to POSIX may interpret RS differently. Examples are GNU awk and mawk. Both implement RS as a regular expression, but the two implementations differ slightly. With respect to the usage of <asterisk>, the summary is:

| RS   | awk (posix)  | gawk             | mawk             |
|------+--------------+------------------+------------------|
| "*"  | "<asterisk>" | "<asterisk>"     | "<asterisk>"     |
| "*c" | undefined    | "<asterisk>c"    | undefined        |
| "c*" | undefined    | "","c","ccc",... | "","c","ccc",... |

c is any character

The above should explain the OP's error: RS="*, *" is an invalid regular expression according to mawk.

$ echo "abc" | ./mawk '/*c/'
mawk: line 1: regular expression compile failed (missing operand)

GNU awk: The GNU awk manual states the following:

When using gawk, the value of RS is not limited to a one-character string. It can be any regular expression (see Regexp). (c.e.) In general, each record ends at the next string that matches the regular expression; the next record starts at the end of the matching string.

source: GNU awk manual

To understand the usage of <asterisk> in regular expressions in GNU awk, we find:

<asterisk> * This symbol means that the preceding regular expression should be repeated as many times as necessary to find a match. For example, ph* applies the * symbol to the preceding h and looks for matches of one p followed by any number of hs. This also matches just p if no hs are present.

There are two subtle points to understand about how * works. First, the * applies only to the single preceding regular expression component (e.g., in ph*, it applies just to the h). To cause * to apply to a larger subexpression, use parentheses: (ph)* matches ph, phph, phphph, and so on.

Second, * finds as many repetitions as possible. If the text to be matched is phhhhhhhhhhhhhhooey, ph* matches all of the hs.

source: GNU Regular expression operators

It must be mentioned, however, that:

In POSIX awk and gawk, the *, + and ? operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.

source: GNU Regular expression operators

Thus, setting RS="*, *" in GNU awk means that it matches a literal asterisk, a comma, and zero or more spaces, i.e. the strings "*,", "*, ", "*,  ", ...

$ echo "a,b, c" | awk 'BEGIN{RS="*, *"}1'
a,b, c
$ echo "a*,b, c" | awk 'BEGIN{RS="*, *"}1'
a
b, c

mawk: The mawk manual states the following:

12. Multi-line records
Since mawk interprets RS as a regular expression, multi-line records are easy.

source: man mawk

but

11. Splitting strings, records and files
Awk programs use the same algorithm to split strings into arrays with split(), and records into fields on FS. mawk uses essentially the same algorithm to split files into records on RS.

Split(expr,A,sep) works as follows:

  1. <snip>
  2. If sep = " " (a single space), then <SPACE> is trimmed from the front and back of expr, and sep becomes <SPACE>. mawk defines <SPACE> as the regular expression /[ \t\n]+/. Otherwise sep is treated as a regular expression, except that meta-characters are ignored for a string of length 1, e.g., split(x, A, "*") and split(x, A, /\*/) are the same.
  3. <snip>

source: man mawk
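The length-1 rule quoted above can be checked with a short sketch. The input string 'a*b*c' is made up; the expectation that both calls split identically follows from the manual text, and gawk treats a one-character separator other than a space the same way.

```shell
# A one-character separator string is taken literally, so "*" needs no escaping.
out_str=$(printf 'a*b*c\n' | awk '{ n = split($0, A, "*");  print n, A[1], A[2], A[3] }')
# The equivalent regex form has to escape the asterisk.
out_re=$(printf 'a*b*c\n'  | awk '{ n = split($0, A, /\*/); print n, A[1], A[2], A[3] }')
printf '%s\n%s\n' "$out_str" "$out_re"
```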

The manual makes no mention of how a regular expression starting with a meta-character should be interpreted (e.g. "*c").

Note: in the GNU awk section I struck through POSIX awk because, according to POSIX, a regular expression of the form "*, " leads to undefined behaviour. (This is independent of the definition of RS, as RS is not an ERE in POSIX awk anyway.)

The awk utility shall make use of the extended regular expression notation (see XBD Extended Regular Expressions)

source: Awk Posix standard

and

*+?{ The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:

  • If these characters appear first in an ERE, or immediately following an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>
  • If a <left-brace> is not part of a valid interval expression (see EREs Matching Multiple Characters)

source: POSIX Extended Regular Expressions
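If a literal asterisk really is part of the separator, putting it in a bracket expression sidesteps the undefined case, since the quoted text says the character is not special there. A minimal sketch with a made-up input string, using split() so that it does not depend on how a given awk treats RS:

```shell
# [*] is a bracket expression, so the asterisk is an ordinary character here.
# The regex matches "*," followed by zero or more spaces.
out=$(printf 'a*, b*,c\n' | awk '{
    n = split($0, parts, /[*], */)
    for (i = 1; i <= n; i++) print parts[i]
}')
printf '%s\n' "$out"
```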

Could you please try the following:

awk '{gsub(", ","<<<---" ORS)} 1;END{print "<<<---"}' Input_file
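The command above keeps the default newline RS and rewrites each line instead, so it stays within what POSIX specifies. As a variation (my assumption, not part of the original answer), the regex from the question can be used in gsub() so that spaces before the comma are absorbed too. The sample input is made up.

```shell
# / *, */ is the regex the question used for RS; here it drives gsub() instead.
# The final END block prints the closing marker after the last line.
out=$(printf 'a , b\nc\n' | awk '{ gsub(/ *, */, "<<<---\n") } 1; END { print "<<<---" }')
printf '%s\n' "$out"
```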
