
RS in awk language

I'm learning the awk programming language and I'm stuck on a problem.

I have a file (awk.dat) with the following content:

Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Maecenas pellentesque erat vel tortor consectetur condimentum.
Nunc enim orci, euismod id nisi eget, interdum cursus ex.
Curabitur a dapibus tellus.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Aliquam interdum mauris volutpat nisl placerat, et facilisis.

I'm using the command below:

awk 'BEGIN{RS="*, *";ORS="<<<---\n"} {print $0}' awk.dat

It returns this error:

awk: run time error: regular expression compile failed (missing operand)
*, *
    FILENAME="" FNR=0 NR=0

By contrast, if I use the command awk 'BEGIN{RS=" *, *";ORS="<<<---\n"} {print $0}' awk.dat, I get the required result.

I need to understand this part: RS=" *, *". What does the space between the opening double quote and the * before the comma mean, and why does leaving it out cause the error?

Expected Output:

Lorem ipsum dolor sit amet<<<---
consectetur adipiscing elit.
Maecenas pellentesque erat vel tortor consectetur condimentum.
Nunc enim orci<<<---
euismod id nisi eget<<<---
interdum cursus ex.
Curabitur a dapibus tellus.
Lorem ipsum dolor sit amet<<<---
consectetur adipiscing elit.
Aliquam interdum mauris volutpat nisl placerat<<<---
et facilisis.
<<<---

Thanks.

"[space1]*,[space2]*"

is a regex. It matches:

zero or more spaces (space1), followed by a comma, followed by zero or more spaces (space2).

The first one, "*,[space]*", is wrong because * has a special meaning in a regex: it repeats the preceding character or group zero or more times. At the very beginning of the pattern there is nothing for it to repeat, so you cannot put it there.
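To see what " *, *" matches, here is a small sketch that wraps every match in brackets. It uses gsub() rather than RS, so it does not depend on regex support in RS; the sample input string is made up for illustration.

```shell
# Wrap each match of the ERE " *, *" in brackets; "&" stands for the matched text.
# The input string is a made-up sample.
out=$(printf 'a,b , c ,d\n' | awk '{ gsub(/ *, */, "[&]"); print }')
printf '%s\n' "$out"
# prints: a[,]b[ , ]c[ ,]d
```

Each match contains exactly one comma plus whatever spaces sit directly around it.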

Be aware that, according to POSIX, RS is defined as a single character and not a regular expression.

The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.

source: Awk Posix standard

This implies that, under POSIX, RS=" *, *" gives unspecified results.
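If portability matters, one possible workaround (a sketch, not the only way) is to keep RS a single character and trim the surrounding spaces from each record yourself. The sample input is made up for illustration.

```shell
# RS is a single character, which POSIX defines, so no regex support in RS is needed.
# gsub() then strips leading and trailing spaces from each record.
out=$(printf 'Lorem ipsum, dolor , sit\n' |
  awk 'BEGIN { RS = ","; ORS = "<<<---\n" }
       { gsub(/^ +| +$/, ""); print }')
printf '%s\n' "$out"
```

The last record still contains the input's final newline, which is why the last "<<<---" lands on a line of its own, just as in the expected output of the question.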

Other versions of awk, which implement extensions to POSIX, might take a different approach to what RS stands for. Examples are GNU awk and mawk. Both treat RS as a regular expression, but the two implementations differ slightly. With respect to the usage of <asterisk>, the summary is:

| RS   | awk (posix)  | gawk             | mawk             |
|------+--------------+------------------+------------------|
| "*"  | "<asterisk>" | "<asterisk>"     | "<asterisk>"     |
| "*c" | undefined    | "<asterisk>c"    | undefined        |
| "c*" | undefined    | "","c","ccc",... | "","c","ccc",... |

c is any character

The above explains the OP's error: RS="*, *" is an invalid regular expression according to mawk.

$ echo "abc" | ./mawk '/*c/'
mawk: line 1: regular expression compile failed (missing operand)
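To find out how the awk on your own system treats a leading asterisk, you can probe it. This is only a sketch: it assumes that an implementation which rejects the regex exits with a non-zero status, and the variable name result is arbitrary.

```shell
# Probe whether the local awk accepts a regex that starts with '*'.
# Assumption: a rejected regex makes awk exit with a non-zero status.
if printf 'a*c\n' | awk '/*c/' >/dev/null 2>&1; then
  result="accepted"
else
  result="rejected"
fi
printf 'leading asterisk: %s\n' "$result"
```

The answer depends on the implementation, so no fixed output is shown here.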

GNU awk: The manual of GNU awk states the following:

When using gawk, the value of RS is not limited to a one-character string. It can be any regular expression (see Regexp). (c.e.) In general, each record ends at the next string that matches the regular expression; the next record starts at the end of the matching string.

source: GNU awk manual

To understand the usage of <asterisk> in the regular expression in GNU awk, we find:

<asterisk> * This symbol means that the preceding regular expression should be repeated as many times as necessary to find a match. For example, ph* applies the * symbol to the preceding h and looks for matches of one p followed by any number of hs. This also matches just p if no hs are present.

There are two subtle points to understand about how * works. First, the * applies only to the single preceding regular expression component (e.g., in ph*, it applies just to the h). To cause * to apply to a larger subexpression, use parentheses: (ph)* matches ph, phph, phphph, and so on.

Second, * finds as many repetitions as possible. If the text to be matched is phhhhhhhhhhhhhhooey, ph* matches all of the hs.

source: GNU Regular expression operators

It must be mentioned, however that:

In POSIX awk and gawk, the *, + and ? operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.

source: GNU Regular expression operators

Thus, in gawk, setting RS="*, *" means that it matches the strings "*,", "*, ", "*,  ", ... (a literal asterisk and a comma, followed by any number of spaces).

$ echo "a,b, c" | awk 'BEGIN{RS="*, *"}1'
a,b, c
$ echo "a*,b, c" | awk 'BEGIN{RS="*, *"}1'
a
b, c

mawk: The manual of mawk states the following:

12. Multi-line records
Since mawk interprets RS as a regular expression , multi-line records are easy.

source: man mawk

but

11. Splitting strings, records and files
Awk programs use the same algorithm to split strings into arrays with split() , and records into fields on FS . mawk uses essentially the same algorithm to split files into records on RS .

Split(expr,A,sep) works as follows:

  1. <snip>
  2. If sep = " " (a single space), then <SPACE> is trimmed from the front and back of expr, and sep becomes <SPACE>. mawk defines <SPACE> as the regular expression /[ \t\n]+/. Otherwise sep is treated as a regular expression, except that meta-characters are ignored for a string of length 1, e.g., split(x, A, "*") and split(x, A, /\*/) are the same.
  3. <snip>

source: man mawk

The manual makes no mention of how a regular expression starting with a meta-character (e.g. "*c") should be interpreted.


Note: the gawk manual quote in the GNU awk section names POSIX awk alongside gawk, but that part does not hold. According to POSIX, a regular expression of the form "*, " leads to undefined behaviour. (This is independent of RS, which is not an ERE in POSIX awk anyway.)

The awk utility shall make use of the extended regular expression notation (see XBD Extended Regular Expressions )

source: Awk Posix standard

and

*+?{ The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:

  • If these characters appear first in an ERE , or immediately following an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>
  • If a <left-brace> is not part of a valid interval expression (see EREs Matching Multiple Characters)

source: POSIX Extended Regular Expressions
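Since the <asterisk> is not special inside a bracket expression, writing [*] is one portable way to say "a literal asterisk" at the start of an ERE. The sketch below uses gsub() instead of RS so that it stays within POSIX; the sample input and the "|" replacement are made up for illustration.

```shell
# "[*], *" matches a literal asterisk, a comma, then zero or more spaces.
# Inside a bracket expression the asterisk is an ordinary character.
out=$(printf 'a*,b, c\n' | awk '{ gsub(/[*], */, "|"); print }')
printf '%s\n' "$out"
# prints: a|b, c
```

The ", " before c is left alone because no asterisk precedes it.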

Could you please try the following:

awk '{gsub(", ","<<<---" ORS)} 1;END{print "<<<---"}'   Input_file
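The command above only replaces a comma followed by exactly one space. A possible variant (a sketch, with made-up sample input) passes the regex from the question to gsub(), so any run of spaces around the comma is consumed while RS keeps its default value:

```shell
# Same idea as above, but the separator is the ERE " *, *" instead of the string ", ".
out=$(printf 'Lorem ipsum,dolor ,  sit\n' |
  awk '{ gsub(/ *, */, "<<<---" ORS) } 1; END { print "<<<---" }')
printf '%s\n' "$out"
```

Because RS is left untouched, this does not depend on regex support in RS.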
