I'm learning the awk programming language and I'm stuck on a problem.
I have a file (awk.dat) with the following content:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Maecenas pellentesque erat vel tortor consectetur condimentum.
Nunc enim orci, euismod id nisi eget, interdum cursus ex.
Curabitur a dapibus tellus.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Aliquam interdum mauris volutpat nisl placerat, et facilisis.
I'm using the command below:
awk 'BEGIN{RS="*, *";ORS="<<<---\n"} {print $0}' awk.dat
It returns this error:
awk: run time error: regular expression compile failed (missing operand)
*, *
FILENAME="" FNR=0 NR=0
However, if I use this command instead:
awk 'BEGIN{RS=" *, *";ORS="<<<---\n"} {print $0}' awk.dat
it gives me the required result.
I need to understand this part: RS=" *, *", specifically the meaning of the space between the opening double quote and the * before the comma, whose absence causes the error.
Expected Output:
Lorem ipsum dolor sit amet<<<---
consectetur adipiscing elit.
Maecenas pellentesque erat vel tortor consectetur condimentum.
Nunc enim orci<<<---
euismod id nisi eget<<<---
interdum cursus ex.
Curabitur a dapibus tellus.
Lorem ipsum dolor sit amet<<<---
consectetur adipiscing elit.
Aliquam interdum mauris volutpat nisl placerat<<<---
et facilisis.
<<<---
Thanks.
"[space1]*,[space2]*"
is a regex. It matches a string consisting of zero or more spaces (space1), followed by a comma, followed by zero or more spaces (space2).
The first one, "*,[space]*", is wrong because * has a special meaning in regex: it repeats the preceding character or group zero or more times. With nothing before it to repeat, it cannot appear at the very beginning.
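As a small sketch of what the working pattern matches, the following uses awk's standard match() function to print each substring that " *, *" matches in a sample line. The function name show_matches and the sample input are made up for illustration.

```shell
# Hypothetical helper: print every substring matched by / *, */, one per line.
show_matches() {
  awk '{
    s = $0
    # match() sets RSTART and RLENGTH for the leftmost, longest match.
    while (match(s, / *, */)) {
      printf "[%s]\n", substr(s, RSTART, RLENGTH)
      s = substr(s, RSTART + RLENGTH)   # continue after this match
    }
  }'
}

printf 'a,b ,c , d\n' | show_matches
```

Each bracketed line is one separator: a bare comma, a comma with a leading space, and a comma with a space on both sides.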
Be aware that, according to POSIX, RS is defined as a single character and not a regular expression.
The first character of the string value of RS shall be the input record separator; a <newline> by default. If RS contains more than one character, the results are unspecified. If RS is null, then records are separated by sequences consisting of a <newline> plus one or more blank lines, leading or trailing blank lines shall not result in empty records at the beginning or end of the input, and a <newline> shall always be a field separator, no matter what the value of FS is.
source: Awk Posix standard
This implies that RS=" *, *" leads to undefined behaviour under POSIX.
Other versions of awk that implement extensions to POSIX may treat RS differently. Examples are GNU awk and mawk. Both implement RS as a regular expression, but the two implementations differ slightly. The summary with respect to the usage of <asterisk> is:
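If portability matters, one way to stay within what POSIX specifies is to use a single-character RS and trim the surrounding spaces in the action. This is a sketch under that assumption, not the approach used in the question; the function name split_on_comma is hypothetical.

```shell
# Hypothetical helper: split input into records on "," (a single-character RS,
# which POSIX defines), then strip the spaces around each record.
split_on_comma() {
  awk 'BEGIN { RS = ","; ORS = "<<<---\n" }
  {
    sub(/^ +/, "")   # drop leading spaces left over from ", "
    sub(/ +$/, "")   # drop trailing spaces left over from " ,"
    print
  }'
}

printf 'a , b,c\n' | split_on_comma
```

Because the newline is no longer the record separator, the final record keeps its trailing newline, so the last "<<<---" lands on a line of its own, as in the expected output of the question.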
| RS | awk (posix) | gawk | mawk |
|------+--------------+------------------+------------------|
| "*" | "<asterisk>" | "<asterisk>" | "<asterisk>" |
| "*c" | undefined | "<asterisk>c" | undefined |
| "c*" | undefined | "","c","ccc",... | "","c","ccc",... |
c is any character
The above should explain the OP's error: RS="*, *" is an invalid regular expression according to mawk.
$ echo "abc" | ./mawk '/*c/'
mawk: line 1: regular expression compile failed (missing operand)
GNU awk: The manual of GNU awk states the following:
When using gawk, the value of RS is not limited to a one-character string. It can be any regular expression (see Regexp). (ce) In general, each record ends at the next string that matches the regular expression; the next record starts at the end of the matching string.
source: GNU awk manual
To understand the usage of <asterisk> in regular expressions in GNU awk, we find:
*
This symbol means that the preceding regular expression should be repeated as many times as necessary to find a match. For example, ph* applies the * symbol to the preceding h and looks for matches of one p followed by any number of hs. This also matches just p if no hs are present.
There are two subtle points to understand about how * works. First, the * applies only to the single preceding regular expression component (e.g., in ph*, it applies just to the h). To cause * to apply to a larger subexpression, use parentheses: (ph)* matches ph, phph, phphph, and so on.
Second, * finds as many repetitions as possible. If the text to be matched is phhhhhhhhhhhhhhooey, ph* matches all of the hs.
source: GNU Regular expression operators
It must be mentioned, however, that:
In ~~POSIX awk and~~ gawk, the *, + and ? operators stand for themselves when there is nothing in the regexp that precedes them. For example, /+/ matches a literal plus sign. However, many other versions of awk treat such a usage as a syntax error.
source: GNU Regular expression operators
Thus, setting RS="*, *" implies that it would match the strings "*,", "*, ", "*,  ", ...
$ echo "a,b, c" | awk 'BEGIN{RS="*, *"}1'
a,b, c
$ echo "a*,b, c" | awk 'BEGIN{RS="*, *"}1'
a
b, c
mawk: The manual of mawk states the following:
12. Multi-line records
Since mawk interprets RS as a regular expression, multi-line records are easy.
source: man mawk
but
11. Splitting strings, records and files
Awk programs use the same algorithm to split strings into arrays with split(), and records into fields on FS. mawk uses essentially the same algorithm to split files into records on RS.
Split(expr,A,sep) works as follows:
- <snip>
- If sep = " " (a single space), then <SPACE> is trimmed from the front and back of expr, and sep becomes <SPACE>. mawk defines <SPACE> as the regular expression /[ \t\n]+/. Otherwise sep is treated as a regular expression, except that meta-characters are ignored for a string of length 1, e.g., split(x, A, "*") and split(x, A, /\*/) are the same.
- <snip>
source: man mawk
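The single-character exception can be seen with split(). In this sketch (the function name count_fields is made up for illustration), the separator "*" has length 1, so it is taken literally rather than as a regex operator.

```shell
# Hypothetical helper: split the argument on a literal "*" and print the
# number of pieces followed by the pieces themselves.
count_fields() {
  awk -v s="$1" 'BEGIN {
    n = split(s, A, "*")      # one-character separator: no regex meaning
    out = n
    for (i = 1; i <= n; i++) out = out " " A[i]
    print out
  }'
}

count_fields 'a*b*c'
```

For the sample argument this reports three pieces, a, b and c.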
The manual makes no mention of how a regular expression starting with a meta-character should be interpreted (e.g. "*c").
Note: in the quotation in the GNU awk section, I struck through "POSIX awk" because, according to POSIX, a regular expression of the form "*, " leads to undefined behaviour. (This is independent of how RS is defined, as RS is not an ERE in POSIX awk anyway.)
The awk utility shall make use of the extended regular expression notation (see XBD Extended Regular Expressions)
source: Awk Posix standard
and
*+?{
The <asterisk>, <plus-sign>, <question-mark>, and <left-brace> shall be special except when used in a bracket expression (see RE Bracket Expression). Any of the following uses produce undefined results:
- If these characters appear first in an ERE, or immediately following an unescaped <vertical-line>, <circumflex>, <dollar-sign>, or <left-parenthesis>
- If a <left-brace> is not part of a valid interval expression (see EREs Matching Multiple Characters)
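Since these characters are not special inside a bracket expression, one way to match a literal asterisk without relying on implementation-specific handling of a leading * is to write it as [*]. The sketch below uses gsub() with that pattern; the function name mark_star_comma and the "|" replacement are chosen only for illustration.

```shell
# Hypothetical helper: replace a literal "*," plus any following spaces with "|".
mark_star_comma() {
  awk '{
    gsub(/[*], */, "|")   # [*] is a bracket expression: a literal asterisk
    print
  }'
}

printf 'a*,b, c\n' | mark_star_comma
```

The plain ", " after b is left alone because it is not preceded by an asterisk.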
Could you please try the following:
awk '{gsub(", ","<<<---" ORS)} 1;END{print "<<<---"}' Input_file
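A variant of the command above, sketched here as an assumption rather than as part of the original answer, uses the pattern / *, */ so that any number of spaces around the comma is absorbed, as in the RS from the question. The function name mark_commas is hypothetical.

```shell
# Hypothetical helper: end each comma-separated piece with "<<<---" and a
# newline, then print one final "<<<---" after the last line of input.
mark_commas() {
  awk '{ gsub(/ *, */, "<<<---" ORS) } 1
       END { print "<<<---" }'
}

printf 'Nunc enim orci, euismod id nisi eget, interdum cursus ex.\n' | mark_commas
```

Unlike the RS-based commands, this keeps the default newline record separator, so it does not depend on RS being treated as a regular expression.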