Using protected wildcard character in awk field separator doesn't work

I have a file that contains paragraphs separated by lines consisting of one or more * characters. When I use egrep with the regex '^\*+$', it works as intended, displaying only the lines that consist entirely of stars.

However, when I use the same expression in awk -F or awk FS, it doesn't work and just prints out the whole document, excluding the lines of stars.

Commands that I tried so far:

awk -F'^\*+$' '{print $1, $2}' msgs
awk -F'/^\*+$/' '{print $1, $2}' msgs
awk 'BEGIN{ FS="/^\*+$/" } ; { print $1,$2 }' msgs

Printing the first field always prints out the whole document. The first version excludes the lines of stars; the other versions include everything from the file.

Example input:

Par1 test teststsdsfsfdsf
fdsfdsfdsftesyt
fdsfdsfdsf fddsteste345sdfs
***
Par2 dsadawe232343a5edsfe
43s4esfsd s45s45e4t rfgsd45
***
Par3 dsadasd
fasfasf53sdf sfdsf s45 sdfs
dfsf dsf
***
Par4 dasdasda r3ar d afa fs
ds fgdsfgsdfaser ar53d f
***
Par 5 dasdawr3r35a
fsada35awfds46 s46 sdfsds5 34sdf
***

Expected output for print $1:

Par1 test teststsdsfsfdsf fdsfdsfdsftesyt fdsfdsfdsf fddsteste345sdfs

EDIT: Added example input and expected output

Strings used as regexps in awk are parsed twice:

  1. as a string literal, when awk reads the program and processes the escape sequences, and
  2. as a regexp, when the resulting string is used for matching.

So if you want to use a string as a regexp (including any time you assign a Field Separator or Record Separator as a regexp) then you need to double any escapes as each iteration of parsing will consume one of them. See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps for details.

Good (a literal/constant regexp):

$ echo 'a(b)c' | awk '$0 ~ /\(b)/'
a(b)c

Bad (a poorly-written dynamic/computed regexp):

$ echo 'a(b)c' | awk '$0 ~ "\(b)"'
awk: cmd. line:1: warning: escape sequence `\(' treated as plain `('
a(b)c

Good (a well-written dynamic/computed regexp):

$ echo 'a(b)c' | awk '$0 ~ "\\(b)"'
a(b)c

but IMHO if you're having to double escapes to make a char literal then it's clearer to use a bracket expression instead:

$ echo 'a(b)c' | awk '$0 ~ "[(]b)"'
a(b)c
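
Applied to the question's case of a * in the field separator, the same two options can be sketched as follows. The sample line 'a*b*c' and the variable names are made up for illustration; both commands are expected to print the second field, b.

```shell
# Split on a literal * in two ways; the input line is a made-up sample.
line='a*b*c'

# Doubled escape: the string "\\*" reaches the regexp engine as \*
doubled=$(printf '%s\n' "$line" | awk 'BEGIN { FS = "\\*" } { print $2 }')

# Bracket expression: no backslashes to count
bracket=$(printf '%s\n' "$line" | awk -F'[*]' '{ print $2 }')

printf '%s\n' "$doubled" "$bracket"
```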

Also, ^ in a regexp means "start of string", so in an RS it is only matched at the start of all the input, just like $ would only be matched at the end of all the input. ^ does not mean "start of line" as some documents/scripts may lead you to believe. It only appears to mean that in grep and sed because they are line-oriented, so the regexp is usually being compared to 1 line at a time. awk isn't line-oriented, it's record-oriented, so the input being compared to the regexp isn't necessarily just a line (the same is true in sed if you read multiple lines into its hold space).
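
A small way to see this for yourself, as a sketch with made-up input: paragraph mode (RS set to the empty string) turns a three-line block into a single record, and gsub() returns how many matches it replaced. The variable names are only for illustration.

```shell
# The record is "a\nb\na": two lines start with "a", but only one is at the start of the string.
start_only=$(printf 'a\nb\na\n' | awk -v RS='' '{ print gsub(/^a/, "X") }')

# Accepting "start of string OR just after a newline" finds both.
each_line=$(printf 'a\nb\na\n' | awk -v RS='' '{ print gsub(/(^|\n)a/, "X") }')

printf '%s %s\n' "$start_only" "$each_line"
```

The first count should be 1 and the second 2, which is why the separator regexp has to spell out the newlines explicitly.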

So to match a line of *s as a Record Separator (RS), assuming you're using gawk or some other awk that can treat a multi-char RS as a regexp, you'd have to write this regexp:

(^|\n)[*]+(\n|$)

but be aware that it also matches the newlines before the first and after the last * on the target lines, so you need to handle that appropriately in your code.

It seems like this is what you're really trying to do:

$ awk -v RS='(^|\n)[*]+(\n|$)' 'NR==1{$1=$1; print}' file
Par1 test teststsdsfsfdsf fdsfdsfdsftesyt fdsfdsfdsf fddsteste345sdfs
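
If your awk can't treat a multi-char RS as a regexp, a rough alternative is to keep the default line-at-a-time records and collect lines until the first separator line. This is only a sketch with shortened, made-up input and a hypothetical buf variable; unlike $1=$1 it joins lines with a single space but does not squeeze runs of blanks inside a line.

```shell
# Default RS: each record is one line, so ^ and $ do anchor to the line here.
first=$(printf '%s\n' 'Par1 one' 'two' '***' 'Par2 three' '***' |
  awk '
    /^[*]+$/ { print buf; exit }               # first separator line: emit paragraph 1 and stop
    { buf = (buf == "" ? $0 : buf " " $0) }    # otherwise append the line, space-joined
  ')
printf '%s\n' "$first"
```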
