
awk matches regex characters it shouldn't

My awk program does some odd character matching. Could you please explain what's going on, or point me to the relevant documentation?

Input file

| 29900 | St. James | ...
| 33010 | Boole / Kirk | ...

awk

awk '/\| ([0-9]{5}) \| ([^\|]*)/{print $2 $4}' input-file.txt

Result

29900St.
33010Boole
  • Why is the first capturing group $1 the leading | ? Usually $0 is the entire match and $1 is the first group.
  • Why does ([^\|]*) stop at . and / instead of reading on? After all, I am telling it to match "all characters that are not |".

By default, awk separates columns by whitespace, so for the record

| 29900 | St. James | ...

we have $1="|", $2="29900", $3="|", $4="St.", $5="James", $6="|" and $7="..."
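A quick way to check this split is to have awk print every field it sees for that record. This is only a minimal sketch; the shell variable names `record` and `default_fields` are made up for illustration.

```shell
# Sample record copied from the question's input file.
record='| 29900 | St. James | ...'

# With the default field separator, awk splits on runs of whitespace.
# The loop prints each field in the same $N="value" notation used above.
default_fields=$(printf '%s\n' "$record" |
    awk '{ for (i = 1; i <= NF; i++) printf "$%d=\"%s\"\n", i, $i }')

printf '%s\n' "$default_fields"
```

The pipes show up as fields of their own, which is why $2 and $4 in the original program print 29900 and St.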

Additionally, unlike Perl, awk does not store the contents of capturing parentheses anywhere (gawk does, though, through the optional array argument of its match() function).
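For illustration, here is a sketch of how the gawk extension could be used on the sample input. It assumes GNU awk 4.0 or later is installed as `gawk`; the three-argument form of match() is not available in other awks. The regex is adapted from the question, with a trailing `[^| ]` added so the second group does not end in a space, and the variable name `captured` is hypothetical.

```shell
# Only attempt this when GNU awk is present; match() with an array
# argument is a gawk extension, not POSIX awk.
if command -v gawk >/dev/null 2>&1; then
    captured=$(printf '%s\n' '| 29900 | St. James | ...' '| 33010 | Boole / Kirk | ...' |
        gawk 'match($0, /\| ([0-9]{5}) \| ([^|]*[^| ])/, m) {
                  # m[1] and m[2] hold the text of the two parenthesized groups
                  print m[1] "," m[2]
              }')
    printf '%s\n' "$captured"
else
    echo 'gawk not found; skipping the capture-group example' >&2
fi
```

Here the regex really does read past . and /, because the groups are taken from the match itself rather than from whitespace-separated fields.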

Seeing as you want to use pipes as separators, I'd suggest:

awk -F '[[:blank:]]*\\|[[:blank:]]*' -v OFS=, '$2 ~ /[0-9]{5}/ {print $2,$3}' input-file.txt
29900,St. James
33010,Boole / Kirk

If you're confused about seeing $2 and $3 in there instead of $1 and $2, consider that a field separator, by definition, separates two fields and must have a field before it and after it. The first field separator shows up at the beginning of each line, therefore there must be a field consisting of an empty string before it: $1 will be the empty string.
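The empty first field can be made visible by printing each field in brackets. This sketch uses a simpler separator, ' *[|] *' (a pipe with optional surrounding spaces), as an assumption to keep it portable across awk implementations; the variable name `pipe_fields` is made up.

```shell
# The record starts with a separator, so awk creates an empty field before it.
pipe_fields=$(printf '%s\n' '| 29900 | St. James | ...' |
    awk -F ' *[|] *' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }')

printf '%s\n' "$pipe_fields"
```

The first line of output is $1=[], followed by 29900 in $2 and St. James in $3.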

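awk doesn't provide a way to access capture groups; it uses $<number> to access the fields of each input record. It looks like you could do:

awk -F' *\\| *' '{print $2 $3;}' input-file.txt

The backslash is doubled because the -F value is processed for escape sequences before it is used as a regular expression. In gawk at least, a single backslash is consumed at that stage, leaving a bare | that acts as the alternation operator instead of a literal pipe.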
