awk matches regex characters it shouldn't

Question

My awk program matches characters in an odd way. Could you please explain what's going on, or point me to the relevant documentation?

Input file

| 29900 | St. James | ...
| 33010 | Boole / Kirk | ...

awk

awk '/\| ([0-9]{5}) \| ([^\|]*)/{print $2 $4}' input-file.txt

Result

29900St.
33010Boole

Why is the first capturing group $1 the leading | ? Usually $0 is the entire match and $1 is the first group.
Why does ([^\|]*) stop at . and / instead of reading on? After all, I am telling it to match "all characters that are not |".

Answer 1

By default, awk separates columns by whitespace, so for the record

| 29900 | St. James | ...

we have $1="|", $2="29900", $3="|", $4="St.", $5="James", $6="|" and $7="..."
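To see this splitting for yourself, here is a minimal sketch that prints every field of a record with its number. The helper name show_fields is made up for illustration; it is not part of awk.

```shell
# Hypothetical helper: print each whitespace-separated field of one record.
show_fields() {
  printf '%s\n' "$1" |
    awk '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

show_fields '| 29900 | St. James | ...'
```

With the default field separator, the pipes are ordinary fields and "St. James" is split in two, which is why $4 in the question only printed "St.".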

Additionally, unlike Perl, awk does not store the contents of capturing parentheses anywhere (gawk does, though: its match() function accepts an optional third array argument that receives the captured groups).
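Even without capture groups, POSIX awk's match() function records where a match starts and how long it is (in RSTART and RLENGTH), so the matched text can be cut out with substr(). A minimal sketch, assuming you only want the first run of digits on the line; the helper name first_number is made up for illustration:

```shell
# Hypothetical helper: print the first run of digits found in the record.
first_number() {
  printf '%s\n' "$1" | awk '
    # match() returns the start position (non-zero on success)
    # and sets RSTART and RLENGTH for use with substr().
    match($0, /[0-9]+/) { print substr($0, RSTART, RLENGTH) }
  '
}

first_number '| 29900 | St. James | ...'
```

This only gives you the whole match, not individual groups, so for column-shaped input like yours, changing the field separator is usually the simpler route.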

Seeing as you want to use pipes as separators, I'd suggest:

awk -F '[[:blank:]]*\\|[[:blank:]]*' -v OFS=, '$2 ~ /[0-9]{5}/ {print $2,$3}' input-file.txt

29900,St. James
33010,Boole / Kirk

If you're confused about seeing $2 and $3 in there instead of $1 and $2, consider that a field separator, by definition, separates two fields and must have a field before it and after it. The first field separator shows up at the beginning of each line, therefore there must be a field consisting of an empty string before it: $1 will be the empty string.
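The empty first field is easy to check. The sketch below splits a record on pipes with optional surrounding spaces and prints each field in brackets. It uses a simplified separator (spaces only, pipe written as [|]), and the helper name split_on_pipes is made up for illustration:

```shell
# Hypothetical helper: split one record on " | " style separators
# and print every field, including empty ones.
split_on_pipes() {
  printf '%s\n' "$1" |
    awk -F ' *[|] *' '{ for (i = 1; i <= NF; i++) printf "$%d=[%s]\n", i, $i }'
}

split_on_pipes '| 29900 | St. James | ...'
```

The first line of output shows $1 as an empty pair of brackets, and "St. James" stays together as $3.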

Answer 2

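awk doesn't provide a way to access capture groups; it uses $<number> to access the fields of each input record. It looks like you could do:

awk -F' *[|] *' '{print $2 $3;}' input-file.txt

The pipe is written as [|] rather than \| because the -F value goes through string escape processing before it is used as a regex, and some awk implementations (gawk among them) strip the backslash, leaving a bare | that the regex treats as alternation.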

2 answers

solution1
3 ACCEPTED 2013-09-20 22:57:10

solution2
2 2013-09-20 22:57:17
