Parse CSV with empty fields, escaped quotes and commas with awk

I have been happily using gawk with FPAT. Here is the script I use for my examples:

#!/usr/bin/gawk -f

BEGIN {
    FPAT="([^,]*)|(\"[^\"]+\")"
}

{
    for (i=1; i<=NF; i++) {
        printf "Record #%s, field #%s: %s\n", NR, i, $i
    }
}

Simple, no quotes

Works well.

$ echo 'a,b,c,d' | ./test.awk 
Record #1, field #1: a
Record #1, field #2: b
Record #1, field #3: c
Record #1, field #4: d

With quotes

Works well.

$ echo '"a","b",c,d' | ./test.awk 
Record #1, field #1: "a"
Record #1, field #2: "b"
Record #1, field #3: c
Record #1, field #4: d

With empty columns and quotes

Works well.

$ echo '"a","b",,d' | ./test.awk 
Record #1, field #1: "a"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

With escaped quotes, empty columns and quotes

Works well.

$ echo '"""a"": aaa","b",,d' | ./test.awk 
Record #1, field #1: """a"": aaa"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

With a column containing escaped quotes and ending with a comma

Fails.

$ echo '"""a"": aaa,","b",,d' | ./test.awk 
Record #1, field #1: """a"": aaa
Record #1, field #2: ","
Record #1, field #3: b"
Record #1, field #4: 
Record #1, field #5: d

Expected output:

$ echo '"""a"": aaa,","b",,d' | ./test_that_would_be_working.awk 
Record #1, field #1: """a"": aaa,"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

Is there a regex for FPAT that would make this work, or is this just not supported by awk?

The pattern I need is a " followed by anything except a lone ". A bracket expression matches one character at a time, so it cannot reject a single " while still accepting a doubled "".

I think there may be an option with lookaround, but I'm not good enough with it to make it work.

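gawk's regular expressions do not support lookarounds, so the pattern has to spell out the escaped quote explicitly. (FPAT itself is a gawk extension, so this needs gawk rather than a plain POSIX awk.) This one will do: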

FPAT="[^,\"]*|\"([^\"]|\"\")*\""

Explanation:

[^,\"]*             # unquoted field: zero or more characters other than , and "
|                   # OR
\"                  # quoted field: an opening "
  ([^\"]            #   then any character other than "
   |                #   OR
   \"\"             #   an escaped quote, written ""
  )*                #   repeated zero or more times
\"                  # then the closing "
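
To see what the two alternatives accept, the pattern can be tried against single strings before it goes into FPAT. The following is only a sketch: is_csv_field is a hypothetical helper name, not part of the answer, and it anchors the same pattern with ^ and $ so that the whole string has to be one field. It uses plain awk because no FPAT is involved here.

```shell
# Hypothetical helper: exit status 0 if the whole argument is a single
# field under the pattern above, non-zero otherwise.
# Assumes the argument contains no newline.
is_csv_field() {
    printf '%s\n' "$1" | awk '
        # Same two alternatives as the FPAT, anchored to the full line.
        { ok = ($0 ~ /^([^,"]*|"([^"]|"")*")$/) }
        END { exit !ok }'
}
```

With this helper, is_csv_field '"""a"": aaa,"' succeeds, because every " inside the outer quotes is doubled, while is_csv_field '"a"b"' fails, because the lone " after a closes the field early and leaves b" unmatched.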

Now testing it:

$ cat tst.awk
BEGIN {
    FPAT="[^,\"]*|\"([^\"]|\"\")*\""
}
{ 
   for (i=1; i<=NF; i++){ printf "Record #%s, field #%s: %s\n", NR, i, $i }
}


$ echo '"""a"": aaa,","b",,d' | awk -f tst.awk
Record #1, field #1: """a"": aaa,"
Record #1, field #2: "b"
Record #1, field #3:
Record #1, field #4: d
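
The fields still carry their surrounding quotes and doubled "" escapes. If the unescaped values are wanted, one possible follow-up step is to strip the outer quotes and collapse "" into " after splitting. This is a sketch under assumptions: unquote_csv is a hypothetical wrapper name, it needs gawk 4.0 or later for FPAT, and like the scripts above it does not handle fields that contain newlines.

```shell
# Hypothetical wrapper around the FPAT from the answer.
# Assumes gawk 4.0+ (FPAT is a gawk extension).
unquote_csv() {
    gawk '
    BEGIN { FPAT = "[^,\"]*|\"([^\"]|\"\")*\"" }
    {
        for (i = 1; i <= NF; i++) {
            field = $i
            if (field ~ /^".*"$/) {
                field = substr(field, 2, length(field) - 2)  # drop the outer quotes
                gsub(/""/, "\"", field)                      # "" becomes "
            }
            printf "Record #%s, field #%s: %s\n", NR, i, field
        }
    }'
}
```

$ echo '"""a"": aaa,","b",,d' | unquote_csv
Record #1, field #1: "a": aaa,
Record #1, field #2: b
Record #1, field #3: 
Record #1, field #4: d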
