Parse CSV with empty fields, escaped quotes and commas with awk

I have been happily using gawk with FPAT. Here's the script I use for my examples:

#!/usr/bin/gawk -f

BEGIN {
    FPAT="([^,]*)|(\"[^\"]+\")"
}

{
    for (i=1; i<=NF; i++) {
        printf "Record #%s, field #%s: %s\n", NR, i, $i
    }
}

Simple, no quotes

Works well.

$ echo 'a,b,c,d' | ./test.awk 
Record #1, field #1: a
Record #1, field #2: b
Record #1, field #3: c
Record #1, field #4: d

With quotes

Works well.

$ echo '"a","b",c,d' | ./test.awk 
Record #1, field #1: "a"
Record #1, field #2: "b"
Record #1, field #3: c
Record #1, field #4: d

With empty columns and quotes

Works well.

$ echo '"a","b",,d' | ./test.awk 
Record #1, field #1: "a"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

With escaped quotes, empty columns and quotes

Works well.

$ echo '"""a"": aaa","b",,d' | ./test.awk 
Record #1, field #1: """a"": aaa"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

With a column containing escaped quotes and ending with a comma

Fails.

$ echo '"""a"": aaa,","b",,d' | ./test.awk 
Record #1, field #1: """a"": aaa
Record #1, field #2: ","
Record #1, field #3: b"
Record #1, field #4: 
Record #1, field #5: d

Expected output:

$ echo '"""a"": aaa,","b",,d' | ./test_that_would_be_working.awk 
Record #1, field #1: """a"": aaa,"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

Is there a regex for FPAT that would make this work, or is this just not supported by awk?

The pattern would be a " followed by anything but a single " . A bracket expression works one character at a time, so it cannot reject a single " while still accepting an escaped "" .

I think there may be an option with lookaround, but I'm not good enough with it to make it work.

Because awk's FPAT doesn't support lookarounds, you need to be explicit in your pattern. This one will do:

FPAT="[^,\"]*|\"([^\"]|\"\")*\""

Explanation:

[^,\"]*             # match 0 or more characters other than , and "
|                   # OR
\"                  # match an opening '"'
  ([^\"]            #   followed by 0 or more of: any character but '"'
   |                #   OR
   \"\"             #   an escaped quote '""'
  )*
\"                  # ending with a closing '"'

Now testing it:

$ cat tst.awk
BEGIN {
    FPAT="[^,\"]*|\"([^\"]|\"\")*\""
}
{ 
   for (i=1; i<=NF; i++){ printf "Record #%s, field #%s: %s\n", NR, i, $i }
}


$ echo '"""a"": aaa,","b",,d' | awk -f tst.awk
Record #1, field #1: """a"": aaa,"
Record #1, field #2: "b"
Record #1, field #3:
Record #1, field #4: d
