How to separate csv columns by awk, with a comma being the field separator?

My regex field separator, passed to awk on the command line with -F, doesn't work on my CSV file.

My CSV is separated by commas ( , ), but some fields contain commas themselves.

data.csv looks like this:

t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
field without comma,f22,f23,f34

Looking at the line field, with comma,f12,f13,f14 , there are two kinds of commas:

  1. a comma that is part of the data (inside the field), as in field, with comma ; and
  2. a comma that separates fields, as in ,f12,f13,f14 .

So I tried awk with -F and a regex:

awk -F'/\B\,/\B/' '!seen[$2]++' data.csv > resulted.csv

My strategy was: the field separator needs to be a comma \, at a non-word boundary \B .

But my command didn't produce resulted.csv ; it only printed warnings:

gawk: warning: escape sequence `\B' treated as plain `B'
gawk: warning: escape sequence `\,' treated as plain `,'

The desired resulted.csv has the repeated lines removed, like this:

t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24

Without GNU awk, with your data, you can use gsub to replace the ", " string with some non-conflicting characters such as "__" , separate the fields as normal on "," , and then restore the comma within the field (e.g. ", " ) using gsub again. For example:

 awk -F, -v OFS=, '
    { gsub(/, /,"__"); for (i = 1; i <= NF; i++) gsub(/__/,", ", $i) }
    !seen[$0]++
' file.csv

Above, gsub(/, /,"__") replaces all occurrences of ", " with two underscores in the input record, which makes awk re-split the record on "," . Then, looping over each field, any "__" is replaced with ", " , restoring the original comma in the field.
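To make the intermediate step visible, here is a minimal sketch that prints the field count and the record right after the first gsub, before the commas are restored. It uses one sample line from the question and assumes that "__" never occurs in the real data (if it might, a different placeholder would be needed).

```shell
# Sample line from the question with a comma inside the first field.
line='field, with comma,f12,f13,f14'

# gsub on $0 makes awk re-split the record, so NF drops to 4.
masked=$(printf '%s\n' "$line" | awk -F, '{ gsub(/, /, "__"); print NF ": " $0 }')
printf '%s\n' "$masked"

# Without the substitution, plain comma splitting sees 5 fields.
plain=$(printf '%s\n' "$line" | awk -F, '{ print NF }')
printf '%s\n' "$plain"
```

The first command prints 4: field__with comma,f12,f13,f14 and the second prints 5, which is why the placeholder is needed before splitting on the comma.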

Example Use/Output

Given your data, the above results in:

$ awk -F, -v OFS=, '
>     { gsub(/, /,"__"); for (i = 1; i <= NF; i++) gsub(/__/,", ", $i) }
>     !seen[$0]++
> ' file.csv
t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24

With GNU awk, using a comma that is not followed by a space as the field separator:

awk -F ',[^ ]' '!seen[$2]++' data.csv

Output:

t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
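One detail worth checking with this separator: the regex ',[^ ]' matches the comma plus the character after it, so that character is consumed as part of the separator. The sketch below, using two sample lines from the question, prints what $2 actually holds.

```shell
# Print the second field under the ',[^ ]' separator for two sample lines.
keys=$(printf '%s\n' \
  'field without comma,f02,f03,f04' \
  'field, with comma,f12,f13,f14' |
  awk -F ',[^ ]' '{ print $2 }')
printf '%s\n' "$keys"
```

This prints 02 and 12 rather than f02 and f12. The deduplication still works for this data because every t2 value starts with the same letter, and the printed record ($0) is untouched, but two values that differ only in their first character would be treated as the same key.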

If the intent is to use the t2 column as a key value then this is how you'd do it:

$ awk -F, '!seen[$(NF-2)]++' data.csv
t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
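The reason $(NF-2) works here is that it counts from the end of the record, so extra commas inside the first column change NF but not the position of t2 relative to the last field. A small sketch, assuming (as in the sample data) that only the first column can contain commas:

```shell
# Show the field count and the third-from-last field for two sample lines.
keys=$(printf '%s\n' \
  'field without comma,f02,f03,f04' \
  'field, with comma,f12,f13,f14' |
  awk -F, '{ print NF, $(NF-2) }')
printf '%s\n' "$keys"
```

This prints 4 f02 and 5 f12: the field count differs, but $(NF-2) picks the t2 value in both cases.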

If it's to use the t1 column as the key instead then this is how you'd do that:

$ awk '{key=$0; sub(/(,[^,]+){3}$/,"",key)} !seen[key]++' data.csv
t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
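To see the key this builds, the sketch below prints it for two sample lines. It writes the three repetitions out by hand instead of using {3}, on the assumption that some older awk builds lack interval expressions; the match is otherwise equivalent.

```shell
# Strip the last three comma-separated fields; what remains is the t1 key.
keys=$(printf '%s\n' \
  'field without comma,f02,f03,f04' \
  'field, with comma,f12,f13,f14' |
  awk '{ key = $0; sub(/,[^,]+,[^,]+,[^,]+$/, "", key); print key }')
printf '%s\n' "$keys"
```

This prints field without comma and field, with comma, so the embedded comma stays part of the key.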

If it's something else then please clarify your question and update the example.
