My regex didn't work as the awk command-line field separator for a CSV file.
My CSV is separated by commas (,), but some fields contain commas themselves. My data.csv looks like:
t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
field without comma,f22,f23,f34
Looking at the line field, with comma,f12,f13,f14, there are two kinds of commas: the one inside field, with comma, and the separators in ,f12,f13,f14. So I tried awk with -F and a regex:
awk -F'/\B\,/\B/' '!seen[$2]++' data.csv > resulted.csv
My strategy was: the field separator needs to be a comma (\,) at a non-word boundary (\B).
But my command didn't output resulted.csv; instead it printed warnings:
gawk: warning: escape sequence `\B' treated as plain `B'
gawk: warning: escape sequence `\,' treated as plain `,'
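The warnings explain the failure: awk uses POSIX extended regular expressions, which have no \B (that is a PCRE feature), and , needs no escaping, so the separator collapses to the literal regex /B,/B/, which never matches. A sketch showing the consequence, every line becomes a single field and $2 is empty:

```shell
# FS degrades to the literal pattern /B,/B/, which matches nothing,
# so NF is 1 and $2 is empty on every line.
printf 'field, with comma,f12,f13,f14\n' |
awk -F'/\B\,/\B/' '{ print NF, "[" $2 "]" }' 2>/dev/null
```

Because $2 is empty for every record, !seen[$2]++ keeps only the first line of the file.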
The desired result.csv should have the repeated lines removed, like:
t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
Without GNU awk, with your data, you can use gsub to replace the ", " string with some non-conflicting characters such as "__", separate the fields as normal on ",", and then restore the comma within the field (e.g. ", ") using gsub again. For example:
awk -F, -v OFS=, '
{ gsub(/, /,"__"); for (i = 1; i <= NF; i++) gsub(/__/,", ", $i) }
!seen[$0]++
' file.csv
Above, gsub(/, /,"__") replaces all occurrences of ", " with two underscores in the input record. Then, looping over each field, any "__" is replaced with ", ", restoring the original comma in the field.
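To see the round trip in isolation, here is a minimal sketch on a single line (note the "__" placeholder is only safe if the data itself can never contain two consecutive underscores):

```shell
# Protect in-field commas, split on plain commas, then restore
# the placeholder inside each individual field.
printf 'field, with comma,f12,f13,f14\n' |
awk -F, -v OFS=, '{
  gsub(/, /, "__")                 # "field__with comma,f12,f13,f14"
  for (i = 1; i <= NF; i++)
    gsub(/__/, ", ", $i)           # restore per field
  print "NF=" NF " | $1=" $1       # 4 fields; $1 keeps its comma
}'
```

This prints NF=4 | $1=field, with comma, confirming the embedded comma no longer affects the field count.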
Example Use/Output
Given your data, the above results in:
$ awk -F, -v OFS=, '
> { gsub(/, /,"__"); for (i = 1; i <= NF; i++) gsub(/__/,", ", $i) }
> !seen[$0]++
> ' file.csv
t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
With GNU awk:
awk -F ',[^ ]' '!seen[$2]++' data.csv
Output:
t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
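One caveat of -F ',[^ ]' is that the separator regex consumes the character after the comma, so $2 is "02", "12", etc. rather than the full field. An alternative that keeps fields intact is gawk's FPAT, which describes what a field looks like instead of what separates fields. A sketch, assuming (as in your sample) that every in-field comma is followed by a space:

```shell
# gawk only: a field is a run of non-commas and/or ", " sequences,
# so "field, with comma" parses as one field and $2 is "f12".
gawk -v FPAT='([^,]|, )+' '!seen[$2]++' data.csv
```

With your data this produces the same four deduplicated lines while leaving $2 as the real t2 value.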
If the intent is to use the t2
column as a key value then this is how you'd do it:
$ awk -F, '!seen[$(NF-2)]++' data.csv
t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
field without comma,f22,f23,f24
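Counting from the end is what makes $(NF-2) work: the embedded comma shifts field numbers only from the left, so the third-from-last field is always the t2 column. A quick sketch:

```shell
# The broken line splits into 5 fields instead of 4, but
# $(NF-2) still lands on the t2 column either way.
printf 'field, with comma,f12,f13,f14\n' |
awk -F, '{ print NF, $(NF-2) }'
```

This prints 5 f12, whereas a line without an embedded comma would print 4 followed by its second field.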
If it's to use the t1
column as the key instead then this is how you'd do that:
$ awk '{key=$0; sub(/(,[^,]+){3}$/,"",key)} !seen[key]++' data.csv
t1,t2,t3,t4
field without comma,f02,f03,f04
field, with comma,f12,f13,f14
If it's something else then please clarify your question and update the example.