简体   繁体   中英

bash: grep row if fields are equal

I need to find all the rows which have the same content in all the columns, by examining all CSV files in the system. Example:

MYCOL;1;2;3;4
MYCOL2;2;3;4;5
MYCOL3;1;1;1;1
MYCOL4;;;;

In my example, I would need to grep for MYCOL3 and MYCOL4, as they both have the same field content for all columns, it doesn't matter if there is no content.

I thought about something like this:

find / -name *.csv | xargs awk -F "," '{col[$1,$2]++} END {for(i in col) print i, col[i]}'

But I'm missing the comparison between all the columns.

You could use grep command:

$ grep -xE '[^;]*(;[^;]*)\1+' ip.txt
MYCOL3;1;1;1;1
MYCOL4;;;;
  • -x to match only whole line
  • [^;]* first field
  • (;[^;]*) capture ; followed by non ; characters (ie second field)
  • \1+ use the captured field to be repeated as many times as needed till end of the line

If input has only ASCII characters, you could use LC_ALL=C grep <...> to get the results faster.

If you have GNU grep , you can use -r option and --include= option instead of find+grep

Also, use find <...> -exec grep <...> {} + instead of find + xargs


Just did a sample speed check, this regex may be too bad to use with BRE/ERE. Use grep -P if available. Otherwise, use awk or perl .

$ perl -0777 -ne 'print $_ x 1000000' ip.txt | shuf > f1
$ du -h f1
53M    f1

$ time LC_ALL=C grep -xE '[^;]*(;[^;]*)\1+' f1 > t1
real    0m44.815s

$ time LC_ALL=C grep -xP '[^;]*(;[^;]*)\1+' f1 > t2
real    0m0.507s

$ time perl -ne 'print if /^[^;]*(;[^;]*)\1+$/' f1 > t3
real    0m3.973s

$ time LC_ALL=C awk -F ';' '{for (i=3; i<=NF; i++) if ($i != $2) next} 1' f1 > t4
real    0m2.728s

$ diff -sq t1 t2
Files t1 and t2 are identical
$ diff -sq t1 t3
Files t1 and t3 are identical
$ diff -sq t1 t4
Files t1 and t4 are identical

A non-regex approach using awk :

awk -F ';' '{for (i=3; i<=NF; i++) if ($i != $2) next} 1' file
MYCOL3;1;1;1;1
MYCOL4;;;;

Another awk

$ awk -F";" -v OFS=";" ' { a=$0; $1=""; c=split($0,ar,$2); if(length($0)==NF-1 || c==NF) print a } ' gipsy.txt
MYCOL3;1;1;1;1
MYCOL4;;;;
$

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM