I need to find all the rows which have the same content in all the columns, by examining all CSV files in the system. Example:
MYCOL;1;2;3;4
MYCOL2;2;3;4;5
MYCOL3;1;1;1;1
MYCOL4;;;;
In my example, I would need to grep for MYCOL3 and MYCOL4, as they both have the same field content for all columns, it doesn't matter if there is no content.
I thought about something like this:
find / -name *.csv | xargs awk -F "," '{col[$1,$2]++} END {for(i in col) print i, col[i]}'
But I'm missing the comparison between all the columns.
You could use grep
command:
$ grep -xE '[^;]*(;[^;]*)\1+' ip.txt
MYCOL3;1;1;1;1
MYCOL4;;;;
-x
to match only whole line [^;]*
first field (;[^;]*)
capture ;
followed by non ;
characters (ie second field)\1+
use the captured field to be repeated as many times as needed till end of the line If input has only ASCII characters, you could use LC_ALL=C grep <...>
to get the results faster.
If you have GNU grep
, you can use -r
option and --include=
option instead of find+grep
Also, use find <...> -exec grep <...> {} +
instead of find + xargs
Just did a sample speed check, this regex may be too bad to use with BRE/ERE. Use grep -P
if available. Otherwise, use awk
or perl
.
$ perl -0777 -ne 'print $_ x 1000000' ip.txt | shuf > f1
$ du -h f1
53M f1
$ time LC_ALL=C grep -xE '[^;]*(;[^;]*)\1+' f1 > t1
real 0m44.815s
$ time LC_ALL=C grep -xP '[^;]*(;[^;]*)\1+' f1 > t2
real 0m0.507s
$ time perl -ne 'print if /^[^;]*(;[^;]*)\1+$/' f1 > t3
real 0m3.973s
$ time LC_ALL=C awk -F ';' '{for (i=3; i<=NF; i++) if ($i != $2) next} 1' f1 > t4
real 0m2.728s
$ diff -sq t1 t2
Files t1 and t2 are identical
$ diff -sq t1 t3
Files t1 and t3 are identical
$ diff -sq t1 t4
Files t1 and t4 are identical
A non-regex approach using awk
:
awk -F ';' '{for (i=3; i<=NF; i++) if ($i != $2) next} 1' file
MYCOL3;1;1;1;1
MYCOL4;;;;
Another awk
$ awk -F";" -v OFS=";" ' { a=$0; $1=""; c=split($0,ar,$2); if(length($0)==NF-1 || c==NF) print a } ' gipsy.txt
MYCOL3;1;1;1;1
MYCOL4;;;;
$
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.