I am trying to remove lines from a tab-delimited table when the value (a string) in column x is repeated, including the first instance, but only if the difference between columns y and z is not the same for all of the repeated lines. I know that this command removes duplicates after the first instance: awk '!seen[$3]++' filename, but I also want to remove the first line where the duplicated value was found. Here is an example of what I am trying to do:
x y z
a 10 20
b 15 25
b 15 30
b 10 20
c 15 20
d 20 30
e 10 20
e 15 25
e 5 15
f 30 40
Would become:
x y z
a 10 20
c 15 20
d 20 30
e 10 20
e 15 25
e 5 15
f 30 40
Here all lines that had "b" in column x were removed since more than one line had "b" in that column AND because the difference between values in column y and z was not always the same for these lines. Lines with "e" in column x stayed because the difference between the values in y and z was always 10.
Any help would be very appreciated!
Note: I am a beginner with awk
awk to the rescue! This is a double-pass algorithm (mark and sweep):
$ awk 'NR==FNR{if($1 in a) {if(a[$1]!=$3-$2) d[$1]}
else a[$1]=$3-$2; next}
!($1 in d)' file{,}
x y z
a 10 20
c 15 20
d 20 30
e 10 20
e 15 25
e 5 15
f 30 40
Explanation
NR==FNR: in the first scan of the file
if($1 in a): if the first field has already been seen
if(a[$1]!=$3-$2): and the delta differs from the one stored earlier
d[$1]: add the field to the delete list
else a[$1]=$3-$2: if the field wasn't seen before, store it with its delta
next: proceed to the next record until all lines are done; after that we're in the second scan
!($1 in d): print every line whose first field is not in the delete list compiled above
file{,}: bash brace-expansion shorthand for writing file file
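For anyone new to awk, the same two-pass logic can be spread over several lines with comments. This is only a sketch: the function name filter_groups and the array names first and drop are made up for illustration, and -F'\t' assumes the table really is tab-delimited, as the question says.

```shell
# filter_groups is a hypothetical helper name; it takes one tab-delimited file.
filter_groups() {
    awk -F'\t' '
        # Pass 1 (NR == FNR only while reading the first copy of the file):
        # remember the first z - y difference seen for each key in column 1,
        # and mark the key if a later line has a different difference.
        NR == FNR {
            delta = $3 - $2
            if ($1 in first) {
                if (first[$1] != delta)
                    drop[$1]        # referencing the element creates it
            } else {
                first[$1] = delta
            }
            next
        }
        # Pass 2: print only the lines whose key was never marked.
        !($1 in drop)
    ' "$1" "$1"
}
```

Called as filter_groups file, it should behave like the one-liner: the file is named twice on the awk command line, so it is read once to collect the differences and once more to print.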
With GNU awk for true multi-dimensional arrays:
$ awk 'NR==FNR{a[$1][$3-$2]; next} length(a[$1])==1' file file
x y z
a 10 20
c 15 20
d 20 30
e 10 20
e 15 25
e 5 15
f 30 40
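If GNU awk is not available, roughly the same idea can be written in POSIX awk by using a composite subscript in place of an array of arrays. A sketch, where filter_groups_portable, seen and ndiff are hypothetical names and the input is assumed to be tab-delimited:

```shell
# filter_groups_portable is a hypothetical helper name; it takes one tab-delimited file.
filter_groups_portable() {
    awk -F'\t' '
        # Pass 1: count how many distinct z - y differences each key has.
        # seen[key, diff] is zero the first time a pair appears, so the
        # per-key counter only goes up once per distinct difference.
        NR == FNR {
            if (!seen[$1, $3 - $2]++)
                ndiff[$1]++
            next
        }
        # Pass 2: keep the lines whose key has exactly one distinct difference.
        ndiff[$1] == 1
    ' "$1" "$1"
}
```

The counter ndiff[$1] plays the role that length(a[$1]) plays in the GNU awk version.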