Using awk to remove all lines that contain a duplicated value in a specific column based on a calculation performed on other columns

Question

I am trying to remove lines from a tab-delimited table when the value (a string) in column x is repeated, including the first instance, but only if the difference between columns y and z is not the same for all of the repeated lines. I know that awk '!seen[$3]++' filename removes duplicates after the first instance, but I also want to remove the first line where the duplicated value was found. Here is an example of what I am trying to do:

x   y    z
a   10   20
b   15   25
b   15   30
b   10   20
c   15   20
d   20   30
e   10   20
e   15   25
e    5   15
f   30   40

Would become:

x   y    z
a   10   20
c   15   20
d   20   30
e   10   20
e   15   25
e    5   15
f   30   40

Here all lines that had "b" in column x were removed since more than one line had "b" in that column AND because the difference between values in column y and z was not always the same for these lines. Lines with "e" in column x stayed because the difference between the values in y and z was always 10.

Any help would be greatly appreciated!

Note: I am a beginner with awk
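
For reference, a minimal sketch of what the !seen[...]++ idiom mentioned above does on the first rows of the sample table. It is keyed on column 1 here, which is an assumption based on x being the first column in the example; the temporary file and the -F'\t' separator are likewise only illustrative.

```shell
# First rows of the sample table, written tab-separated to a temporary file.
tmp=$(mktemp)
printf '%s\t%s\t%s\n' x y z  a 10 20  b 15 25  b 15 30  b 10 20  c 15 20 > "$tmp"

# seen[$1]++ evaluates to 0 (false) the first time a key appears, so the
# negation is true and the line is printed; later lines with the same key
# see a non-zero count and are skipped.
result=$(awk -F'\t' '!seen[$1]++' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

The first "b" line survives and only the later ones are dropped, which is why this idiom alone does not do what is asked here.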

Answer 1

awk to the rescue!

A two-pass algorithm, mark and sweep: the first pass marks the keys to delete, the second pass prints everything else.

$ awk 'NR==FNR{if($1 in a) {if(a[$1]!=$3-$2) d[$1]} 
               else a[$1]=$3-$2; next} 
     !($1 in d)' file{,}

a   10   20
c   15   20
d   20   30
e   10   20
e   15   25
e    5   15
f   30   40

Explanation

NR==FNR: true only during the first scan of the file

if($1 in a): if the first field has already been seen

if(a[$1]!=$3-$2): and the delta differs from the one stored earlier

d[$1]: add the field to the delete list

else a[$1]=$3-$2: if the field wasn't seen before, store it with its delta

next: proceed to the next record until all lines are done, so nothing is printed during the first scan

Once NR==FNR is no longer true, we're in the second scan.

!($1 in d): print the line if its first field is not in the delete list compiled above

file{,}: bash shorthand for writing file file
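
The steps above can be sketched as a self-contained script. This is only an illustration under a few assumptions: the sample table from the question is recreated in a temporary file, the file name is passed twice explicitly so the script does not depend on bash brace expansion, and the array names first and drop are illustrative stand-ins for a and d.

```shell
# Sample table from the question, written tab-separated to a temporary file.
tmp=$(mktemp)
printf 'x\ty\tz\n' > "$tmp"
printf '%s\t%s\t%s\n' a 10 20  b 15 25  b 15 30  b 10 20  c 15 20 \
                      d 20 30  e 10 20  e 15 25  e 5 15   f 30 40 >> "$tmp"

result=$(awk -F'\t' '
    NR == FNR {                      # first pass: only collect information
        delta = $3 - $2
        if ($1 in first) {           # key seen before ...
            if (first[$1] != delta)  # ... with a different delta
                drop[$1]             # referencing the element creates it
        } else {
            first[$1] = delta        # remember the first delta for this key
        }
        next                         # print nothing during the first pass
    }
    !($1 in drop)                    # second pass: print keys not marked
' "$tmp" "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Note that the header line is treated like any other record: its delta evaluates to 0 and its key occurs once, so it is printed unchanged.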

Answer 2

With GNU awk for true multi-dimensional arrays:

$ awk 'NR==FNR{a[$1][$3-$2]; next} length(a[$1])==1' file file
x   y    z
a   10   20
c   15   20
d   20   30
e   10   20
e   15   25
e    5   15
f   30   40
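
Arrays of arrays are a GNU awk extension. Where gawk is not available, the same idea (keep a key only if it has exactly one distinct delta) could be sketched with POSIX-style composite subscripts instead. This is a swapped-in variant, not the answer's own code; the array names seen and ndelta and the temporary file are assumptions made for the illustration.

```shell
# Sample table from the question, written tab-separated to a temporary file.
tmp=$(mktemp)
printf 'x\ty\tz\n' > "$tmp"
printf '%s\t%s\t%s\n' a 10 20  b 15 25  b 15 30  b 10 20  c 15 20 \
                      d 20 30  e 10 20  e 15 25  e 5 15   f 30 40 >> "$tmp"

result=$(awk -F'\t' '
    NR == FNR {                          # first pass over the file
        delta = $3 - $2
        if (!(($1, delta) in seen)) {    # new key/delta combination
            seen[$1, delta]              # composite subscript: $1 SUBSEP delta
            ndelta[$1]++                 # count distinct deltas per key
        }
        next
    }
    ndelta[$1] == 1                      # second pass: keep single-delta keys
' "$tmp" "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

On the sample table this prints the same lines as the gawk version above, header included.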

solution1
4 ACCEPTED 2016-08-03 21:08:11

solution2
0 2016-08-03 22:45:47
