Awk: how to compare two strings in one line

Question

I have a dataset with 20,000 probes in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I have tried awk's substr function, but didn't get the expected outcome. Here is the one-liner I tried:

awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'

Another option would be to anchor the last character in columns 2 and 4 (awk '$2~/[AZ]$/'), but I can't find a way to match the probes in the two columns using a regex. All suggestions and comments will be very much appreciated.

Example of dataset:

        Probe 1                     Probe 2
4736    GGAGGAAGAGGAGGCGGAGGA   A   GGAGGACGAGGAGGAGGAGGA
4737    GGAGGAAGAGGAGGGAGAGGG   B   GGAGGACGAGGAGGAGGAGGG
4738    GGAGGATTTGGCCGGAGAGGC   C   GGAGGAGGAGGAGGACGAGGT
4739    GGAGGAAGAGGAGGGGGAGGT   D   GGAGGACGAGGAGGAGGAGGC
4740    GGAGGAAGAGGAGGGGGAGGC   E   GGAGGAGGAGGACGAGGAGGC

Desired output:

4736    GGAGGAAGAGGAGGCGGAGGA   A   GGAGGACGAGGAGGAGGAGGA
4737    GGAGGAAGAGGAGGGAGAGGG   B   GGAGGACGAGGAGGAGGAGGG
4740    GGAGGAAGAGGAGGGGGAGGC   E   GGAGGAGGAGGACGAGGAGGC

Answer 1

This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:

awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'

What I changed compared to your sample script:

Move the if statement out of the { ... } block into a filter
Use length($2) and length($4) instead of hardcoding the value 21
The { print $0 } is not needed, as that is the default action for the matched lines
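
To check the filter against the sample rows from the question, here is a minimal sketch. The file names probes.txt and matched.txt are assumptions made for illustration (the question does not name its input file), and the header line is left out of the sample:

```shell
# Hypothetical input file reproducing the five sample rows from the question
cat > probes.txt <<'EOF'
4736    GGAGGAAGAGGAGGCGGAGGA   A   GGAGGACGAGGAGGAGGAGGA
4737    GGAGGAAGAGGAGGGAGAGGG   B   GGAGGACGAGGAGGAGGAGGG
4738    GGAGGATTTGGCCGGAGAGGC   C   GGAGGAGGAGGAGGACGAGGT
4739    GGAGGAAGAGGAGGGGGAGGT   D   GGAGGACGAGGAGGAGGAGGC
4740    GGAGGAAGAGGAGGGGGAGGC   E   GGAGGAGGAGGACGAGGAGGC
EOF

# Keep rows where the last character of field 2 equals the last character of field 4.
# matched.txt is a hypothetical output name; drop the redirect to print to the terminal.
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)' probes.txt > matched.txt
cat matched.txt
```

On these five rows the filter keeps 4736, 4737 and 4740, which is the desired output from the question. With awk's default whitespace splitting, a header line such as "Probe 1 ... Probe 2" has "1" in field 2 and "2" in field 4, so it would be dropped as well.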

solution1
5 ACCEPTED 2016-11-27 14:38:17
