Awk: how to compare two strings in one line

I have a dataset with 20,000 probes in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I have tried AWK's substr function, but didn't get the expected outcome. Here is the one-liner I tried:

awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'

Another option would be to anchor the last character in columns 2 and 4 (awk '$2~/[A-Z]$/'), but I can't find a way to match the probes in the two columns using a regex. All suggestions and comments will be very much appreciated.

Example of dataset:

        Probe 1                     Probe 2
4736    GGAGGAAGAGGAGGCGGAGGA   A   GGAGGACGAGGAGGAGGAGGA
4737    GGAGGAAGAGGAGGGAGAGGG   B   GGAGGACGAGGAGGAGGAGGG
4738    GGAGGATTTGGCCGGAGAGGC   C   GGAGGAGGAGGAGGACGAGGT
4739    GGAGGAAGAGGAGGGGGAGGT   D   GGAGGACGAGGAGGAGGAGGC
4740    GGAGGAAGAGGAGGGGGAGGC   E   GGAGGAGGAGGACGAGGAGGC

Desired output:

4736    GGAGGAAGAGGAGGCGGAGGA   A   GGAGGACGAGGAGGAGGAGGA
4737    GGAGGAAGAGGAGGGAGAGGG   B   GGAGGACGAGGAGGAGGAGGG
4740    GGAGGAAGAGGAGGGGGAGGC   E   GGAGGAGGAGGACGAGGAGGC

This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:

awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'

What I changed compared to your sample script:

  • Move the if statement out of the { ... } block into a filter
  • Use length($2) and length($4) instead of hardcoding the value 21
  • The { print $0 } is not needed, as that is the default action for the matched lines
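
As a quick check, here is a minimal sketch that runs the filter on the five data rows from the question. The variable names sample and matched are only for illustration, and the rows are fed in through printf rather than from your real file, so adapt the input to your setup.

```shell
# Illustrative input: the five data rows from the question (header row omitted).
sample='4736    GGAGGAAGAGGAGGCGGAGGA   A   GGAGGACGAGGAGGAGGAGGA
4737    GGAGGAAGAGGAGGGAGAGGG   B   GGAGGACGAGGAGGAGGAGGG
4738    GGAGGATTTGGCCGGAGAGGC   C   GGAGGAGGAGGAGGACGAGGT
4739    GGAGGAAGAGGAGGGGGAGGT   D   GGAGGACGAGGAGGAGGAGGC
4740    GGAGGAAGAGGAGGGGGAGGC   E   GGAGGAGGAGGACGAGGAGGC'

# Keep rows where the last character of field 2 equals the last character of field 4.
matched=$(printf '%s\n' "$sample" |
  awk 'substr($2, length($2), 1) == substr($4, length($4), 1)')

# Show the rows that passed the filter.
printf '%s\n' "$matched"
```

This should print rows 4736, 4737 and 4740, as in the desired output. One thing to watch for: on a blank line both substr calls return an empty string, so the comparison is true and the blank line is printed. If your file may contain blank or short lines, prefixing the condition with NF >= 4 && should skip them.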
