
Bash Script Unix. Pattern matching

How can I write a script that matches repeated substrings within a single line that contains no spaces, such as:

acgtttgggcccagctctccgccctcacacacaccccggggt

Split into triplets for readability:

acg ttt ggg ccc agc tct ccg ccc tca cac aca ccc cgg ggt

I need to check whether the 4th three-letter sequence is repeated 2 more times. In the sequence above, the 4th triplet is ccc; it appears again after agc tct ccg, and once more after tca cac aca.

Would I have to use grep for this?

Then how about:

#!/bin/bash

# add a space every three letters
str="acgtttgggcccagctctccgccctcacacacaccccggggt"
result=$(sed -e 's/\(...\)/\1 /g' <<< "$str")
echo "$result"

# check if the 4th sequence is repeated two times
awk '
{   ref = $4;                       # set the 4th sequence as a reference
    for (i=5; i<=NF; i++)           # iterate from 5th sequence to the end
        if (ref == $i) count++      # count the same one as the reference
    printf "4th sequence \"%s\" repeated %d times.\n", ref, count
}' <<< "$result"

which yields:

acg ttt ggg ccc agc tct ccg ccc tca cac aca ccc cgg ggt
4th sequence "ccc" repeated 2 times.  

The script has two parts: the first splits the string into space-separated triplets, and the second counts how many times the 4th triplet occurs again.

  • The sed command sed -e 's/\(...\)/\1 /g' inserts a space after every three characters.
  • The awk script takes the 4th field as the reference, then loops over the remaining fields and counts those identical to it.
  • If you only need to know whether the triplet is repeated exactly two times, modify the script to compare count with 2.
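The exact-count check mentioned in the last point could look roughly like the sketch below. The function name count_fourth_repeats is a made-up helper for illustration, and the script assumes the input is a single line whose triplets are read in frame from the first character.

```shell
#!/bin/bash

# count_fourth_repeats is a hypothetical helper name, not part of the script above.
# It prints how many times the 4th triplet occurs again after position 4.
count_fourth_repeats() {
    printf '%s\n' "$1" |
    sed -e 's/\(...\)/\1 /g' |          # add a space every three letters
    awk '{
        ref = $4                        # the 4th triplet is the reference
        count = 0
        for (i = 5; i <= NF; i++)       # scan the triplets after it
            if ($i == ref) count++
        print count
    }'
}

str="acgtttgggcccagctctccgccctcacacacaccccggggt"
if [ "$(count_fourth_repeats "$str")" -eq 2 ]; then
    echo "4th triplet repeated exactly 2 times"
else
    echo "4th triplet not repeated exactly 2 times"
fi
```

Printing only the number keeps the comparison in the shell, so the threshold can be changed without touching the awk code.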

Hope this helps.
