
How to take lines between two patterns when one pattern is a variable in bash/awk (dynamic regexp)

I am trying to combine dynamic regular expressions with awk's ability to print the lines between two patterns, where the patterns may come from bash variables. In this specific instance, the first pattern is a bash variable and the second is the next line that begins with ">". The data looks something like:

CGCGCGCGCGCGCGCGCGCGCGCG
>jcf719000004955    0-783586
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
>jcf_anything   0-999999
TATATATATATATATATATATATA
TATATATATATATATATATATATA

And I would like to obtain just:

ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT

So, using these variables:

i="jcf719000004955"
data="/bin/file"

Neither of these matching patterns works:

awk '/^\>$i/{f=1;next} /^\>.*/{f=0} f' $data
awk '/^\>$i/{f=0} f; /^\>.*/{f=1}' $data

I'm able to use a dynamic regular expression to find the header line that contains my bash variable, like this:

awk -v var="$i" '$0 ~ var ' $data | head -1
>jcf719000004955    0-783586

But how do I combine the use of dynamic regular expressions in order to obtain the lines in between two variables/patterns?

You can use the following gawk command:

i=jcf719000004955; awk -v var="$i" '$0~"^>"var{f=1; next}/^[^>]/{if(f)print;next}/^>/{if(f)exit}' input.txt

input:

CGCGCGCGCGCGCGCGCGCGCGCG
>jcf719000004955    0-783586
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
>jcf_anything   0-999999
TATATATATATATATATATATATA
TATATATATATATATATATATATA 

output:

ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT

Explanation:

  • -v var="$i" passes the shell variable to awk, where it is available as the awk variable var.
  • Uninitialized awk variables evaluate to 0 in numeric context and to the empty string in string context, so f is false until the first rule sets it.
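
To make the two points above concrete, here is a minimal sketch. The shell variable names `header` and `unset_demo` and the inline sample lines are made up for this illustration; only `-v` and the default value of an unset variable come from the answer.

```shell
# Pass a shell value into awk with -v; inside the script it is an ordinary awk variable.
i="jcf719000004955"
header=$(printf '%s\n' 'CGCG' '>jcf719000004955    0-783586' 'ACGT' |
    awk -v var="$i" '$0 ~ ("^>" var)')
printf '%s\n' "$header"

# An unset awk variable (f here) acts as 0 in numeric context,
# as "" in string context, and as false in a condition.
unset_demo=$(awk 'BEGIN {
    s = (f + 0) "|" (f "") "|" (f ? "true" : "false")
    print s
}')
printf '%s\n' "$unset_demo"
```

The first command prints only the header line that matches the ID; the second prints `0||false`.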

the awk script:

# Rule(s)

$0 ~ ("^>" var) { # the line starts with ">" followed by the value of the shell variable
        f = 1 # set the flag
        next  # skip to the next line
}

/^[^>]/ { # the line does not start with ">"
        if (f) { # the flag is set, so we are inside the wanted record
                print $0 # print the whole line to stdout
        }
        next # skip to the next line
}

/^>/ { # a header that did not match the variable (the first rule would have caught it otherwise)
 if (f) { # the flag is set, so the wanted record has already been printed: stop reading
       exit
 }
}
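
One caveat with the script above: the variable is spliced into a regular expression, so a shorter ID such as jcf7190 would also match the header ">jcf719000004955", and any regex metacharacters in the ID would be interpreted. If exact IDs are wanted, a possible variant compares the first whitespace-separated field as a plain string instead. This is only a sketch: the function name `extract_record` and the temporary file are hypothetical, and it assumes the ID is the first field of the header, as in the sample data.

```shell
# Hypothetical helper: print the lines that follow the header whose ID equals $1 exactly.
extract_record() {
    awk -v id="$1" '
        /^>/ {                      # every header line
            if (f) exit             # next header after our record: done
            f = ($1 == ">" id)      # literal, whole-field comparison (no regex)
            next
        }
        f                           # print sequence lines while the flag is set
    ' "$2"
}

# Sample data copied from the question, written to a temporary file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
CGCGCGCGCGCGCGCGCGCGCGCG
>jcf719000004955    0-783586
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
>jcf_anything   0-999999
TATATATATATATATATATATATA
TATATATATATATATATATATATA
EOF

full=$(extract_record "jcf719000004955" "$sample")     # exact ID: three ACGT lines
partial=$(extract_record "jcf7190" "$sample")          # prefix only: no output
printf '%s\n' "$full"
rm -f "$sample"
```

Like the original, this stops reading at the first header after the wanted record, which saves time on large files.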


You can also try this:

awk -F'\n' -v RS='>' -v i="$i" '$1 ~ i {for(j=2;j<NF;j++) print $j}' infile
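
This one-liner works by changing what awk treats as a record: with RS set to ">" each record is one header plus its sequence lines, and with newline as the field separator the header becomes $1 and each sequence line a later field. The sketch below (the IDs idA and idB are invented sample data) prints the layout of the second record to show why the loop starts at field 2 and stops before NF: the newline that ends the last sequence line leaves an empty final field.

```shell
# With RS=">" each record is "header\nseq1\nseq2\n"; with FS="\n" the header is $1.
layout=$(printf '%s\n' 'CGCG' '>idA  0-10' 'AAAA' 'CCCC' '>idB  0-20' 'TTTT' |
    awk -F'\n' -v RS='>' 'NR == 2 { print "header=[" $1 "] NF=" NF " last=[" $NF "]" }')
printf '%s\n' "$layout"
```

Here the record has four fields (header, two sequence lines, and the empty trailing field), so the output is `header=[idA  0-10] NF=4 last=[]`.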

The following awk command may also help.

i="jcf719000004955"
data="/bin/file"
awk -v val="$i" '/^>/{match($0,val);if(substr($0,RSTART,RLENGTH)){flag=1} else {flag=""};next} flag' "$data"

Output will be as follows.

ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT
ACGTACGTACGTACGTACGTACGT

Explanation of the code above:

i="jcf719000004955"              ## Shell variable i holds the ID given in the question.
data="your_file"                 ## Shell variable data holds the path of the input file.
awk -v val="$i" '                ## -v creates the awk variable val with the value of the shell variable i.
/^>/{                            ## For every line that starts with ">":
  match($0,val);                 ## match() searches the current line for the regex in val. On success it sets RSTART to the 1-based position of the match and RLENGTH to its length; on failure RSTART is 0 and RLENGTH is -1.
  if(substr($0,RSTART,RLENGTH)){ ## If the matched substring (RLENGTH characters starting at RSTART) is not empty:
    flag=1 }                     ## set flag to true.
  else{                          ## Otherwise (no match, so the substring is empty):
    flag=""};                    ## reset flag to the empty string (false).
next                             ## next skips the remaining rules for this line and moves on to the next input line.
}
flag                             ## A bare condition with no action: when flag is true, the default action prints the current line.
' "$data"                        ## The input file; the double quotes protect paths that contain spaces.
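
To see what match() leaves in RSTART and RLENGTH, here is a small sketch with a hard-coded header string (the values jcf719 and nomatch are chosen only for illustration):

```shell
# match() sets RSTART (1-based start) and RLENGTH (length) on success; 0 and -1 on failure.
hit=$(awk -v val="jcf719" 'BEGIN { s = ">jcf719000004955"; match(s, val); print RSTART, RLENGTH }')
miss=$(awk -v val="nomatch" 'BEGIN { s = ">jcf719000004955"; match(s, val); print RSTART, RLENGTH }')
printf 'hit:  %s\nmiss: %s\n' "$hit" "$miss"
```

This prints `2 6` for the hit and `0 -1` for the miss. Since match() also returns the start position (0 when nothing matches), the substr() test could probably be shortened to flag = match($0, val), though that is a simplification of the answer's code rather than what it does as written.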
