linux - sed loop through a list of lines in a file

I am having trouble looping through a list of lines in File A, using each line to find a match in File B, and then printing multiple lines from File B.

This is what File A looks like:

Nitab4.5_0000062g0520.1

Nitab4.5_0000436g0070.1

Nitab4.5_0000375g0110.1

This is what File B looks like:

>Nitab4.5_0000062g0520.1 Zinc finger, CCHC-type, Fibronectin-binding A, N-terminal, Domain of unknown function DUF814, Protein of unknown function DUF3441
MVKVRMNTADVAAEVKCLRRLIGMRCSNVYDLSPKTYVFKLMNSSGVTESGESEKVLLLM
ESGVRLHTTDYLRDKSNTPSGFTLKLRKHIRTRRLEDVRQLGYDRIVLFQFGLGANAHYV
ILELYAQGNILLTDSDFMVMTLLRSHRDDDKGLAIMSRHRYPVEICRVFKRTTTEKLQAA
LMSSAETDKNEGVEDNEQGNDGSDALQQKQGNRKNIKATDSTKKMIDGVRAKSPTLKVVL
GEALGYGPALSEHIILDAGLVPNAKIGKGFELEGEMLHSLIEAVKQFEDWLEDVILGEKV
PEGYILMQQKALSKKDSSMCNNGASEKMYDEFCPLLLNQFKSRDFMKFEAFNAALDEFYS
KIESQRSEQQQKAKESTAMQKLNKIRTDQENRVVTLKQEVEHCIKTAELIEYNLEDVDAA
ILAVRVALANGMSWEDLARMVKEEKRSGNPVAGLIDKLHLERNCMTLLLSNNLDEMDDDE
KTQPVDKVEVDLALSAHANARRWYEMKKRQESKQEKTVTAHEKAFKAAERKTRLQLSQEK
TVAVISHMRKVHWFEKFNWFVSSENYLVISGRDAQQNEMIVKRYMSKGDLYVHAELHGAS
STVIKNHKPEMPIPPLTLNQAGCFTVCQSQAWDSKIVTSAWWVYPNQVSKTAPTGEYLTV
GSFMIRGKKNFLPPHPLIMGFGILFRLDESSLGFHLNERRVRGEEEGLNDAEQSDPSLAI
PDSDSEEELSMETSVDKDITDVPNDRSSVAGTSYEVQSNSLLSISDDKVTNSHNSSVKVN
SINNDGLSDSLGIMATSGTSQLEDLIDRALEIGSSTASTKNHGVPPLLGSAGQQDNEEKK
VTQREKPYITKAERRKLKKGSDSTEGAPARQEKQSEKNQKAQKQCDEDVNNSKSGGGKVI
RGQKGKLKKIKEKYADQDEEERRIRMALLASAGKVEKVDQTIQSEKVDAEPDKGAKATTG
PEDASKICYKCKKVGHLSRDCQENSDESLQSTANGGDGHSLTSAGNAANDRDRIVMEEED
IHEIGEEEKEKLNDVDYLTGNPLPNDILLYAVPVCGPYNALQSYKYRVKLVPGTVKKGKA
AKTAMNLFSHMPEATSREKELMKACTDPELVAAVKGNVKITSAGLTQLKQKQKKSKKSNK
AES

>Nitab4.5_0000375g0110.1 Tetratricopeptide-like helical, NSF attachment protein, Tetratricopeptide repeat, Malate dehydrogenase, active site, Tetratricopeptide repeat-containing domain
MGDQIARGEEFEKKAEKKLSGWGLFGSKHDDAADLFDKAANCFKLAKSWDQAGAVYVKVA
NCYLKLDSKHEAAGAYANAAHCYKKTNTREAISCLEQAVHMFLDIGRLNMSARYYKEIAE
LYEQEQNLEQAIIYYEKAADLFQSEDVTTSANQCKQKIAQFSAELEKYQRAIEIFEEIAR
HSVNNNLLKYGVRGHLLNAGICQLCKGDVVAINNALERYQELDPTFSGTRECKLLVDLAA
AIDEEDVAKFTGSVKEYDSMTKLDALRTTLLLRVKEALKAKELEEDDLT

>Nitab4.5_0000062g0530.1 DNA polymerase, palm domain, DNA-directed DNA polymerase, family B, conserved site, DNA-directed DNA polymerase, family B, multifunctional domain, DNA-directed DNA polymerase, family B
MARVTGVPISFLLARGQSIKVLSQLLRKARQRNLVIPNVKQAGSEQGTYEGATVLEARAG
FYEKPIATLDFASLYPSIMMAYNLCYCTLVTPEEFHKLNLCEVDVNKTPSGEMFVKSDLQ
KGILPEILEELLAARKRAKADLKEAKDPLVKAVLDGRQLALKISANSVYGFTGATVGQLP
CLEISSSVTSYGRQMIEKTKKLVEDKFTVLKGYEHNAEVIYGDTDSVMVQFGVPTVEEAM
KLGREAADHISETFIKPLRLEFEKIYYPYLLISKKRYAGLLWTNPDKHDKMDAKGELLAT

>Nitab4.5_0005502g0010.1 CDC6, C-terminal domain, P-loop containing nucleoside triphosphate hydrolase, Cell division protein Cdc6/18, Winged helix-turn-helix DNA-binding domain
MPTIPVRRSPRISGGSKVAGQTVSRNEIGVSTPSKRKIRSDSTTEDNVVTSTLTPSPMEI
SPCKWKSPRRCVNDSPKSPLNANRGDKTINLSKSPVKRRLSESFLEKPIWNPRDMEQLNA
VKEALHVSRAPSNLVCRQVEQNRVLEFCKQAVKIEKAGSLYVCGCPGTGKSLSMEKVKEV
LVNWADESGFQAPDILSVNCTSLSNTSDIFGKMLDKIQPRRKLNCSTAPLQYLQKMFSEK
QQPAGTKMLLIVADELDYLITKDKVVLHELFMLTTSPFSRFILIGIANAIDLADRFLPKL
QSMNCEYFPSCKPAVITFCAYSKDQIISILQQRFEKVASASGDMRKALWVCRLVNIAARL
ADHSLTKSAIEMLEAEIRDSISSLDLPSLHGRVSYQHRDGACDKSPIHESNVVRVDHVAI
ALSKAYRSPVVDTIQSLPQHQQIILCSAVKLFRGKKKDATIGELNISYLDVCKSTLIPPV
GIMELSSMCRVLGDQGILKVGKAREEKLSRVTLKVDEADITFALQA

>Nitab4.5_0005502g0020.1
MVIEEQCDDEGVQPYIEQLMDGQNYSQAQTHDGQSNDFNNSADTEIQQNDDSGKTIDVQI
NSRNQFIGKEGRKLASFLGIVARTPELTPLQCKKWD

>Nitab4.5_0005502g0030.1
MINERLRNNSERLNDHPPQSVAWEGDVYSQVLKNKKSGYVRGNIDLEDSSNEVKRLEQKV
IELTKLNGKQNEEMSSMKPELLWMRKVMCKIAPNELYMSQNINEISIGQVTQIQKFKTFV
LKH

>Nitab4.5_0005502g0040.1 Ribosomal protein L10/acidic P0, Ribosomal protein L10/L12
MAVKVTKAEKKVNYDKKLCKLLDTYQQILIVGADNVGSNQLQMIRKGLRGDSIVLMGKNT
MMKRSIRIHAEKTGNNAFLALIPCLVGNVGLIFTRGDLKEVSDEVSKYKVGAPARVGLVA
PIDVVVPPGNTGLDPSQTSFFQVLNIPTKINKGTVEITIPVEIIKKGEKVGSSESALLSK
LGIKPFSYGLIVQFVYDSGSVFSPEVLDLTEDDLIAKFAAGLSNVVGLSMLLSYPTLAAI
PHMFINGYKNVLSFAIATEYSFPQAEKVKEYLKDPSKFATAIAAPVATKPAVKPATAKEE
KKEEPAEEDDDDFVGGLFD

I want to print the description lines (lines starting with >NitabXXXX) and the amino acid sequences that follow them (the capitalized letters) in File B if the gene IDs (Nitab4.5_xxxxx) are found in File A. (In File B, the amino acid sequences are split across multiple lines.)

Here is the code I have come up with so far:

while IFS= read -r Gene_ID; do sed -n '/$Gene_ID/,/>Nitab4.5/p' File B | sed '$d'; done < File A 

The code works with a hard-coded gene ID and no loop, but I could not get it to work after adding the loop. I am new to Linux and sed. I hope someone can point out the mistake and help me correct the code. Thank you!

Your question is a bit confusing, but could it be that this simple command is what you're looking for?

grep -f FILE_A -A 1 FILE_B

The options do the following:

-f FILE
Obtain patterns from FILE, one per line. The empty file contains zero patterns, and therefore matches nothing.
-A NUM
Print NUM lines of trailing context after matching lines. Places a line containing a group separator (described under --group-separator) between contiguous groups of matches.
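Two caveats apply when this command meets the files shown in the question. An empty line in the pattern file is an empty pattern, which matches every line, so the blank lines in File A should be removed first. Also, -A 1 prints only one line after each match, so a sequence that spans several lines is cut short. The following is a minimal sketch on hypothetical miniature files (ids.txt and seqs.fa are made-up names, and the gene names are placeholders), assuming a grep that supports -A:

```shell
tmp=$(mktemp -d)

# Hypothetical miniature versions of File A (with a blank line) and File B.
printf 'gene1\n\ngene3\n' > "$tmp/ids.txt"
printf '>gene1 desc\nAAAA\nCCCC\n\n>gene2 desc\nGGGG\n\n>gene3 desc\nTTTT\n' > "$tmp/seqs.fa"

# Drop empty patterns first; -F treats each ID as a fixed string, not a regex.
sed '/^$/d' "$tmp/ids.txt" > "$tmp/ids.clean"
result=$(grep -F -f "$tmp/ids.clean" -A 1 "$tmp/seqs.fa")

printf '%s\n' "$result"
rm -rf "$tmp"
```

In this sketch the header of gene1 is followed by AAAA only; the second sequence line CCCC is not printed, which is why a fixed -A count does not suit records of varying length.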

First, let's try to print the third entry in FileB. (I am calling it FileB instead of File B because spaces in file names are big headaches.)

sed -n '/Nitab4.5_0000062g0530.1/,/>Nitab4.5/p' FileB
>Nitab4.5_0000062g0530.1 DNA polymerase, palm domain, DNA-directed DNA polymerase, family B, conserved site, DNA-directed DNA polymerase, family B, multifunctional domain, DNA-directed DNA polymerase, family B
MARVTGVPISFLLARGQSIKVLSQLLRKARQRNLVIPNVKQAGSEQGTYEGATVLEARAG
FYEKPIATLDFASLYPSIMMAYNLCYCTLVTPEEFHKLNLCEVDVNKTPSGEMFVKSDLQ
KGILPEILEELLAARKRAKADLKEAKDPLVKAVLDGRQLALKISANSVYGFTGATVGQLP
CLEISSSVTSYGRQMIEKTKKLVEDKFTVLKGYEHNAEVIYGDTDSVMVQFGVPTVEEAM
KLGREAADHISETFIKPLRLEFEKIYYPYLLISKKRYAGLLWTNPDKHDKMDAKGELLAT

>Nitab4.5_0005502g0010.1 CDC6, C-terminal domain, P-loop containing nucleoside triphosphate hydrolase, Cell division protein Cdc6/18, Winged helix-turn-helix DNA-binding domain

It picked up the first line of the next entry. So instead of terminating at ">Nitab4.5", let's terminate at the empty line:

sed -n '/Nitab4.5_0000062g0530.1/,/^$/p' FileB
>Nitab4.5_0000062g0530.1 DNA polymerase, palm domain, DNA-directed DNA polymerase, family B, conserved site, DNA-directed DNA polymerase, family B, multifunctional domain, DNA-directed DNA polymerase, family B
MARVTGVPISFLLARGQSIKVLSQLLRKARQRNLVIPNVKQAGSEQGTYEGATVLEARAG
FYEKPIATLDFASLYPSIMMAYNLCYCTLVTPEEFHKLNLCEVDVNKTPSGEMFVKSDLQ
KGILPEILEELLAARKRAKADLKEAKDPLVKAVLDGRQLALKISANSVYGFTGATVGQLP
CLEISSSVTSYGRQMIEKTKKLVEDKFTVLKGYEHNAEVIYGDTDSVMVQFGVPTVEEAM
KLGREAADHISETFIKPLRLEFEKIYYPYLLISKKRYAGLLWTNPDKHDKMDAKGELLAT

Now to do it with a variable:

line=Nitab4.5_0000062g0530.1; sed -n '/$line/,/^$/p' FileB

We get nothing, because inside single quotes the shell does not expand the variable: it passes the literal text $line to sed, and sed has its own ideas about what that means. To get the shell to expand the variable before passing it to sed, we must use double quotes:

line=Nitab4.5_0000062g0530.1; sed -n "/$line/,/^$/p" FileB
>Nitab4.5_0000062g0530.1 DNA polymerase, palm domain, DNA-directed DNA polymerase, family B, conserved site, DNA-directed DNA polymerase, family B, multifunctional domain, DNA-directed DNA polymerase, family B
MARVTGVPISFLLARGQSIKVLSQLLRKARQRNLVIPNVKQAGSEQGTYEGATVLEARAG
FYEKPIATLDFASLYPSIMMAYNLCYCTLVTPEEFHKLNLCEVDVNKTPSGEMFVKSDLQ
KGILPEILEELLAARKRAKADLKEAKDPLVKAVLDGRQLALKISANSVYGFTGATVGQLP
CLEISSSVTSYGRQMIEKTKKLVEDKFTVLKGYEHNAEVIYGDTDSVMVQFGVPTVEEAM
KLGREAADHISETFIKPLRLEFEKIYYPYLLISKKRYAGLLWTNPDKHDKMDAKGELLAT
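
The difference between the two quoting styles can be seen without running sed at all. This small sketch (the variable names single and double are only for illustration) prints the script text that sed would receive in each case:

```shell
line=Nitab4.5_0000062g0530.1

# Single quotes: the shell leaves $line untouched, so sed would get the literal text.
single='/$line/,/^$/p'
# Double quotes: the shell substitutes the value before sed ever sees the script.
# The $ in /^$/ survives because "$/" is not a valid variable reference.
double="/$line/,/^$/p"

printf '%s\n' "$single"   # /$line/,/^$/p
printf '%s\n' "$double"   # /Nitab4.5_0000062g0530.1/,/^$/p
```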

If this is satisfactory, we can start on the loop. Always start with something simple:

while read line; do echo $line; done < FileA
Nitab4.5_0000062g0520.1

Nitab4.5_0000436g0070.1

Nitab4.5_0000375g0110.1

Those empty lines are a pain, so let's remove them. We can do that several ways, but since we're using sed anyway, let's use sed:

sed '/^$/d' FileA | while read line; do echo $line; done
Nitab4.5_0000062g0520.1
Nitab4.5_0000436g0070.1
Nitab4.5_0000375g0110.1

Now we put it all together:

sed '/^$/d' FileA | while read line; do sed -n "/$line/,/^$/p" FileB; done 
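
A slightly more defensive variant of the same loop is sketched below on hypothetical miniature files (the gene names and sequences are placeholders). It assumes a sed that accepts -E for extended regular expressions. It reads each line with IFS= read -r, skips blank lines inside the loop, escapes the dots in each ID so they match literally, and anchors the match to the start of a header line:

```shell
tmp=$(mktemp -d)

# Hypothetical miniature versions of FileA and FileB.
printf 'gene1\n\ngene3\n' > "$tmp/FileA"
printf '>gene1 desc\nAAAA\nCCCC\n\n>gene2 desc\nGGGG\n\n>gene3 desc\nTTTT\n' > "$tmp/FileB"

result=$(
    while IFS= read -r id; do
        [ -n "$id" ] || continue                            # skip blank lines in FileA
        pattern=$(printf '%s\n' "$id" | sed 's/\./\\./g')   # turn each . into \.
        # Print from the header whose ID is exactly $id up to the next empty line.
        sed -n -E "/^>$pattern([[:space:]]|\$)/,/^\$/p" "$tmp/FileB"
    done < "$tmp/FileA"
)

printf '%s\n' "$result"
rm -rf "$tmp"
```

Like the original loop, this rereads FileB once per ID, which is fine for a short list but slow for thousands of IDs.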

Thank you for updating your input file. If awk is an option, would you please try the following:

awk '
    BEGIN {RS=ORS="\n\n"; FS="\n"}
    NR==FNR {
        for (i=1; i<=NF; i++) nitab[$i]
        next
    }
    {
        if (match($1, /^>[^[:blank:]]+/)) {
            str = substr($1, 2, RLENGTH - 1)    # gene ID without the leading ">"
            if (str in nitab) print
        }
    }
' FileA FileB

Output:

>Nitab4.5_0000062g0520.1 Zinc finger, CCHC-type, Fibronectin-binding A, N-terminal, Domain of unknown function DUF814, Protein of unknown function DUF3441
MVKVRMNTADVAAEVKCLRRLIGMRCSNVYDLSPKTYVFKLMNSSGVTESGESEKVLLLM
ESGVRLHTTDYLRDKSNTPSGFTLKLRKHIRTRRLEDVRQLGYDRIVLFQFGLGANAHYV
ILELYAQGNILLTDSDFMVMTLLRSHRDDDKGLAIMSRHRYPVEICRVFKRTTTEKLQAA
LMSSAETDKNEGVEDNEQGNDGSDALQQKQGNRKNIKATDSTKKMIDGVRAKSPTLKVVL
GEALGYGPALSEHIILDAGLVPNAKIGKGFELEGEMLHSLIEAVKQFEDWLEDVILGEKV
PEGYILMQQKALSKKDSSMCNNGASEKMYDEFCPLLLNQFKSRDFMKFEAFNAALDEFYS
KIESQRSEQQQKAKESTAMQKLNKIRTDQENRVVTLKQEVEHCIKTAELIEYNLEDVDAA
ILAVRVALANGMSWEDLARMVKEEKRSGNPVAGLIDKLHLERNCMTLLLSNNLDEMDDDE
KTQPVDKVEVDLALSAHANARRWYEMKKRQESKQEKTVTAHEKAFKAAERKTRLQLSQEK
TVAVISHMRKVHWFEKFNWFVSSENYLVISGRDAQQNEMIVKRYMSKGDLYVHAELHGAS
STVIKNHKPEMPIPPLTLNQAGCFTVCQSQAWDSKIVTSAWWVYPNQVSKTAPTGEYLTV
GSFMIRGKKNFLPPHPLIMGFGILFRLDESSLGFHLNERRVRGEEEGLNDAEQSDPSLAI
PDSDSEEELSMETSVDKDITDVPNDRSSVAGTSYEVQSNSLLSISDDKVTNSHNSSVKVN
SINNDGLSDSLGIMATSGTSQLEDLIDRALEIGSSTASTKNHGVPPLLGSAGQQDNEEKK
VTQREKPYITKAERRKLKKGSDSTEGAPARQEKQSEKNQKAQKQCDEDVNNSKSGGGKVI
RGQKGKLKKIKEKYADQDEEERRIRMALLASAGKVEKVDQTIQSEKVDAEPDKGAKATTG
PEDASKICYKCKKVGHLSRDCQENSDESLQSTANGGDGHSLTSAGNAANDRDRIVMEEED
IHEIGEEEKEKLNDVDYLTGNPLPNDILLYAVPVCGPYNALQSYKYRVKLVPGTVKKGKA
AKTAMNLFSHMPEATSREKELMKACTDPELVAAVKGNVKITSAGLTQLKQKQKKSKKSNK
AES

>Nitab4.5_0000375g0110.1 Tetratricopeptide-like helical, NSF attachment protein, Tetratricopeptide repeat, Malate dehydrogenase, active site, Tetratricopeptide repeat-containing domain
MGDQIARGEEFEKKAEKKLSGWGLFGSKHDDAADLFDKAANCFKLAKSWDQAGAVYVKVA
NCYLKLDSKHEAAGAYANAAHCYKKTNTREAISCLEQAVHMFLDIGRLNMSARYYKEIAE
LYEQEQNLEQAIIYYEKAADLFQSEDVTTSANQCKQKIAQFSAELEKYQRAIEIFEEIAR
HSVNNNLLKYGVRGHLLNAGICQLCKGDVVAINNALERYQELDPTFSGTRECKLLVDLAA
AIDEEDVAKFTGSVKEYDSMTKLDALRTTLLLRVKEALKAKELEEDDLT

[Explanations]

  • The BEGIN block sets the input/output record separators to double newlines and the field separator to a newline. This lets awk handle each paragraph (a description line plus its amino acid lines) as one record.
  • The condition NR==FNR is TRUE only while reading the first file in the argument list (= FileA). The idiom is useful for switching the procedure depending on the input file.
  • The loop for (i=1; i<=NF; i++) nitab[$i] stores each line of FileA as a key of the array nitab.
  • The call match($1, /^>[^[:blank:]]+/) locates the >NitabXXX token at the start of a record in FileB, which corresponds to a line of FileA.
  • The variable str is then assigned the gene ID taken from that match.
  • If str is a key of the array nitab, the record is printed.
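
The paragraph-mode script above relies on an empty line between entries. For a file where the headers follow the sequences directly with no blank line in between, a line-oriented variant of the same NR==FNR idiom may be easier. This is only a sketch on hypothetical miniature files: the names ids and keep are illustrative, and it assumes each line of FileA holds exactly one ID with no surrounding whitespace:

```shell
tmp=$(mktemp -d)

# Hypothetical miniature files; FileB here has no blank lines between entries.
printf 'gene1\n\ngene3\n' > "$tmp/FileA"
printf '>gene1 desc\nAAAA\nCCCC\n>gene2 desc\nGGGG\n>gene3\nTTTT\n' > "$tmp/FileB"

result=$(awk '
    NR == FNR { if ($0 != "") ids[$0]; next }        # first file: remember each ID
    /^>/      { keep = (substr($1, 2) in ids) }      # header: is its ID wanted?
    keep && NF                                       # print non-empty lines of wanted entries
' "$tmp/FileA" "$tmp/FileB")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because FileB is read only once, this approach stays fast even when FileA lists many IDs.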
