
Bash: remove lines from a fasta file

I wonder what the best way is to remove some lines from a fasta file in bash.

In the example below, say I want to remove the line containing 'GUITH'. How do I remove that line and the sequence lines below it, up to the next '>' character?

fasta file:

>B4KSI7_DROMO
RGLKRKPMALIKKLRKAKKEAPPNEKPEIVKTHLRNMIIVPEMTGSIIGVYNGKDFGQVE
VKPEMIGHYLGEFALTYKPVKH
>O46898_GUITH
RSLSKGPYIAAHLLKKLNNVDIQKPDVVIKTWSRSSTILPNMVGATIAVYNGKQHVPVYI
SDQMVGHKLGEFSPTRTFRSH
>Q7RT13_PLAYO
RGIDKKAKSLLKKLRKAKKECEVGEKPKPIPTHLRNMTIIPEMVGSIVAVHNGKQYTNVE
IKPEMIGYYLGEFSITYKHTRH

fasta file after filtering with bash:

>B4KSI7_DROMO
RGLKRKPMALIKKLRKAKKEAPPNEKPEIVKTHLRNMIIVPEMTGSIIGVYNGKDFGQVE
VKPEMIGHYLGEFALTYKPVKH
>Q7RT13_PLAYO
RGIDKKAKSLLKKLRKAKKECEVGEKPKPIPTHLRNMTIIPEMVGSIVAVHNGKQYTNVE
IKPEMIGYYLGEFSITYKHTRH

There is another version of the question that needs a harder manipulation. Say you have a file with species names:

species.txt:

DROMO;
PLAYO;

And you want to delete the records in the fasta file whose species is not listed in species.txt. You get the same output as above, but the records to delete are determined by another file (instead of entering 'GUITH' directly). What would be the best way of doing that?

To remove the record whose header contains 'GUITH':

sed 's/>/\n&/' fasta.txt | sed '/_GUITH/,/^$/d' | sed '/^$/d'

To delete the records in the fasta file whose species is not listed in species.txt:

With GNU sed and bash:

sed 's/>/\n&/' fasta.txt | sed -n -f <( sed 's/;$//;s|.*|/_&$/,/^$/p|' species.txt ) | sed '/^$/d'

Output:

>B4KSI7_DROMO
RGLKRKPMALIKKLRKAKKEAPPNEKPEIVKTHLRNMIIVPEMTGSIIGVYNGKDFGQVE
VKPEMIGHYLGEFALTYKPVKH
>Q7RT13_PLAYO
RGIDKKAKSLLKKLRKAKKECEVGEKPKPIPTHLRNMTIIPEMVGSIVAVHNGKQYTNVE
IKPEMIGYYLGEFSITYKHTRH
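
To see what the process substitution hands to `sed -n -f`, it can help to run the inner command by itself. A minimal sketch, assuming species.txt has the contents shown in the question; the temporary directory and the `generated` variable are only there for illustration:

```shell
# Recreate the species list from the question (note the trailing semicolons).
tmp=$(mktemp -d)
printf 'DROMO;\nPLAYO;\n' > "$tmp/species.txt"

# Same inner command as in the pipeline above: strip the trailing ';',
# then wrap each name in a sed range address followed by the p command.
generated=$(sed 's/;$//;s|.*|/_&$/,/^$/p|' "$tmp/species.txt")
printf '%s\n' "$generated"
# Prints:
# /_DROMO$/,/^$/p
# /_PLAYO$/,/^$/p
```

Each generated line means: print from a header ending in `_NAME` through the next empty line. Those empty lines are the ones the first `sed` inserts before every `>`, which is why the last `sed` in the pipeline deletes them again.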

In awk:

$ awk '/^>/{p=1} /GUITH/{p=0} p' file
>B4KSI7_DROMO
RGLKRKPMALIKKLRKAKKEAPPNEKPEIVKTHLRNMIIVPEMTGSIIGVYNGKDFGQVE
VKPEMIGHYLGEFALTYKPVKH
>Q7RT13_PLAYO
RGIDKKAKSLLKKLRKAKKECEVGEKPKPIPTHLRNMTIIPEMVGSIVAVHNGKQYTNVE
IKPEMIGYYLGEFSITYKHTRH

Explained:

/^>/ { p=1 }    # turn print flag up for each record starting with >
/GUITH/ { p=0 } # turn print flag down for GUITH
p               # print if p
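
The pattern `/GUITH/` above is tested on every line, so it would also switch printing off in the middle of a record if those five letters ever appeared inside a sequence. A possible variant, sketched here under the assumption that headers end in `_SPECIES` and the file has Unix line endings, only looks at header lines (the temporary path and the `filtered` variable are illustrative):

```shell
# Recreate the sample fasta file from the question in a temporary directory.
tmp=$(mktemp -d)
cat > "$tmp/file" <<'EOF'
>B4KSI7_DROMO
RGLKRKPMALIKKLRKAKKEAPPNEKPEIVKTHLRNMIIVPEMTGSIIGVYNGKDFGQVE
VKPEMIGHYLGEFALTYKPVKH
>O46898_GUITH
RSLSKGPYIAAHLLKKLNNVDIQKPDVVIKTWSRSSTILPNMVGATIAVYNGKQHVPVYI
SDQMVGHKLGEFSPTRTFRSH
>Q7RT13_PLAYO
RGIDKKAKSLLKKLRKAKKECEVGEKPKPIPTHLRNMTIIPEMVGSIVAVHNGKQYTNVE
IKPEMIGYYLGEFSITYKHTRH
EOF

# Only header lines change the flag: p becomes 0 for a header ending in
# _GUITH and 1 for any other header; sequence lines inherit the last value.
filtered=$(awk '/^>/{p = ($0 !~ /_GUITH$/)} p' "$tmp/file")
printf '%s\n' "$filtered"
```

On the sample file this prints the same six lines as the command above.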

If you want to use a list of approved names:

$ cat list
DROMO
PLAYO
$ awk 'NR==FNR{a[$1];next} /^>/{n=split($0,b,"_"); p=(b[n] in a)} p' list file
>B4KSI7_DROMO
RGLKRKPMALIKKLRKAKKEAPPNEKPEIVKTHLRNMIIVPEMTGSIIGVYNGKDFGQVE
VKPEMIGHYLGEFALTYKPVKH
>Q7RT13_PLAYO
RGIDKKAKSLLKKLRKAKKECEVGEKPKPIPTHLRNMTIIPEMVGSIVAVHNGKQYTNVE
IKPEMIGYYLGEFSITYKHTRH

Explained:

NR==FNR { a[$1]; next }                   # read the list to array a
/^>/ { n=split($0,b,"_"); p=(b[n] in a) } # take the word after _ and if in a, enable print
p                                         # if p, print
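
The `list` file above has no trailing semicolons, while species.txt in the question does. A minimal sketch of the same idea that reads species.txt as given, assuming one name per line with an optional trailing `;` (the array names `keep` and `parts`, the temporary paths, and the `kept` variable are illustrative):

```shell
# Recreate both input files from the question in a temporary directory.
tmp=$(mktemp -d)
printf 'DROMO;\nPLAYO;\n' > "$tmp/species.txt"
cat > "$tmp/fasta.txt" <<'EOF'
>B4KSI7_DROMO
RGLKRKPMALIKKLRKAKKEAPPNEKPEIVKTHLRNMIIVPEMTGSIIGVYNGKDFGQVE
VKPEMIGHYLGEFALTYKPVKH
>O46898_GUITH
RSLSKGPYIAAHLLKKLNNVDIQKPDVVIKTWSRSSTILPNMVGATIAVYNGKQHVPVYI
SDQMVGHKLGEFSPTRTFRSH
>Q7RT13_PLAYO
RGIDKKAKSLLKKLRKAKKECEVGEKPKPIPTHLRNMTIIPEMVGSIVAVHNGKQYTNVE
IKPEMIGYYLGEFSITYKHTRH
EOF

kept=$(awk '
  # First file: drop a trailing ";" and remember each non-empty name.
  NR == FNR { sub(/;$/, ""); if ($0 != "") keep[$0]; next }
  # Header line: the species is the text after the last "_".
  /^>/      { n = split($0, parts, "_"); p = (parts[n] in keep) }
  # Print every line while the current record is wanted.
  p
' "$tmp/species.txt" "$tmp/fasta.txt")
printf '%s\n' "$kept"
```

Note that the `NR == FNR` idiom assumes the first file is not empty; with an empty species.txt the fasta file itself would be read as the name list.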
