How to extract specific parts of a reference text, based on a list of identifiers?

I have a reference file (.fasta) and a list of gene IDs. For each ID in the gene ID list, I need to get the corresponding sequence into a text file. How can I automate this?

Things I've tried so far:

  1. sed

sed -n -e '/{GENEID1}/,/>/p' referencefile.fasta | sed $d >> seqs.txt

with '>' being the character at which I'd like sed to stop. I need the second sed to remove the last line, because the range also grabs the first line of the next sequence. This works if I just run it once, but if I try

cat geneID.txt | xargs sed -n -e '/{}/,/>/p' referencefile.fasta >> seqs.txt

then I get just a list of IDs, with no sequences. It also takes very long, so I assume sed is reading through the reference file, but I don't see why it won't grab the sequences.

  2. grep

grep -o -P '(?={GENEID}).*(?=>)

Here I have the same issue: it works individually, but not with xargs or a loop.

  3. using a for loop

     for LINE in $(cat geneIDs.txt); do
         echo $LINE >> seqs.txt
         sed -n -e '/$LINE/,/>/p' referencefile.fasta | sed $d >> seqs.txt
     done

I'm also open to trying something in Python, though I'm not that well-versed in it yet. My preliminary attempt has been based on this question here. I have a test ID list of 10 lines, which I tried to run like this:

t = open('test.txt', 'r')
test = t.readlines()
test = test.split()
t.close()

with open('referencefile.fasta', 'r') as ref:
    for line in ref:
        for i in test:
            if i in line:
                print(line)

With this one, I couldn't even get a sequence from the reference file, regardless of the loop.

Can you guys spot the issue? Why won't any of these give me sequences?

Thanks in advance!

Edited to add:

Example reference:

>000000F
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg


>000001F
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

>000002F

TGCGTGAGGTGCTAGGGATGACAATTGAAAAGAGGACATTGATCGATCACTTGACTCATTTCAGAAAGGAGTTTGGGTTGTCCAACAAGTTGAGGGGGATGATCATCAGGCATCCTGAGT

test IDs: 000000F, 000001F

Ideal result:

000000F ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg

000001F NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Current result:

000000F 000001F

If there is always exactly one sequence line after each gene ID in your fasta file, this will help:

grep -A1 -Fwf geneIds.txt input.fasta

Check this example:

$  head -n 20 *
==> ids.txt <==
000000F
000001F

==> input.fasta <==
>000000F
Yes I want it!


>000001F
Yes I want it too!

>000002F
skip

>00000XYZ
skip

kent$  grep -A1 -Fwf ids.txt input.fasta
>000000F
Yes I want it!
--
>000001F
Yes I want it too!
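
The grep approach above relies on each sequence sitting on a single line. If sequences may wrap over several lines, a streaming pass in Python (which the question says it is open to) is one alternative. The following is a minimal sketch, not a definitive implementation: `extract_records`, `reference` and `ids` are hypothetical names, the sample data is a shortened stand-in for the question's file, and it assumes the ID is the first whitespace-separated token after `>`. It also avoids the `test.readlines()` followed by `test.split()` step from the question, which cannot work because `readlines()` returns a list and lists have no `split()` method.

```python
import io


def extract_records(fasta_lines, wanted_ids):
    """Yield (record_id, sequence) for every record whose ID is wanted.

    Assumption: the ID is the first token after '>' on a header line.
    """
    wanted = set(wanted_ids)
    current_id = None  # ID of the record being collected, or None to skip
    chunks = []        # sequence lines collected for the current record
    for raw in fasta_lines:
        line = raw.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if we were keeping it.
            if current_id is not None:
                yield current_id, "".join(chunks)
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            current_id = record_id if record_id in wanted else None
            chunks = []
        elif line and current_id is not None:
            chunks.append(line)
    # Emit the last record if the file ended while we were collecting it.
    if current_id is not None:
        yield current_id, "".join(chunks)


# Shortened stand-in data; the first sequence is wrapped over two lines.
reference = io.StringIO(
    ">000000F\n"
    "ctatcttcgaggttgcc\n"
    "acctgtatcgagg\n"
    "\n"
    ">000001F\n"
    "NNNNNNNN\n"
    "\n"
    ">000002F\n"
    "TGCGTGAGG\n"
)
ids = ["000000F", "000001F"]
results = list(extract_records(reference, ids))
for record_id, sequence in results:
    print(record_id, sequence)
```

With real files, the same function can be fed an open file handle for the reference and a list of stripped lines from the ID file. Because IDs are compared as whole tokens, 000000F would not also match a header such as 000000FX.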

Depending on the size of the file, your access patterns, and what else you may use the sequences for, it may be easiest to just build a BLAST database, then feed it your identifiers; it will return exactly what you are asking for (except as correctly formatted FASTA).

Pros: it is well designed, tested, and fast.

Cons: it may be overkill for your task

(but still super useful if you will be continuing to work in this space).

https://duckduckgo.com/?q=build+a+blast+database&ia=web

Given:

$ cat file
>000000F
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg


>000001F
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

With awk you can read data separated by two or more \n in paragraph mode. This allows you to easily build an associative database of a file in that format.

Example, search by exact string:

awk -v RS= -v FS="\n" -v q=">000000F" '$1==q{print $2}' file
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg

Or search by regex:

awk -v RS= -v FS="\n" -v q="[01]F$" '$1~q {print $2}' file
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Or, build an associative array:

awk -v RS= -v FS="\n"   '{arr[$1]=$2} END{ "do something with the data in arr" }' file

Which you could use to print from a file with a list of ids:

cat ids
>000001F
>000000F

awk -v RS= -v FS="\n"  'FNR==NR{for(i=1; i<=NF; i++) ids[$i]; next}
$1 in ids{print $2}' ids file
ctatcttcgaggttgccacctgtatcgaggagttggcgtctagatcacgaacatgtattttagctatcgtgagctcacacctgacggatccagctttcgaggtcacatcctcaagtctcg
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
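
The same paragraph-mode idea can be sketched in Python with a dictionary standing in for the awk associative array. This is only an illustration under stated assumptions: `load_paragraph_records`, `text` and `ids` are hypothetical names, the data is a shortened stand-in, and records are assumed to be separated by one or more blank lines, as in the awk examples. Unlike `print $2`, the sketch joins every line after the header, and it prints in the order of the ID list rather than in file order.

```python
import re


def load_paragraph_records(text):
    """Map each header line to its sequence.

    Blocks separated by blank lines are treated as records, similar to
    awk with RS set to the empty string (paragraph mode).
    """
    records = {}
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        # Only blocks that begin with a '>' header are treated as records.
        if lines and lines[0].startswith(">"):
            records[lines[0]] = "".join(lines[1:])
    return records


# Shortened stand-in for the reference file used in the examples.
text = (
    ">000000F\n"
    "ctatcttcg\n"
    "\n"
    "\n"
    ">000001F\n"
    "NNNNNNNN\n"
    "\n"
    ">000002F\n"
    "TGCGTGAGG\n"
)
records = load_paragraph_records(text)

# As in the awk example, the IDs carry the leading '>'.
ids = [">000001F", ">000000F"]
found = [(header, records[header]) for header in ids if header in records]
for header, sequence in found:
    print(header.lstrip(">"), sequence)
```

Loading the whole file into a dictionary is convenient for repeated lookups, but for very large references the memory cost may matter.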
