
I want to parse sequences and sequence IDs from a FASTA file and assign them to a DataFrame. I am using the SeqIO library from Biopython

This is what my code looks like. Assume the file path is stored in the variable file.

from Bio import SeqIO

# Read every record in the FASTA file into a list
seq_object = SeqIO.parse(file, "fasta")
sequences = []
for seq in seq_object:
    sequences.append(seq)

first_record = sequences[0]
first_record

The output looks like this:

SeqRecord(seq=Seq('mfptsiisvlllnalqshaapllpsspstlafvpsvhapssssskssvhttsts...fr*'), id='Thaps3a_25099', name='Thaps3a_25099', description='Thaps3a_25099', dbxrefs=[])

To assign the records to a DataFrame, I tried it this way:

seq_ids = []
seqs = []
seq_lengths = []

for record in sequences:
    seq_id = record.id
    sequence = record.seq
    length = len(sequence)

    seq_ids.append(seq_id)
    seqs.append(sequence)
    seq_lengths.append(length)

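Now, in the DataFrame, each sequence shows up as comma-separated letters, which I don't want. I want the sequences as plain strings. Any suggestions?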

import pandas as pd

df = pd.DataFrame()
df["Seq_id"] = seq_ids
df["Sequences"] = seqs
df["Sequence_length"] = seq_lengths

The DataFrame looks like this:

       Seq_id          Sequences                                          Sequence_length
0      Thaps3a_25099   (m, f, p, t, s, i, i, s, v, l, l, l, n, a, l, ...  331
1      Thaps3a_10882   (m, v, k, q, i, a, v, a, t, c, m, t, l, a, s, ...  187
2      Thaps3a_255658  (f, g, g, e, g, f, l, l, f, f, l, g, l, g, f, ...  111
3      Thaps3a_21592   (m, k, a, s, i, l, t, a, l, s, i, l, s, v, a, ...  228
4      Thaps3a_261225  (m, l, t, i, l, s, l, l, e, w, m, a, s, r, w, ...  1317
...    ...             ...                                                ...
13339  Thaps3a_24736   (m, a, e, w, a, s, h, k, t, a, t, n, m, p, p, ...  567
13340  Thaps3a_9764    (m, s, t, h, n, d, f, r, q, g, t, a, y, l, f, ...  981
13341  Thaps3a_3869    (m, p, f, p, f, f, g, f, g, q, s, d, p, a, a, ...  181
13342  Thaps3a_1985    (m, n, s, d, e, q, p, l, v, t, n, d, d, q, d, ...  416
13343  Thaps3a_25099   (m, a, e, d, d, y, h, l, i, s, e, e, p, s, s, ...  445

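Just use str(record.seq). record.seq is a Biopython Seq object rather than a plain Python string, and pandas appears to display it as a tuple of individual letters; converting it with str() stores the sequence as ordinary text.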

from Bio import SeqIO
import pandas as pd

file = 'fasta.faa'

# Read every record in the FASTA file into a list
seq_object = SeqIO.parse(file, "fasta")
sequences = []
for seq in seq_object:
    sequences.append(seq)

first_record = sequences[0]
print(first_record)

seq_ids = []
seqs = []
seq_lengths = []

for record in sequences:
    seq_id = record.id
    sequence = str(record.seq)  # convert the Seq object to a plain string
    length = len(sequence)

    seq_ids.append(seq_id)
    seqs.append(sequence)
    seq_lengths.append(length)

df = pd.DataFrame()
df["Seq_id"] = seq_ids
df["Sequences"] = seqs
df["Sequence_length"] = seq_lengths

print(df)
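
For what it's worth, the same idea can be written more compactly by building one dict per record and handing the list to pandas in a single step. This is only a sketch: the helper name fasta_to_dataframe and the two short records (seq_a, seq_b) are made up for illustration, and the FASTA text is kept in memory with StringIO so the example runs without a file on disk.

```python
from io import StringIO

import pandas as pd
from Bio import SeqIO

# Hypothetical two-record FASTA, held in memory so the sketch runs without a file.
fasta_text = """>seq_a
mfptsiisvl
lln
>seq_b
mvkqiav
"""


def fasta_to_dataframe(handle):
    """Return one row per FASTA record, with sequences stored as plain strings."""
    rows = [
        {
            "Seq_id": record.id,
            "Sequences": str(record.seq),  # str() turns the Seq object into a plain string
            "Sequence_length": len(record.seq),
        }
        for record in SeqIO.parse(handle, "fasta")
    ]
    return pd.DataFrame(rows, columns=["Seq_id", "Sequences", "Sequence_length"])


df = fasta_to_dataframe(StringIO(fasta_text))
print(df)
```

SeqIO.parse also accepts a file name, so fasta_to_dataframe('fasta.faa') should work the same way on a real file.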
