
How to extract the header from a FASTA file in Python

I have a FASTA file containing a sequence. I just want to extract the header information and display it.

I am new to coding in Python.

In the FASTA format, each record begins with a header line that starts with ">". For a file containing a single sequence, the header is therefore always the first line of the file. So you just need to open the file, read one line, and you have it.

I.e.:

# The with open will open the file using "f" as the file handle. 
with open("/home/rightmire/Downloads/fastafile", "r") as f: 
    for line in f: # Creates a for loop to read the file line by line
        print(line) # This is the first line
        # If you comment out the break, the file will continue to be read line by line
        # If you want just the first line, you can break the loop
        break 

# even though the loop has ended, the last contents of the variable 'line' is remembered    
print("The data retained in the variable 'line' is: ", line)

OUTPUT:

>gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase

 The data retained in the variable 'line' is:  >gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
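If you want the header as a value rather than printed output, the same idea can be wrapped in a small function. This is only a sketch: the name `read_first_header` is made up for illustration, and it assumes the file starts with a header line (it raises an error otherwise). It also strips the trailing newline and the leading ">", which the examples above leave in place.

```python
def read_first_header(path):
    """Return the first header of a FASTA file, without '>' or the newline.

    'read_first_header' is a hypothetical helper name, not part of any library.
    """
    with open(path, "r") as f:
        # readline() returns the first line including its line ending
        first_line = f.readline().rstrip("\r\n")
    # Assumption: a valid FASTA file begins with a '>' header line
    if not first_line.startswith(">"):
        raise ValueError("file does not start with a FASTA header line")
    return first_line[1:]
```

Called on the file used above, this should return the text after the ">", i.e. the line starting with `gi|186681228|ref|YP_001864424.1|`.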

===

You can also do it without the loop or 'with', if you prefer.

f = open("/home/rightmire/Downloads/fastafile", "r")
line = f.readline()  # Reads one line (the header)
print(line)
f.close()  # Closes the open file; without 'with' you must do this yourself

===

Finally, you can read the entire file into memory, where you can manipulate the file as a whole, work on individual lines, or even parse it character by character. This may not be the best idea, however, because FASTA files can get HUGE!

# "with open" opens the file using "f" as the file handle and closes it when the block ends.
with open("/home/rightmire/Downloads/fastafile", "r") as f:
    # Read the entire file into the variable 'lines'
    lines = f.read()
# Split 'lines' by the newline character to get individual lines. 
for line in lines.split("\n"): 
    print("--------")
    print(line)

# or even read it out character by character, which can be handy for parsing the genome data. 
for c in lines:
    print(c) 

OUTPUT:

--------
>gi|186681228|ref|YP_001864424.1| phycoerythrobilin:ferredoxin oxidoreductase
--------
MNSERSDVTLYQPFLDYAIAYMRSRLDLEPYPIPTGFESNSAVVGKGKNQEEVVTTSYAFQTAKLRQIRA
--------
AHVQGGNSLQVLNFVIFPHLNYDLPFFGADLVTLPGGHLIALDMQPLFRDDSAYQAKYTEPILPIFHAHQ
--------
QHLSWGGDFPEEAQPFFSPAFLWTRPQETAVVETQVFAAFKDYLKAYLDFVEQAEAVTDSQNLVAIKQAQ
--------
LRYLRYRAEKDPARGMFKRFYGAEWTEEYIHGFLFDLERKLTVVK
--------

>
g
i
|
1
(snip)

M
N
S
E
(snip)
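Since reading a huge file into memory is risky, here is a sketch that stays line-by-line and only keeps the header lines. It also covers files with more than one record, where every record has its own ">" line. The function name `iter_headers` is made up for illustration, and the sketch assumes that any line starting with ">" is a header.

```python
def iter_headers(path):
    """Yield each header in a FASTA file, one at a time, without '>' or the newline.

    'iter_headers' is a hypothetical helper name, not part of any library.
    """
    with open(path, "r") as f:
        # Iterating over the file object reads one line at a time,
        # so the whole file is never held in memory at once.
        for line in f:
            # Assumption: every line beginning with '>' is a record header
            if line.startswith(">"):
                yield line[1:].rstrip("\r\n")
```

You could then loop over it with `for header in iter_headers("/home/rightmire/Downloads/fastafile"): print(header)`. For the single-record file above, that loop would print just the one header.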

