More efficient way to retrieve lines from a huge file
I have a data file with 1,786,916 record IDs, and I want to retrieve the corresponding records from another file that contains about 4.8 million records (DNA sequences in this case, but basically just plain text). I wrote a Python script to do this, but it is taking ages to run (day 3 and it is only 12% done). Since I am a relative newcomer to Python, I would appreciate any suggestions for speeding this up.
Here is an example of the data file with the IDs (the ID in the example is ANICH889-10):
ANICH889-10 k__Animalia; p__Arthropoda; c__Insecta; o__Lepidoptera; f__Psychidae; g__Ardiosteres; s__Ardiosteres sp. ANIC9
ARONW984-15 k__Animalia; p__Arthropoda; c__Arachnida; o__Araneae; f__Clubionidae; g__Clubiona; s__Clubiona abboti
Here is an example of the second file that contains the records:
>ASHYE2081-10|Creagrura nigripesDHJ01|COI-5P|HM420985
ATTTTATACTTTTTATTAGGAATATGATCAGGAATAATTGGTCTTTCAATAAGAATCATTATCCGTATTGAATTAAGAAATCCAGGATCTATTATTAATAATGACCAAATTTATAATTCATTAATTACTATACACGCACTATTAATAATTTTTTTTTTAGTTATACCTGTAATAATTGGAGGATTTGGAAATTGATTAATTCCTATTATAATTGGAGCCCCAGATATAGCATTTCCACGAATAAACAATCTTAGATTTTGATTATTAATCCCATCAATTTTCATATTAATATTAAGATCAATTACTAATCAAGGTGTAGGAACAGGATGAACAATATATCCCCCATTATCATTAAATATAAATCAAGAAGGAATATCAATAGATATATCAATTTTTTCTTTACATTTAGCAGGAATATCCTCAATTTTAGGATCAATTAATTTCATTTCAACTATTTTAAATATAAAATTTATTAATTCTAATTATGATCAATTAACTTTATTTTCATGATCAATTCTAATTACTACTATTTTATTATTACTAGCAGTCCCTGTATTAGCAGGAGCAATTACTATAATTTTAACTGATCGAAATTTAAATACTTCTTTTTTTGATCCTAGAGGAGGAGGAGATCCAATTT-----------------
>BCISA145-10|Hemiptera|COI-5P
AACTCTATACTTTTTACTAGGATCCTGGGCAGGAATAGTAGGAACATCATTAAGATGAATAATCCGAATTGAACTAGGACAACCTGGATCTTTTATTGGAGATGACCAAACTTATAATGTAATTGTAACTGCCCACGCATTTGTAATAATTTTCTTTATAGTTATACCAATTATAATTGGAGGATTTGGAAATTGATTAATTCCCTTAATAATTGGAGCACCCGATATAGCATTCCCACGAATGAATAACATAAGATTTTGATTGCTACCACCGTCCCTAACACTTCTAATCATAAGTAGAATTACAGAAAGAGGAGCAGGAACAGGATGAACAGTATACCCTCCATTATCCAGAAACATCGCCCATAGAGGAGCATCTGTAGATTTAGCAATCTTTTCCCTACATCTAGCAGGAGTATCATCAATTTTAGGAGCAGTTAACTTCATTTCAACAATTATTAATATACGACCAGCAGGAATAACCCCAGAACGAATCCCATTATTTGTATGATCTGTAGGAATTACAGCACTACTACTCCTACTTTCATTACCCGTACTAGCAGGAGCCATTACCATACTCTTAACTGACCGAAACTTCAATACTTCTTTTTTTGACCCTGCTGGAGGAGGAGATCCCATCCTATATCAACATCTATTC
However, in the second file the DNA sequences are split across several lines rather than being on a single line, and they are not always the same length.
EDIT
Here is the output I want:
>ANICH889-10
GGGATTTGGTAATTGATTAGTTCCTTTAATA---TTGGGGGCCCCTGACATAGCTTTTCCTCGTATAAATAATATAAGATTTTGATTATTACCTCCCTCTCTTACATTATTAATTTCAAGAAGAATTGTAGAAAATGGAGCTGGGACTGGATGAACTGTTTACCCTCCTTTATCTTCTAATATCGCCCATAGAGGAAGCTCTGTAGATTTA---GCAATTTTCTCTTTACATTTAGCAGGAATTTCTTCTATTTTAGGAGCAATTAATTTTATTACAACAATTATTAATATACGTTTAAATAATTTATCTTTCGATCAAATACCTTTATTTGTTTGAGCAGTAGGAATTACAGCATTTTTACTATTACTTTCTTTACCTGTATTAGCTGGA---GCTATTACTATATTATTAACT---------------------------------------------------------------------------
>ARONW984-15
TGGTAACTGATTAGTTCCATTAATACTAGGAGCCCCTGATATAGCCTTCCCCCGAATAAATAATATAAGATTTTGACTTTTACCTCCTTCTCTAATTCTTCTTTTATCAAGGTCTATTATNGAAAATGGAGCA---------GGAACTGGCTGAACAGTTTACCCTCCCCTTTCTTNTAATATTTCCCATGCTGGAGCTTCTGTAGATCTTGCAATCTTTTCCCTACACCTAGCAGGTATTTCCTCAATCCTAGGGGCAGTTAAT------TTTATCACAACCGTAATTAACATACGCTCTAGAGGAATTACATTTGATCGAATGCCTTTATTTGTATGATCTGTATTAATTACAGCTATTCTTCTACTACTCTCCCTCCCAGTATTAGCAGGGGCTATTACAATACTACTCACAGACCGAAATTTAAAT-----------------------------------
Here is the Python script I wrote to do this:
from Bio import SeqIO
from Bio.Seq import Seq
import csv
import sys

#Name of the datafile
Taxonomyfile = "02_Arthropoda_specimen_data_less.txt"

#Name of the original sequence file
OrigTaxonSeqsfile = "00_Arthropoda_specimen.fasta"

#Name of the output sequence file
f4 = open("02_Arthropoda_specimen_less.fasta", 'w')

#Reading the datafile and extracting record IDs
TaxaKeep = []
with open(Taxonomyfile, 'r') as f1:
    datareader = csv.reader(f1, delimiter='\t')
    for item in datareader:
        TaxaKeep.append(item[0])
print(len(TaxaKeep))

#Filtering sequence file to keep only those sequences with the desired IDs
datareader = SeqIO.parse(OrigTaxonSeqsfile, "fasta")
for seq in datareader:
    for item in TaxaKeep:
        if item in seq.id:
            f4.write('>' + str(item) + '\n')
            f4.write(str(seq.seq) + '\n')
I think the trouble here is that I am looping through the list of 1.7 million record names for each of the 4.8 million records. I thought about building a dictionary or something like that for the 4.8 million records, but I cannot figure out how. Any suggestions (including non-Python ones)?

Thanks!
I think you can get a big performance improvement by improving the look-up.

Using a set() will help you here. Sets are designed for very fast look-ups, and they do not store duplicate values, which makes them an ideal choice for filtering data. So let's store all the taxonomy IDs from the input file in a set.
from Bio import SeqIO
from Bio.Seq import Seq
import csv
import sys

taxonomy_file = "02_Arthropoda_specimen_data_less.txt"
orig_taxon_sequence_file = "00_Arthropoda_specimen.fasta"
output_sequence_file = "02_Arthropoda_specimen_less.fasta"

# build a set for fast look-up of IDs
with open(taxonomy_file, 'r', newline='') as fp:
    datareader = csv.reader(fp, delimiter='\t')
    first_column = (row[0] for row in datareader)
    taxonomy_ids = set(first_column)

# use the set to speed up filtering the input FASTA file
with open(output_sequence_file, 'w') as fp:
    for seq in SeqIO.parse(orig_taxon_sequence_file, "fasta"):
        # compare only the record ID, i.e. the part before the first '|'
        record_id = seq.id.split('|')[0]
        if record_id in taxonomy_ids:
            fp.write('>')
            fp.write(record_id)
            fp.write('\n')
            fp.write(str(seq.seq))
            fp.write('\n')
A few notes on the changes:

- The name f4 is completely meaningless. Why not delete the comment and just name the variable output_sequence_file?
- (row[0] for row in datareader) is a generator expression. A generator is an iterable that has not computed the list of IDs yet; it only knows what it has to do. This saves time and memory by not building a temporary list. One line later, the set() constructor, which accepts any iterable, builds a set out of all the IDs in the first column.
- if record_id in taxonomy_ids checks whether the sequence should be output. The in operator is very fast on sets.
- The output is written with several separate .write() calls instead of building a temporary string out of the pieces. seq.id is already a string, but seq.seq is a Seq object, so it does still need a str() call.
- SeqIO.write() would be an even better way to produce the format (see the sketch below).
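As a sketch of that last point, assuming Biopython is installed and reusing taxonomy_ids and the file names from the code above, the whole filtering step could be handed to SeqIO. Note that SeqIO.write() wraps sequences at a fixed line width, so the output will not be single-line as in the question:

from Bio import SeqIO

# filter with a generator expression and let Biopython handle the
# FASTA formatting; SeqIO.write() returns the number of records written
records = (rec for rec in SeqIO.parse(orig_taxon_sequence_file, "fasta")
           if rec.id.split('|')[0] in taxonomy_ids)
count = SeqIO.write(records, output_sequence_file, "fasta")
print(count, "records written")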
Your reasoning is correct: with two nested for loops you are spending time on 4.8 million * 1.7 million repetitions of a single operation.
這就是為什么我們將使用SQLite數據庫存儲OrigTaxonSeqsfile
包含的所有信息的OrigTaxonSeqsfile
。 為什么選擇SQLite? 因為
我無法開始解釋CS理論,但是在您這樣的情況下,索引是搜索數據時的上帝。
索引完數據后,您只需從數據庫中的Taxonomyfile
查找每個記錄ID,然后將其寫入f4
最終輸出文件。
以下代碼應隨您的需要工作,它具有以下優點:
這是代碼
import sqlite3
from itertools import groupby
from contextlib import contextmanager

Taxonomyfile = "02_Arthropoda_specimen_data_less.txt"
OrigTaxonSeqsfile = "00_Arthropoda_specimen.fasta"

@contextmanager
def create_db(file_name):
    """ create an SQLite db; works as a context manager so the file is closed safely """
    conn = sqlite3.connect(file_name, isolation_level="IMMEDIATE")
    cur = conn.cursor()
    # executescript() is needed because there are two statements here
    cur.executescript("""
        CREATE TABLE taxonomy
        ( _id INTEGER PRIMARY KEY AUTOINCREMENT
        , record_id TEXT NOT NULL
        , record_extras TEXT
        , dna_sequence TEXT
        );
        CREATE INDEX idx_taxn_recID ON taxonomy (record_id);
    """)
    yield cur
    conn.commit()
    conn.close()
    return
def parse_fasta(file_like):
    """ generator that yields a tuple containing the record id, the extra
    info in the tail of the header, and the DNA sequence without newlines
    """
    # inspiration = https://www.biostars.org/p/710/
    try:
        from Bio import SeqIO
    except ImportError:
        fa_iter = (x[1] for x in groupby(file_like, lambda line: line[0] == ">"))
        for header in fa_iter:
            # remove the >
            info = next(header)[1:].strip()
            # separate record id from the rest of the sequence info
            x = info.split('|')
            recID, recExtras = x[0], x[1:]
            # build the DNA seq using a generator
            sequence = "".join(s.strip() for s in next(fa_iter))
            yield recID, recExtras, sequence
    else:
        fasta_sequences = SeqIO.parse(file_like, 'fasta')
        for fasta in fasta_sequences:
            # str() instead of the long-deprecated Seq.tostring()
            info, sequence = fasta.id, str(fasta.seq)
            # separate record id from the rest of the sequence info
            x = info.split('|')
            recID, recExtras = x[0], x[1:]
            yield recID, recExtras, sequence
    return
def prepare_data(txt_file, db_file):
    """ put data from txt_file into db_file, building an index on record id """
    i = 0
    src_gen = open(txt_file, mode='rt')
    fasta_gen = parse_fasta(src_gen)
    with create_db(db_file) as db:
        for recID, recExtras, dna_seq in fasta_gen:
            db.execute("""
                INSERT INTO taxonomy
                (record_id, record_extras, dna_sequence) VALUES (?,?,?)
                """,
                # sqlite3 cannot bind a list, so re-join the extras with '|'
                [recID, '|'.join(recExtras), dna_seq]
            )
            i += 1
            if i % 100 == 0:
                print(i, 'records digested into the sql database')
    src_gen.close()
    return
def get_DNA_seq_of(recordID, src):
    """ search for recordID in the src database and return a formatted string """
    ans = ""
    exn = src.execute("SELECT * FROM taxonomy WHERE record_id=?", [recordID])
    for match in exn.fetchall():
        _id, _rec_id, _extras, dna_seq = match
        ans += ">%s\n%s\n" % (recordID, dna_seq)
    return ans
def main():
    # first of all prepare an optimized database
    db_file = OrigTaxonSeqsfile + ".sqlite"
    prepare_data(OrigTaxonSeqsfile, db_file)
    # now start searching and writing
    progress = 0
    db = sqlite3.connect(db_file)
    cur = db.cursor()
    out_file = open("02_Arthropoda_specimen_less.fasta", 'wt')
    taxa_file = open(Taxonomyfile, 'rt')
    with taxa_file, out_file:
        for line in taxa_file:
            question = line.split("\t")[0]
            answer = get_DNA_seq_of(question, cur)
            out_file.write(answer)
            progress += 1
            if progress % 100 == 0:
                print(progress, 'lines processed')
    db.close()

if __name__ == '__main__':
    main()
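As an optional sanity check (not part of the original answer), SQLite itself can tell you whether the look-ups really use the index; a minimal sketch, assuming prepare_data() has already created the database file:

import sqlite3

# ask SQLite for the query plan of a single-ID look-up
conn = sqlite3.connect("00_Arthropoda_specimen.fasta.sqlite")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM taxonomy WHERE record_id = ?",
    ["ANICH889-10"]
).fetchall()
print(plan)  # the plan should mention: SEARCH ... USING INDEX idx_taxn_recID
conn.close()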
Feel free to ask for any clarification.

If you run into any errors, or if the output is not as expected, please send me a 200-line sample of each of Taxonomyfile and OrigTaxonSeqsfile and I will update the code.
Here is a rough speed estimate, considering only disk I/O since that is the slowest part.

Let a = 4.8 million and b = 1.7 million.

With the old approach you would have to perform a * b, i.e. about 8.16 trillion, disk I/O operations.

With my approach, once the indexing is done (i.e. 2 * a operations), you only have to search b = 1.7 million records. So the total in my approach is 2 * (a + b), i.e. 13 million disk I/O operations. That is not small either, but this approach is more than 600,000 times faster.
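If the indexing phase itself turns out to be slow, batching the inserts with executemany() is one way to cut per-statement overhead; a minimal sketch, reusing create_db() and parse_fasta() from the code above (the batch size of 10,000 is an arbitrary choice):

from itertools import islice

def prepare_data_batched(txt_file, db_file, batch_size=10000):
    """ like prepare_data(), but inserts rows in batches """
    with open(txt_file, mode='rt') as src, create_db(db_file) as db:
        rows = ((recID, '|'.join(recExtras), dna_seq)
                for recID, recExtras, dna_seq in parse_fasta(src))
        while True:
            batch = list(islice(rows, batch_size))
            if not batch:
                break
            db.executemany(
                "INSERT INTO taxonomy"
                " (record_id, record_extras, dna_sequence) VALUES (?,?,?)",
                batch
            )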
Why not a dict()? Because I would get scolded by my boss and my professor if I were caught using excessive CPU/RAM. If you own the system, a simpler dict-based approach is:
Taxonomyfile = "02_Arthropoda_specimen_data_less.txt"
OrigTaxonSeqsfile = "00_Arthropoda_specimen.fasta"

def parse_fasta(file_like):
    """ generator that yields a tuple containing the record id, the extra
    info in the tail of the header, and the DNA sequence
    """
    from Bio import SeqIO
    fasta_sequences = SeqIO.parse(file_like, 'fasta')
    for fasta in fasta_sequences:
        # str() instead of the long-deprecated Seq.tostring()
        info, sequence = fasta.id, str(fasta.seq)
        # separate record id from the rest of the sequence info
        x = info.split('|')
        recID, recExtras = x[0], x[1:]
        yield recID, recExtras, sequence
    return
def prepare_data(txt_file, dct):
    """ put data from txt_file into dct """
    i = 0
    with open(txt_file, mode='rt') as src_gen:
        fasta_gen = parse_fasta(src_gen)
        for recID, recExtras, dna_seq in fasta_gen:
            dct[recID] = dna_seq
            i += 1
            if i % 100 == 0:
                print(i, 'records digested into the dict')
    return
def get_DNA_seq_of(recordID, src):
    """ look up recordID in the src dict and return a formatted string """
    dna_seq = src[recordID]
    return ">%s\n%s\n" % (recordID, dna_seq)
def main():
    # first of all prepare the look-up dict
    dct = dict()
    prepare_data(OrigTaxonSeqsfile, dct)
    # now start searching and writing
    progress = 0
    out_file = open("02_Arthropoda_specimen_less.fasta", 'wt')
    taxa_file = open(Taxonomyfile, 'rt')
    with taxa_file, out_file:
        for line in taxa_file:
            question = line.split("\t")[0]
            answer = get_DNA_seq_of(question, dct)
            out_file.write(answer)
            progress += 1
            if progress % 100 == 0:
                print(progress, 'lines processed')
    return

if __name__ == '__main__':
    main()
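If you are unsure whether the dict will fit in RAM, a rough, purely illustrative estimate can be made from a sample; the 1,000-record sample size is an arbitrary choice, and the real dict overhead makes this an underestimate:

import sys
from itertools import islice

# back-of-the-envelope RAM check: measure a 1,000-record sample,
# then extrapolate to the full 4.8 million records
with open(OrigTaxonSeqsfile, mode='rt') as src:
    sample = {recID: dna_seq
              for recID, _extras, dna_seq in islice(parse_fasta(src), 1000)}
sample_bytes = sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in sample.items())
estimate_gb = sample_bytes / len(sample) * 4.8e6 / 1e9
print("roughly %.1f GB for 4.8 million records (dict overhead excluded)" % estimate_gb)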
I asked for clarification in the comments under your question, but for the moment you are unresponsive (no criticism implied), so I will try to answer based on the following assumption: in the second file, each record takes exactly two lines, a header line that starts with ">" followed by some fields separated by "|", the first of which is the ID of the whole two-line record, and then a single line with the sequence. Based on that assumption:
# If possible, no hardcoded filenames; use sys.argv and the command line
import sys

# command line sanity check
if len(sys.argv) != 4:
    print('A descriptive error message')
    sys.exit(1)

# Names of the input and output files
fn1, fn2, fn3 = sys.argv[1:]

# Use a set comprehension to load the IDs from the first file
IDs = {line.split()[0] for line in open(fn1)}  # a set

# Operate on the second file
with open(fn2) as f2:
    # It is possible to use `for line in f2: ...` but here we have(?)
    # two-line records, so it's a bit different
    while True:
        # Try to read two lines from the file
        try:
            header = next(f2)
            payload = next(f2)
        # no more lines? break out of the while loop...
        except StopIteration:
            break
        # Sanity check on the header line
        if header[0] != ">":
            print('Incorrect header line: "%s".' % header)
            sys.exit(1)
        # Split the header line on "|", find the current ID
        ID = header[1:].split("|")[0]
        # Check if the current ID was mentioned in the first file
        if ID in IDs:
            pass  # your code
Because there is no inner loop, this should be about 6 orders of magnitude faster... whether it does what you need remains to be seen :-)
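For illustration only, here is one hypothetical way the `# your code` placeholder could be completed, assuming the two-line layout holds and that fn3 is meant to be the output file, written in the format shown in the question (bare-ID header, then the sequence line):

# hypothetical completion of the read loop above; fn3 is opened once
with open(fn2) as f2, open(fn3, 'w') as f3:
    while True:
        try:
            header = next(f2)
            payload = next(f2)
        except StopIteration:
            break
        ID = header[1:].split("|")[0]
        if ID in IDs:
            f3.write(">%s\n" % ID)
            f3.write(payload)  # payload already ends with a newline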