
More efficient way to retrieve lines from a huge file

I have a data file with 1,786,916 record IDs, and I want to retrieve the corresponding records from another file that contains about 4.8 million records (DNA sequences in this case, but basically just plain text). I wrote a Python script to do this, but it is taking ages to run (it's on day 3 and has only got through 12%). Since I'm relatively new to Python, I was wondering if anyone has suggestions for speeding this up.

Here is a sample of the data file with the IDs (the ID in the example is ANICH889-10):

ANICH889-10 k__Animalia; p__Arthropoda; c__Insecta; o__Lepidoptera; f__Psychidae; g__Ardiosteres; s__Ardiosteres sp. ANIC9
ARONW984-15 k__Animalia; p__Arthropoda; c__Arachnida; o__Araneae; f__Clubionidae; g__Clubiona; s__Clubiona abboti

Here is a sample of the second file that contains the records:

>ASHYE2081-10|Creagrura nigripesDHJ01|COI-5P|HM420985
ATTTTATACTTTTTATTAGGAATATGATCAGGAATAATTGGTCTTTCAATAAGAATCATTATCCGTATTGAATTAAGAAATCCAGGATCTATTATTAATAATGACCAAATTTATAATTCATTAATTACTATACACGCACTATTAATAATTTTTTTTTTAGTTATACCTGTAATAATTGGAGGATTTGGAAATTGATTAATTCCTATTATAATTGGAGCCCCAGATATAGCATTTCCACGAATAAACAATCTTAGATTTTGATTATTAATCCCATCAATTTTCATATTAATATTAAGATCAATTACTAATCAAGGTGTAGGAACAGGATGAACAATATATCCCCCATTATCATTAAATATAAATCAAGAAGGAATATCAATAGATATATCAATTTTTTCTTTACATTTAGCAGGAATATCCTCAATTTTAGGATCAATTAATTTCATTTCAACTATTTTAAATATAAAATTTATTAATTCTAATTATGATCAATTAACTTTATTTTCATGATCAATTCTAATTACTACTATTTTATTATTACTAGCAGTCCCTGTATTAGCAGGAGCAATTACTATAATTTTAACTGATCGAAATTTAAATACTTCTTTTTTTGATCCTAGAGGAGGAGGAGATCCAATTT-----------------
>BCISA145-10|Hemiptera|COI-5P
AACTCTATACTTTTTACTAGGATCCTGGGCAGGAATAGTAGGAACATCATTAAGATGAATAATCCGAATTGAACTAGGACAACCTGGATCTTTTATTGGAGATGACCAAACTTATAATGTAATTGTAACTGCCCACGCATTTGTAATAATTTTCTTTATAGTTATACCAATTATAATTGGAGGATTTGGAAATTGATTAATTCCCTTAATAATTGGAGCACCCGATATAGCATTCCCACGAATGAATAACATAAGATTTTGATTGCTACCACCGTCCCTAACACTTCTAATCATAAGTAGAATTACAGAAAGAGGAGCAGGAACAGGATGAACAGTATACCCTCCATTATCCAGAAACATCGCCCATAGAGGAGCATCTGTAGATTTAGCAATCTTTTCCCTACATCTAGCAGGAGTATCATCAATTTTAGGAGCAGTTAACTTCATTTCAACAATTATTAATATACGACCAGCAGGAATAACCCCAGAACGAATCCCATTATTTGTATGATCTGTAGGAATTACAGCACTACTACTCCTACTTTCATTACCCGTACTAGCAGGAGCCATTACCATACTCTTAACTGACCGAAACTTCAATACTTCTTTTTTTGACCCTGCTGGAGGAGGAGATCCCATCCTATATCAACATCTATTC

However, in the second file the DNA sequences can be split over several lines rather than sitting on a single line, and they are not always the same length.

EDIT

Here is my desired output:

>ANICH889-10
GGGATTTGGTAATTGATTAGTTCCTTTAATA---TTGGGGGCCCCTGACATAGCTTTTCCTCGTATAAATAATATAAGATTTTGATTATTACCTCCCTCTCTTACATTATTAATTTCAAGAAGAATTGTAGAAAATGGAGCTGGGACTGGATGAACTGTTTACCCTCCTTTATCTTCTAATATCGCCCATAGAGGAAGCTCTGTAGATTTA---GCAATTTTCTCTTTACATTTAGCAGGAATTTCTTCTATTTTAGGAGCAATTAATTTTATTACAACAATTATTAATATACGTTTAAATAATTTATCTTTCGATCAAATACCTTTATTTGTTTGAGCAGTAGGAATTACAGCATTTTTACTATTACTTTCTTTACCTGTATTAGCTGGA---GCTATTACTATATTATTAACT---------------------------------------------------------------------------
>ARONW984-15
TGGTAACTGATTAGTTCCATTAATACTAGGAGCCCCTGATATAGCCTTCCCCCGAATAAATAATATAAGATTTTGACTTTTACCTCCTTCTCTAATTCTTCTTTTATCAAGGTCTATTATNGAAAATGGAGCA---------GGAACTGGCTGAACAGTTTACCCTCCCCTTTCTTNTAATATTTCCCATGCTGGAGCTTCTGTAGATCTTGCAATCTTTTCCCTACACCTAGCAGGTATTTCCTCAATCCTAGGGGCAGTTAAT------TTTATCACAACCGTAATTAACATACGCTCTAGAGGAATTACATTTGATCGAATGCCTTTATTTGTATGATCTGTATTAATTACAGCTATTCTTCTACTACTCTCCCTCCCAGTATTAGCAGGGGCTATTACAATACTACTCACAGACCGAAATTTAAAT-----------------------------------

Here is the Python script I wrote to do this:

from Bio import SeqIO
from Bio.Seq import Seq
import csv
import sys

#Name of the datafile
Taxonomyfile = "02_Arthropoda_specimen_data_less.txt"

#Name of the original sequence file
OrigTaxonSeqsfile = "00_Arthropoda_specimen.fasta"

#Name of the output sequence file
f4 = open("02_Arthropoda_specimen_less.fasta", 'w')

#Reading the datafile and extracting record IDs   
TaxaKeep = []
with open(Taxonomyfile, 'r') as f1:
    datareader = csv.reader(f1, delimiter='\t')
    for item in datareader:
        TaxaKeep.append(item[0])
    print(len(TaxaKeep))    

#Filtering sequence file to keep only those sequences with the desired IDs
datareader = SeqIO.parse(OrigTaxonSeqsfile, "fasta")
for seq in datareader:
    for item in TaxaKeep:
        if item in seq.id:
            f4.write('>' + str(item) + '\n')
            f4.write(str(seq.seq) + '\n')

I think the trouble here is that I'm looping through the list of 1.7 million record names for each of the 4.8 million records. I thought about building a dictionary or something similar for the 4.8 million records, but I can't figure out how to do it. Any suggestions (including non-Python ones)?

Thanks!

I think you can get a big performance improvement by improving the look-up.

Using a set() can help you here. Sets are designed for very fast look-ups, and they don't store duplicate values, which makes them ideal for filtering data. So let's store all the taxonomy IDs from the input file in a set.

from Bio import SeqIO
from Bio.Seq import Seq
import csv
import sys

taxonomy_file = "02_Arthropoda_specimen_data_less.txt"
orig_taxon_sequence_file = "00_Arthropoda_specimen.fasta"
output_sequence_file = "02_Arthropoda_specimen_less.fasta"

# build a set for fast look-up of IDs
with open(taxonomy_file, 'r', newline='') as fp:
    datareader = csv.reader(fp, delimiter='\t')
    first_column = (row[0] for row in datareader)
    taxonomy_ids = set(first_column)

# use the set to speed up filtering the input FASTA file
with open(output_sequence_file, 'w') as fp:
    for seq in SeqIO.parse(orig_taxon_sequence_file, "fasta"):
        if seq.id in taxonomy_ids:
            fp.write('>')
            fp.write(seq.id)
            fp.write('\n')
            fp.write(str(seq.seq))
            fp.write('\n')
  • I've renamed some of your variables. There is no point in naming a variable f4 and then writing "#Name of the output sequence file" in a comment above it; why not drop the comment and simply name the variable output_sequence_file?
  • (row[0] for row in datareader) is a generator expression. It is iterable, which means it hasn't computed the list of IDs yet; it only knows what to do. This saves time and memory by not building a temporary list. One line later, the set() constructor, which accepts iterables, builds a set from all the IDs in the first column.
  • In the second block we use if seq.id in taxonomy_ids to check whether a sequence should be written out. in is very fast on sets.
  • I make several small .write() calls instead of building one temporary string out of the pieces. seq.id is already a string, but seq.seq is a Seq object, so it is converted with str() before writing.
  • I don't know much about the FASTA file format, but a quick look at the BioPython documentation suggests that SeqIO.write() is a better way to produce it (see the sketch below).
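
For completeness, here is a minimal sketch (my own addition, untested against your data) of how SeqIO.write() could replace the manual writes above. It reuses the taxonomy_ids set and the file names defined earlier and applies the same seq.id membership check; note that SeqIO.write() keeps each record's full original header rather than the trimmed ">ID" headers produced above.

from Bio import SeqIO

# lazily keep only the records whose ID is in the set
wanted = (rec for rec in SeqIO.parse(orig_taxon_sequence_file, "fasta")
          if rec.id in taxonomy_ids)

# Biopython handles the FASTA formatting, including line wrapping
count = SeqIO.write(wanted, output_sequence_file, "fasta")
print(count, "records written")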

Your reasoning is correct: with two nested for loops you are spending time on 4.8 million * 1.7 million individual operations.

That's why we will use an SQLite database to store all the information contained in OrigTaxonSeqsfile. Why SQLite? Because:

  • SQLite comes built into Python
  • SQLite supports indexes

I can't go into the CS theory here, but in a case like yours an index is a godsend when searching through data.

Once the data is indexed, you only have to look up each record ID from Taxonomyfile in the database and write it to f4, your final output file.

The following code should do what you want; it has the following advantages:

  • shows progress in terms of the number of lines processed
  • only needs Python 3, the Bio library is not strictly required
  • uses generators, so no file ever has to be read into memory all at once
  • doesn't rely on lists/sets/dicts because (in this case) they might consume too much RAM

Here is the code:

import sqlite3
from itertools import groupby
from contextlib import contextmanager

Taxonomyfile = "02_Arthropoda_specimen_data_less.txt"
OrigTaxonSeqsfile = "00_Arthropoda_specimen.fasta"

@contextmanager
def create_db(file_name):
    """ create SQLite db, works as context manager so file is closed safely"""
    conn = sqlite3.connect(file_name, isolation_level="IMMEDIATE")
    cur = conn.cursor()
    # executescript() is needed because there are two statements here
    cur.executescript("""
        CREATE TABLE taxonomy
        ( _id INTEGER PRIMARY KEY AUTOINCREMENT
        , record_id TEXT NOT NULL
        , record_extras TEXT
        , dna_sequence TEXT
        );
        CREATE INDEX idx_taxn_recID ON taxonomy (record_id);
    """)
    yield cur
    conn.commit()
    conn.close()
    return

def parse_fasta(file_like):
    """ generate that yields tuple containing record id, extra info
    in tail of header and the DNA sequence with newline characters
    """
    # inspiration = https://www.biostars.org/p/710/
    try:
        from Bio import SeqIO
    except ImportError:
        fa_iter = (x[1] for x in groupby(file_like, lambda line: line[0] == ">"))
        for header in fa_iter:
            # remove the ">"
            info = next(header)[1:].strip()
            # separate the record id from the rest of the header info
            x = info.split('|')
            recID, recExtras = x[0], x[1:]
            # build the DNA sequence from the following group of lines
            sequence = "".join(s.strip() for s in next(fa_iter))
            yield recID, recExtras, sequence
    else:
        fasta_sequences = SeqIO.parse(file_like, 'fasta')
        for fasta in fasta_sequences:
            info, sequence = fasta.id, str(fasta.seq)
            # separate the record id from the rest of the header info
            x = info.split('|')
            recID, recExtras = x[0], x[1:]
            yield recID, recExtras, sequence
    return

def prepare_data(txt_file, db_file):
    """ put data from txt_file into db_file building index on record id """
    i = 0
    src_gen = open(txt_file, mode='rt')
    fasta_gen = parse_fasta(src_gen)
    with create_db(db_file) as db:
        for recID, recExtras, dna_seq in fasta_gen:
            db.execute("""
                INSERT INTO taxonomy
                (record_id, record_extras, dna_sequence) VALUES (?,?,?)
                """,
                [recID, "|".join(recExtras), dna_seq]
            )
            i += 1
            if i % 100 == 0:
                print(i, 'lines digested into sql database')
    src_gen.close()
    return

def get_DNA_seq_of(recordID, src):
    """ search for recordID in src database and return a formatted string """
    ans = ""
    exn = src.execute("SELECT * FROM taxonomy WHERE record_id=?", [recordID])
    for match in exn.fetchall():
        a, b, c, dna_seq = match
        ans += ">%s\n%s\n" % (recordID, dna_seq)
    return ans

def main():
    # first of all prepare an optimized database
    db_file = OrigTaxonSeqsfile + ".sqlite"
    prepare_data(OrigTaxonSeqsfile, db_file)
    # now start searching and writing
    progress = 0
    db = sqlite3.connect(db_file)
    cur = db.cursor()
    out_file = open("02_Arthropoda_specimen_less.fasta", 'wt')
    taxa_file = open(Taxonomyfile, 'rt')
    with taxa_file, out_file:
        for line in taxa_file:
            question = line.split("\t")[0]
            answer = get_DNA_seq_of(question, cur)
            out_file.write(answer)
            progress += 1
            if progress % 100 == 0:
                print(progress, 'lines processed')
    db.close()

if __name__ == '__main__':
    main()
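
As an aside from me (not part of the original answer): once prepare_data() has built the database, you can quickly check that SQLite is really using the index rather than scanning the whole table. The query plan should mention "USING INDEX idx_taxn_recID".

import sqlite3

# the database file name produced by main() above
conn = sqlite3.connect("00_Arthropoda_specimen.fasta.sqlite")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT dna_sequence FROM taxonomy WHERE record_id = ?",
    ["ANICH889-10"],
).fetchall()
print(plan)  # expect a SEARCH step using idx_taxn_recID, not a full SCAN
conn.close()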

Feel free to ask for any clarification.
If you run into any errors, or the output is not what you expected, please send me a 200-line sample of each of Taxonomyfile and OrigTaxonSeqsfile and I will update the code.


Speed improvement

The following is a rough estimate, considering only disk I/O, since that is the slowest part.

Let a = 4.8 million and b = 1.7 million.

With the old approach you would have to perform a * b, i.e. 8.16 trillion, disk I/O operations.

With my approach, once the indexing is done (i.e. 2 * a operations), you only have to search 1.7 million records. So with my approach the total is 2 * (a + b), i.e. 13 million disk I/O operations. That is not small either, but this approach is more than 600,000 times faster.
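
Spelled out, the back-of-the-envelope arithmetic behind those numbers is:

a = 4_800_000               # records in OrigTaxonSeqsfile
b = 1_700_000               # record IDs in Taxonomyfile

old_cost = a * b            # nested loops: 8,160,000,000,000 (~8.16 trillion)
new_cost = 2 * (a + b)      # index once, then look up: 13,000,000
print(old_cost / new_cost)  # ~627,692, i.e. more than 600,000x faster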

Why not use a dict()?

I get told off by bosses and professors if they catch me using too much CPU/RAM. If you own the system, a simpler dict-based approach is:

from itertools import groupby

Taxonomyfile = "02_Arthropoda_specimen_data_less.txt"
OrigTaxonSeqsfile = "00_Arthropoda_specimen.fasta"

def parse_fasta(file_like):
    """ generate that yields tuple containing record id, extra info
    in tail of header and the DNA sequence with newline characters
    """
    from Bio import SeqIO
    fasta_sequences = SeqIO.parse(file_like, 'fasta')
    for fasta in fasta_sequences:
        info, sequence = fasta.id, str(fasta.seq)
        # separate the record id from the rest of the header info
        x = info.split('|')
        recID, recExtras = x[0], x[1:]
        yield recID, recExtras, sequence
    return

def prepare_data(txt_file, dct):
    """ put data from txt_file into the dct dictionary """
    i = 0
    with open(txt_file, mode='rt') as src_gen:
        fasta_gen = parse_fasta(src_gen)
        for recID, recExtras, dna_seq in fasta_gen:
            dct[recID] = dna_seq
            i += 1
            if i % 100 == 0:
                print(i, 'lines digested into the dict')
    return

def get_DNA_seq_of(recordID, src):
    """ search for recordID in src database and return a formatted string """
    ans = ""
    dna_seq = src[recordID]
    ans += ">%s\n%s\n" % (recordID, dna_seq)
    return ans

def main():
    # first of all prepare an optimized database
    dct = dict()
    prepare_data(OrigTaxonSeqsfile, dct)
    # now start searching and writing
    progress = 0
    out_file = open("02_Arthropoda_specimen_less.fasta", 'wt')
    taxa_file = open(Taxonomyfile, 'rt')
    with taxa_file, out_file:
        for line in taxa_file:
            question = line.split("\t")[0]
            answer = get_DNA_seq_of(question, dct)
            out_file.write(answer)
            progress += 1
            if progress % 100 == 0:
                print(progress, 'lines processed')
    return

if __name__ == '__main__':
    main()

I asked for clarification in the comments under your question, but you haven't responded yet (no criticism implied), so I'll try to answer your question based on the following assumptions.

  1. In the second data file, each record takes up two lines: the first line is a header of sorts, and the second line is the ACGT sequence.
  2. In the header line we have a ">" prefix, followed by some fields separated by "|". The first of these fields is the ID of the whole two-line record.

Based on the above assumptions:

# If possible, no hardcoded filenames, use sys.argv and the command line
import sys

# command line sanity check
if len(sys.argv) != 4:
    print('A descriptive error message')
    sys.exit(1)

# Names of the input and output files
fn1, fn2, fn3 = sys.argv[1:]

# Use a set comprehension to load the IDs from the first file
IDs = {line.split()[0] for line in open(fn1)} # a set

# Operate on the second file
with open(fn2) as f2:

    # It is possible to use `for line in f2: ...` but here we have(?)
    # two line records, so it's a bit different
    while True:

        # Try to read two lines from file
        try:
            header = next(f2)
            payload = next(f2)
        # no more lines? break out from the while loop...
        except StopIteration:
            break

        # Sanity check on the header line
        if header[0] != ">":
            print('Incorrect header line: "%s".'%header)
            sys.exit(1)

        # Split the header line on "|", find the current ID
        ID = header[1:].split("|")[0]

        # Check if the current ID was mentioned in the first file
        if ID in IDs:
            # your code

Because there is no inner loop, it should be about 6 orders of magnitude faster... whether it does what you need remains to be seen :-)
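
If you want to convince yourself of the difference the hash-based look-up makes, here is a small illustrative timing sketch (my addition, not a rigorous benchmark) comparing membership tests on a list versus a set of 1.7 million synthetic IDs:

import timeit

ids_list = ["ID%07d" % i for i in range(1_700_000)]
ids_set = set(ids_list)
probe = "ID1699999"  # worst case for the list: near the end

print(timeit.timeit(lambda: probe in ids_list, number=10))  # linear scan
print(timeit.timeit(lambda: probe in ids_set, number=10))   # hash look-up, ~constant time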
