Improving the efficiency of matching in Python

Question

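Given input like the following, we want to keep the rows that share the same APP_ID (the 4th column) and whose Category column (the 18th column) contains one "Cell" together with one "Biochemical", or one "Cell" together with one "Enzyme".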

A,APPID,C,APP_ID,D,E,F,G,H,I,J,K,L,M,O,P,Q,Category,S,T
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-2 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,

The ideal output would be:

A,APPID,C,APP_ID,D,E,F,G,H,I,J,K,L,M,O,P,Q,Category,S,T
,,, APP-1 ,,,,,,,,,,,,,, Enzyme ,,
,,, APP-3 ,,,,,,,,,,,,,, Biochemical ,,
,,, APP-1 ,,,,,,,,,,,,,, Cell ,,
,,, APP-3 ,,,,,,,,,,,,,, Cell ,,

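"APP-1" is kept because its two rows have the same value in column 3 (counting from zero, so the 4th column) and their categories are one "Cell" and one "Enzyme". The same holds for "APP-3", which has one "Cell" and one "Biochemical" in its Category column.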

The following attempt achieves this:

import os

App=["1"]

for a in App:
    outname="App_"+a+"_target_overlap.csv"
    out=open(outname,'w')
    ticker=0
    cell_comp_id=[]
    final_comp_id=[]

    # make compound with cell activity (to a target) list first

    filename="App_"+a+"_target_Detail_average.csv"
    if os.path.exists(filename):
        file = open (filename)
        line=file.readlines()
        if(ticker==0): # Deal with the title
            out.write(line[0])
            ticker=ticker+1

            for c in line[1:]:
                c=c.split(',')
                if(c[17]==" Cell "):
                     cell_comp_id.append(c[3])
    else:
        cell_comp_id=list(set(cell_comp_id))

# while we have list of compounds with cell activity, now we search the Bio and Enz and make one final compound list

    if os.path.exists(filename):

        for c in line[1:]:
            temporary_line=c #for output_temp
            c=c.split(',')
            for comp in cell_comp_id:
                if (c[3]==comp and c[17]==" Biochemical "):
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
                elif (c[3]==comp and c[17]==" Enzyme "):
                    final_comp_id.append(comp)
                    out.write(str(temporary_line))
    else:
        final_comp_id=list(set(final_comp_id))

# After we obtain a final compound list in target a, we go through the csv again to output the cell data

    filename="App_"+a+"_target_Detail_average.csv"

    if os.path.exists(filename):

        for c in line[1:]:
            temporary_line=c #for output_temp
            c=c.split(',')
            for final in final_comp_id:
                if (c[3]==final and c[17]==" Cell "):
                    out.write(str(temporary_line))

    out.close()

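This script finishes in a reasonable time when the input file is small (tens of thousands of lines). However, the input files will grow to millions or even billions of lines, and then the script takes forever (days...). I think the problem is that we first build a list of APP_IDs that have "Cell" in column 18, and then compare that "Cell" list (possibly millions of entries) against every row of the whole file (also millions of rows): if an APP_ID in the Cell list matches a row in the file and that row's column 18 is "Enzyme" or "Biochemical", we keep the row. This step seems to be very time consuming.

I am wondering whether building dictionaries for "Cell", "Enzyme" and "Biochemical" and comparing them would be faster. Does any guru have a better way to handle this? Any examples or comments would be helpful. Thanks.

We are using Python 2.7.6.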

Answer 1

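Reading the file efficiently

One big problem is that you read the file all in one go with readlines. That requires loading the entire file into memory at once, and I doubt you have that much memory available.

Try: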

with open(filename) as fh:
    out.write(fh.readline()) # ticker
    for line in fh: #iterate through lines 'lazily', reading as you go.
        c = line.split(',')

Code in this style is a good starting point and should help a lot. In context:

# make compound with cell activity (to a target) list first

if os.path.exists(filename):
    with open(filename) as fh:
        out.write(fh.readline()) # ticker
        for line in fh:
            cols = line.split(',')
            if cols[17] == " Cell ":
                cell_comp_id.append(cols[3])

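The with open(...) as syntax is a very common Python idiom. It takes care of closing the file automatically when the with block finishes or when an error occurs. Very useful.

Sets

The next thing, as you suggested, is to make better use of sets.

You don't need to recreate the set every time; you can just update it to add items. Here is some example set code (written in Python interpreter style: the >>> at the start of a line means it is something to type in. Don't actually type the >>> bit!):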

>>> my_set = set()
>>> my_set
set([])

>>> my_set.update([1,2,3])
>>> my_set
set([1,2,3])

>>> my_set.update(["this","is","stuff"])
>>> my_set
set([1,2,3,"this","is","stuff"])

>>> my_set.add('apricot')
>>> my_set
set([1,2,3,"this","is","stuff","apricot"])

>>> my_set.remove("is")
>>> my_set
set([1,2,3,"this","stuff","apricot"])

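So you can add items to a set and remove them from it without creating a new set from scratch (which is what you do every time you run the cell_comp_id=list(set(cell_comp_id)) bit).

You can also take differences, intersections and so on: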

>>> set(['a','b','c','d']) & set(['c','d','e','f'])
set(['c','d'])

>>> set([1,2,3]) | set([3,4,5])
set([1,2,3,4,5])

See the documentation for more information.

So let's try it:

cells = set()
enzymes = set()
biochemicals = set()

with open(filename) as fh:
    out.write(fh.readline()) #ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]

        if row_category == ' Cell ':
            cells.add(row_id)
        elif row_category == ' Biochemical ':
            biochemicals.add(row_id)
        elif row_category == ' Enzyme ':
            enzymes.add(row_id)

Now you have sets of cells, biochemicals and enzymes. You only want the intersections of these, so:

cells_and_enzymes = cells & enzymes
cells_and_biochemicals = cells & biochemicals

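Then you can go through the file again and simply check whether row_id (or c[3]) is in either of those sets; if it is, write the row out.

In fact, you can go a step further and merge the two sets: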

cells_with_enz_or_bio = cells_and_enzymes | cells_and_biochemicals

This is the set of IDs that have a cell record plus an enzyme or biochemical record.

So on your second pass through the file, you can do:

if row_id in cells_with_enz_or_bio:
    out.write(line)
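
Putting the pieces above together, a minimal end-to-end sketch of the two-pass approach might look like this. The names ids_to_keep, filter_rows and make_row are made up for illustration, the sample rows mirror the question's data, and the category is stripped of its surrounding spaces before comparing. The demo uses io.StringIO in place of a real file and is written for a current Python 3; it is a sketch under those assumptions, not a drop-in replacement.

```python
import io

def ids_to_keep(lines):
    """First pass: collect IDs per category and return the ones to keep."""
    cells, enzymes, biochemicals = set(), set(), set()
    for line in lines:
        cols = line.rstrip("\n").split(",")
        row_id, category = cols[3], cols[17].strip()
        if category == "Cell":
            cells.add(row_id)
        elif category == "Biochemical":
            biochemicals.add(row_id)
        elif category == "Enzyme":
            enzymes.add(row_id)
    # IDs with a Cell row and at least one Enzyme or Biochemical row
    return cells & (enzymes | biochemicals)

def filter_rows(open_input, out):
    """open_input is a zero-argument callable returning a fresh file-like object."""
    with open_input() as fh:
        header = fh.readline()
        keep = ids_to_keep(fh)          # pass 1: lazy iteration over the rows
    out.write(header)
    with open_input() as fh:
        fh.readline()                   # skip the header on the second pass
        for line in fh:
            if line.split(",")[3] in keep:
                out.write(line)

def make_row(app_id, category):
    """Hypothetical helper: build a 20-column row like the question's sample."""
    cols = [""] * 20
    cols[3], cols[17] = app_id, " %s " % category
    return ",".join(cols) + "\n"

HEADER = "A,APPID,C,APP_ID,D,E,F,G,H,I,J,K,L,M,O,P,Q,Category,S,T\n"
SAMPLE = HEADER + "".join(make_row(i, c) for i, c in [
    ("APP-1", "Cell"), ("APP-1", "Enzyme"), ("APP-2", "Cell"),
    ("APP-3", "Cell"), ("APP-3", "Biochemical")])

out = io.StringIO()
filter_rows(lambda: io.StringIO(SAMPLE), out)
kept = [line.split(",")[3] for line in out.getvalue().splitlines()[1:]]
print(kept)
```

Note that this writes rows in file order, whereas the script in the question writes the Enzyme/Biochemical rows first and the Cell rows afterwards.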

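After all that?

Just using these suggestions might be enough for you. However, you are still storing all the cell, biochemical and enzyme IDs in memory, and you are still going through the file twice.

So there are two ways we could potentially speed things up while still staying within a single Python process. I don't know how much memory you have available. If you run out of memory, that could slow things down somewhat.

Reducing the sets as we go.

If you really do have a million records, and 800,000 of them are pairs (a cell record and a biochemical record), then by the time you reach the end of the file you are storing 800,000 IDs in sets. To reduce memory usage, once we have established that we want to output a record, we can save that information (that we want to print the record) to a file on disk and stop storing it in memory. We can then read that list back later to work out which records to print.

Since this increases disk IO, it could be slower. But if you are running out of memory, it may reduce swapping and so end up faster. It's hard to tell.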

with open('to_output.tmp','a') as to_output:
    for a in App:
        # ... do your reading thing into the sets ...

        if row_id in cells and (row_id in biochemicals or row_id in enzymes):
            to_output.write('%s,' % row_id)
            cells.discard(row_id)
            # discard, unlike remove, does not raise KeyError when the ID is absent
            biochemicals.discard(row_id)
            enzymes.discard(row_id)

Once you have read through all the files, you have a file (to_output.tmp) containing all the IDs you want to keep. So you can read it back into Python:

with open('to_output.tmp') as ids_file:
    ids_to_keep = set(ids_file.read().split(','))

This means that on your second pass through the file you can simply say:

if row_id in ids_to_keep:
    out.write(line)
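
For reference, here is one way the spill-to-disk idea could be made runnable. It is a sketch with a few assumptions of my own: the function names spill_matched_ids, load_ids and make_row are hypothetical, enzymes and biochemicals share one set, and IDs are written one per line rather than comma-separated so that reading them back leaves no empty trailing entry.

```python
import os
import tempfile

def spill_matched_ids(lines, tmp_path):
    """Write each ID to tmp_path as soon as it has a Cell row and an
    Enzyme or Biochemical row, then forget it to keep the sets small."""
    cells, others = set(), set()
    with open(tmp_path, "w") as to_output:
        for line in lines:
            cols = line.split(",")
            row_id, category = cols[3], cols[17].strip()
            if category == "Cell":
                cells.add(row_id)
            elif category in ("Biochemical", "Enzyme"):
                others.add(row_id)
            if row_id in cells and row_id in others:
                to_output.write(row_id + "\n")
                cells.discard(row_id)
                others.discard(row_id)

def load_ids(tmp_path):
    """Read the spilled IDs back; duplicates collapse because this is a set."""
    with open(tmp_path) as ids_file:
        return set(line.rstrip("\n") for line in ids_file)

def make_row(app_id, category):
    """Hypothetical helper: build a 20-column row like the question's sample."""
    cols = [""] * 20
    cols[3], cols[17] = app_id, " %s " % category
    return ",".join(cols) + "\n"

rows = [make_row(i, c) for i, c in [
    ("APP-1", "Cell"), ("APP-1", "Enzyme"), ("APP-2", "Cell"),
    ("APP-3", "Biochemical"), ("APP-3", "Cell")]]

fd, tmp_path = tempfile.mkstemp(suffix=".tmp")
os.close(fd)
spill_matched_ids(rows, tmp_path)
ids_to_keep = load_ids(tmp_path)
os.remove(tmp_path)
print(sorted(ids_to_keep))
```

An ID that shows up again after being spilled is simply tracked afresh, and may be written to the temporary file a second time; that is harmless because the IDs are loaded back into a set.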

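Using `dict` instead of sets:

If you have plenty of memory, you can bypass all of that and use dicts to store the data, which lets you go over the file only once and do without sets entirely.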

cells = {}
enzymes = {}
biochemicals = {}

with open(filename) as fh:
    out.write(fh.readline()) #ticker
    for line in fh:
        cols = line.split(',')
        row_id = cols[3]
        row_category = cols[17]

        if row_category == ' Cell ':
            cells[row_id] = line
        elif row_category == ' Biochemical ':
            biochemicals[row_id] = line
        elif row_category == ' Enzyme ':
            enzymes[row_id] = line

        if row_id in cells and row_id in biochemicals:
            out.write(cells[row_id])
            out.write(biochemicals[row_id])
            if row_id in enzymes:
                out.write(enzymes[row_id])
        elif row_id in cells and row_id in enzymes:
            out.write(cells[row_id])
            out.write(enzymes[row_id])

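The problem with this approach is that if any rows are duplicated, it will get confused.

If you are confident that the input records are unique, and that each ID has either an enzyme record or a biochemical record but not both, then you can easily add del cells[row_id] and the corresponding others to remove rows from the dicts once you have printed them, which will reduce memory usage.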
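
If duplicates cannot be ruled out, one possible variation on the single-pass idea is to hold rows back per ID until that ID qualifies, then flush them and write any later rows for the same ID straight through. This is my own sketch rather than the code above; filter_single_pass and make_row are hypothetical names, and rows for IDs that never qualify stay in memory until the end.

```python
import io

def filter_single_pass(lines, out):
    pending = {}       # row_id -> rows held back until the ID qualifies
    seen = {}          # row_id -> set of categories seen so far
    qualified = set()  # IDs known to have Cell plus Enzyme or Biochemical
    for line in lines:
        cols = line.split(",")
        row_id, category = cols[3], cols[17].strip()
        if category not in ("Cell", "Enzyme", "Biochemical"):
            continue
        if row_id in qualified:
            out.write(line)            # already qualified: write immediately
            continue
        seen.setdefault(row_id, set()).add(category)
        pending.setdefault(row_id, []).append(line)
        cats = seen[row_id]
        if "Cell" in cats and ("Enzyme" in cats or "Biochemical" in cats):
            out.writelines(pending.pop(row_id))  # flush everything held back
            del seen[row_id]
            qualified.add(row_id)

def make_row(app_id, category):
    """Hypothetical helper: build a 20-column row like the question's sample."""
    cols = [""] * 20
    cols[3], cols[17] = app_id, " %s " % category
    return ",".join(cols) + "\n"

rows = [make_row(i, c) for i, c in [
    ("APP-1", "Cell"), ("APP-1", "Enzyme"), ("APP-2", "Cell"),
    ("APP-3", "Biochemical"), ("APP-3", "Cell"), ("APP-1", "Cell")]]

out = io.StringIO()
filter_single_pass(rows, out)
written = [(l.split(",")[3], l.split(",")[17].strip())
           for l in out.getvalue().splitlines()]
print(written)
```

The trade-off is the same as described above: only one pass over the file, at the cost of keeping whole rows in memory for every ID that has not yet found its partner.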

I hope this helps :-)

Answer 2

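One technique I have used to process massive files quickly in Python is to use the multiprocessing library to split the file into large chunks and process those chunks in parallel in worker subprocesses.

Here is the general algorithm: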

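Determine how much of the file you can read at a time, based on the amount of memory available on the system that will run this script. The goal is to make the chunks as large as you can without causing thrashing.
Pass the file name and the chunk start/end positions to the subprocesses; each one opens the file, reads in and processes its section of the file, and returns its results.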

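Specifically, I like to use a multiprocessing pool, build a list of chunk start/stop positions, and then use the pool.map() function. This blocks until every worker has finished, and the results from each subprocess are available as the return value of the map call.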
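
The list of chunk start/stop positions has to fall on line boundaries, otherwise a worker would start or end in the middle of a row. A minimal sketch of one way to compute such positions follows; chunk_offsets is a hypothetical helper, not part of the multiprocessing library, and the demo file is a throwaway temporary file.

```python
import os
import tempfile

def chunk_offsets(path, n_chunks):
    """Return (start, end) byte offsets that split the file at line boundaries."""
    size = os.path.getsize(path)
    step = max(1, size // n_chunks)
    offsets = []
    with open(path, "rb") as fh:
        start = 0
        while start < size:
            fh.seek(min(start + step, size))
            fh.readline()              # move forward to the end of this line
            end = fh.tell()
            offsets.append((start, end))
            start = end
    return offsets

# Demo: ten 10-byte lines, split into roughly three chunks.
content = "".join("row%d,data\n" % i for i in range(10)).encode("ascii")
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(content)
    demo_path = tmp.name

offsets = chunk_offsets(demo_path, 3)
pieces = []
with open(demo_path, "rb") as fh:
    for start, end in offsets:
        fh.seek(start)
        pieces.append(fh.read(end - start))
os.remove(demo_path)
print(offsets)
```

Each (start, end) pair, together with the file name, could then be one item in the list handed to pool.map(), with the worker function seeking to start and reading end - start bytes.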

For example, in a subprocess you could do something like this:

import operator

def process_chunk(args):
    # process_chunk is an illustrative name. args is (fi_name, chunk_start,
    # chunk_end): a file name and the byte positions to start and end at.
    # chunk_start is assumed to fall on a line boundary.
    fi_name, chunk_start, chunk_end = args

    with open(fi_name, 'r') as fi:
        fi.seek(chunk_start)
        chunk = fi.readlines(chunk_end - chunk_start)

    retriever = operator.itemgetter(3, 17) # extracts only the elements we want
    APPIDs = {}

    for line in chunk:

        ID, category = retriever(line.split(','))
        category = category.strip() # the sample data pads categories with spaces
        try:
            APPIDs[ID].append(category) # we've seen this ID before, add category to its list
        except KeyError:
            APPIDs[ID] = [category] # we haven't seen this ID before - make an entry

    # APPIDs entries will look like this:
    #
    # <APPID> : [list of categories]

    return APPIDs

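In your main process, you would retrieve all the returned dictionaries, resolve duplicates or overlaps, and then output something like this: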

for ID, categories in APPIDs.iteritems():
    if ('Cell' in categories) and ('Biochemical' in categories or 'Enzyme' in categories):
        print(ID) # or write out the rows for this ID, or whatever you need
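
The "resolve duplicates or overlaps" step matters because rows for one APPID can land in different chunks. A small sketch of how the per-chunk dictionaries might be merged before filtering is shown below; merge_results and matching_ids are hypothetical names, and the chunk_results literal stands in for whatever pool.map() returned.

```python
def merge_results(chunk_results):
    """Combine per-chunk {ID: [categories]} dicts into one {ID: set(categories)}."""
    merged = {}
    for result in chunk_results:
        for app_id, categories in result.items():
            merged.setdefault(app_id, set()).update(categories)
    return merged

def matching_ids(merged):
    """IDs with a Cell entry plus a Biochemical or Enzyme entry, sorted."""
    return sorted(
        app_id for app_id, cats in merged.items()
        if "Cell" in cats and ("Biochemical" in cats or "Enzyme" in cats)
    )

# Stand-in for the list returned by pool.map(): APP-1 is split across chunks.
chunk_results = [
    {"APP-1": ["Cell"], "APP-2": ["Cell"]},
    {"APP-1": ["Enzyme"], "APP-3": ["Cell", "Biochemical"]},
]

merged = merge_results(chunk_results)
print(matching_ids(merged))
```

Using a set per ID means repeated categories from different chunks collapse automatically, and the membership tests in the filter stay cheap.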

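Some notes/caveats:

Be mindful of the load on the hard disk/SSD/wherever your data is located. If your current method is already maxing out its throughput, you probably won't see any performance improvement from this. You could try implementing the same algorithm with threading instead.
If you do get heavy disk load that is not caused by memory swapping, you can also reduce the number of simultaneous subprocesses allowed in the pool. This results in fewer read requests to the drive while still taking advantage of truly parallel processing.
Look for patterns in your input data that you can exploit. For example, if you can rely on matching APPIDs being next to each other, you can do all of the comparisons in the subprocesses and let your main process hang out until it is time to combine the subprocess data structures.

TL;DR

Split the file into chunks and process them in parallel with the multiprocessing library.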
