How can I make this Python script run faster, or use multiprocessing, in this case?

Question

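I am trying to compute four similarity measures (cosine_similarity, jaccard, Sequence Matcher similarity, jaccard_variants similarity) for 800K pairs of documents.

Each document is a txt file of about 100KB~300KB (about 1500000 characters).

I have two questions about how to make my Python script faster:

MY PYTHON SCRIPT: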

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from difflib import SequenceMatcher

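def get_tf_vectors(doc1, doc2):
    # Build term-count vectors for the two documents over their shared vocabulary
    text = [doc1, doc2]
    vectorizer = CountVectorizer()  # the documents go to fit/transform, not the constructor
    vectorizer.fit(text)
    return vectorizer.transform(text).toarray()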

def measure_sim(doc1, doc2):
    a, b = doc1.split(), doc2.split()
    c, d = set(a), set(b)
    vectors = [t for t in get_tf_vectors(doc1, doc2)]
    return cosine_similarity(vectors)[1][0], float(len(c&d) / len(c|d)), \
            1 - (sum(abs(vectors[0] - vectors[1])) / sum(vectors[0] + vectors[1])), \
            SequenceMatcher(None, a, b).ratio()

# items in doc_pair_list look like ('ID', 'doc1_directory', 'doc2_directory')
def data_analysis(doc_pair_list):
    result = {}
    for item in doc_pair_list:
        f1 = open(item[1], 'rb')
        doc1 = f1.read()
        f1.close()
        f2 = open(item[2], 'rb')
        doc2 = f2.read()
        f2.close()
        result[item[0]] = measure_sim(doc1, doc2)
    return result

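However, this code uses only 10% of my CPU, and the task would take almost 20 days to finish. So I would like to ask whether there is any way to make this code more efficient.

Q1. Since the documents are stored on an HDD, I think loading the text data takes some time. So I suspect that loading only two documents each time a similarity is computed may be inefficient. I am therefore going to try loading 50 pairs of documents at once and computing the similarities for each of them. Would that help?

Q2. Most posts about "how to make your code run faster" say I should use Python modules based on C code. But since I am already using the sklearn module, which is known to be quite efficient, I wonder whether there is any better way.

Is there any way to help this Python script use more of the computer's resources and become faster?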

Answer 1

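There may be better solutions, but if the similarity computation is the bottleneck, you could try something like this: 1) a separate process that reads all the files one by one and puts them into a multiprocessing.Queue; 2) a pool of several worker processes that compute the similarities and put the results into another multiprocessing.Queue; 3) the main thread then simply takes the results from results_queue and saves them to a dictionary.

I don't know your hardware limits (number and speed of CPU cores, RAM size, disk read speed), and I don't have any samples to test this with. EDIT: The code described above is provided below. Please check whether it is faster and let me know. If the main bottleneck is loading the files, we can create more loader processes (e.g. 2 processes, each loading half of the files). If the bottleneck is computing the similarities, you can create more worker processes (just change worker_count). At the end, "results" is the dictionary containing all the results.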

    import multiprocessing
    import os
    from difflib import SequenceMatcher
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics.pairwise import cosine_similarity


    def get_tf_vectors(doc1, doc2):
        text = [doc1, doc2]
        vectorizer = CountVectorizer()  # the documents go to fit/transform, not the constructor
        vectorizer.fit(text)
        return vectorizer.transform(text).toarray()


    def calculate_similarities(doc_pairs_queue, results_queue):
        """ Pick docs from doc_pairs_queue and calculate their similarities, save the result to results_queue. Repeat infinitely (until process is terminated). """
        while True:
            pair = doc_pairs_queue.get()
            pair_id = pair[0]
            doc1 = pair[1]
            doc2 = pair[2]
            a, b = doc1.split(), doc2.split()
            c, d = set(a), set(b)
            vectors = [t for t in get_tf_vectors(doc1, doc2)]
            results_queue.put((pair_id, cosine_similarity(vectors)[1][0], float(len(c&d) / len(c|d)),
                1 - (sum(abs(vectors[0] - vectors[1])) / sum(vectors[0] + vectors[1])),
                SequenceMatcher(None, a, b).ratio()))


    def load_files(doc_pair_list, loaded_queue):
        """
        Pre-load files and put them to a queue, so working processes can get them.
        :param doc_pair_list: list of files to be loaded (ID, doc1_path, doc2_path)
        :param loaded_queue: multiprocessing.Queue that will hold pre-loaded data
        """
        print("Started loading files...")
        for item in doc_pair_list:
            with open(item[1], 'rb') as f1:
                with open(item[2], 'rb') as f2:
                    loaded_queue.put((item[0], f1.read(), f2.read()))  # if queue is full, this automatically waits until there is space

        print("Finished loading files.")


    def data_analysis(doc_pair_list):
        # create a loader process that will pre-load files (it does no calculations, so it loads much faster)
        # loader puts loaded files to a queue; 1 pair ~ 500 KB, 1000 pairs ~ 500 MB max size of queue (RAM memory)
        loaded_pairs_queue = multiprocessing.Queue(maxsize=1000)
        loader = multiprocessing.Process(target=load_files, args=(doc_pair_list, loaded_pairs_queue))
        loader.start()

        # create worker processes - these will do all calculations
        results_queue = multiprocessing.Queue(maxsize=1000)  # workers put results to this queue
        worker_count = os.cpu_count() if os.cpu_count() else 2  # number of worker processes
        workers = []  # create list of workers, so we can terminate them later
        for i in range(worker_count):
            worker = multiprocessing.Process(target=calculate_similarities, args=(loaded_pairs_queue, results_queue))
            worker.start()
            workers.append(worker)

        # main process just picks the results from queue and saves them to the dictionary
        results = {}
        i = 0  # results counter
        pairs_count = len(doc_pair_list)
        while i < pairs_count:
            res = results_queue.get(timeout=600)  # timeout is just in case something unexpected happened (results are calculated much quicker)
            # Queue.get() is blocking - if queue is empty, get() waits until something is put into queue and then get it
            results[res[0]] = res[1:]  # save to dictionary by ID (first item in the result)
            i += 1  # count this result so the loop ends once every pair has been collected

        # clean up the processes (so there aren't any zombies left)
        loader.terminate()
        loader.join()
        for worker in workers:
            worker.terminate()
            worker.join()

        return results  # dictionary: pair ID -> tuple of the four similarity values
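
The idea mentioned above of using more than one loader process (each loading part of the files) could be sketched roughly as follows. This is only an illustration: `split_pairs` and `start_loaders` are hypothetical helper names that do not appear in the code above, and `load_fn` is assumed to be a function with the same signature as `load_files`. On a single HDD, several concurrent readers may not help and could even slow things down, so it is worth measuring before adopting it.

```python
import multiprocessing


def split_pairs(doc_pair_list, parts):
    """Split the pair list into `parts` interleaved slices of near-equal size."""
    return [doc_pair_list[i::parts] for i in range(parts)]


def start_loaders(doc_pair_list, loaded_queue, load_fn, loader_count=2):
    """Start `loader_count` processes, each running load_fn(slice, loaded_queue).

    load_fn is assumed to behave like load_files() above: it reads every pair
    in its slice and puts (ID, doc1, doc2) tuples onto loaded_queue.
    """
    loaders = []
    for chunk in split_pairs(doc_pair_list, loader_count):
        process = multiprocessing.Process(target=load_fn, args=(chunk, loaded_queue))
        process.start()
        loaders.append(process)
    return loaders  # the caller should terminate() and join() each of these at the end
```

In `data_analysis`, the single `loader` would then be replaced by the list returned from `start_loaders(doc_pair_list, loaded_pairs_queue, load_files)`, with the clean-up step looping over that list.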

Please let me know the results; I am very interested in them and will help you further if needed ;)

Answer 2

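The first thing to do is to see whether you can find the real bottleneck. I think using cProfile might confirm your suspicions or shed more light on the problem.

You should be able to run your unmodified code with cProfile like this: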
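
As a rough first check of whether reading from disk or computing dominates, one pair can be timed by hand. The sketch below is an assumption-laden illustration, not part of the original script: `timed_pair` is a hypothetical helper, and it times only the SequenceMatcher step as a stand-in for the full `measure_sim`. Note that a file read a second time may come from the operating system's cache, so the read time for a repeated pair can look much smaller than a first read from the HDD.

```python
import time
from difflib import SequenceMatcher


def timed_pair(path1, path2):
    """Return (seconds spent reading, seconds spent comparing) for one document pair."""
    t0 = time.perf_counter()
    with open(path1, "rb") as f1, open(path2, "rb") as f2:
        doc1, doc2 = f1.read(), f2.read()
    t1 = time.perf_counter()
    # Stand-in for the full measure_sim(): only the SequenceMatcher part is timed here
    SequenceMatcher(None, doc1.split(), doc2.split()).ratio()
    t2 = time.perf_counter()
    return t1 - t0, t2 - t1
```

Running this over a few dozen pairs and summing both numbers should give a first impression of which side to optimize.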

python -m cProfile -o profiling-results python-file-to-test.py

Afterwards, you can analyze the results with pstats, like so:

import pstats
stats = pstats.Stats("profiling-results")
stats.sort_stats("tottime")

stats.print_stats(10)

Marco Bonzanini's blog post "My Python Code is Slow? Tips for Profiling" has more information on profiling code.
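
If profiling the whole script is impractical (for example because a full run takes days), the same cProfile and pstats modules can be used from inside Python on a single call. The following is a minimal sketch under that assumption: `profile_call` and `sequence_ratio` are hypothetical names introduced here, and `sequence_ratio` stands in for just one of the four measures in the question.

```python
import cProfile
import io
import pstats
from difflib import SequenceMatcher


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile; return (result, text report sorted by tottime)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    buf = io.StringIO()
    pstats.Stats(profiler, stream=buf).sort_stats("tottime").print_stats(top)
    return result, buf.getvalue()


def sequence_ratio(doc1, doc2):
    # Stand-in for one of the four similarity measures in the question
    return SequenceMatcher(None, doc1.split(), doc2.split()).ratio()


ratio, report = profile_call(sequence_ratio, "a b c d", "a b x d")
print(report)
```

Calling `profile_call` on the real `measure_sim` with one representative document pair should show which of the four measures takes most of the time.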


Solution 1
2 2018-08-20 12:03:02

Solution 2
1 2018-08-20 11:15:57
