Iterating over a for loop using multiple cores in Python

Question

I have the following code, which currently runs as ordinary single-process Python:

def remove_missing_rows(app_list):
    print("########### Missing row removal ###########")
    missing_rows = []

    # Remove any row that has missing data in the name, id, or description column
    for row in app_list:
        if not row[1]:
            missing_rows.append(row)
            continue  # Continue loop to next row. No need to check more columns
        if not row[5]:
            missing_rows.append(row)
            continue  # Continue loop to next row. No need to check more columns
        if not row[4]:
            missing_rows.append(row)

    print("Number of missing entries: " + str(len(missing_rows)))  # 967 with current method

    # Remove the missing_rows from the original data
    app_list = [row for row in app_list if row not in missing_rows]
    return app_list

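Now that I have written the code against a small example, I would like to run it on a very large dataset. For that, I think it would be useful to take advantage of my computer's multiple cores.

I am struggling to implement this with the multiprocessing module. My idea is that, for example, core 1 could process the first half of the dataset while core 2 processes the second half, and so on, all at the same time. Is this possible?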

Answer 1

This probably isn't CPU-bound. Try the code below.

I have used a set, whose hash-based contains check is very fast. That check is what runs every time if row not in missing_rows is evaluated, and against a long list it is very slow.
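As a rough illustration of that point: a membership test against a list compares the row with each element in turn, while a set looks the row up by its hash, which takes roughly constant time on average. The sketch below uses made-up sample data and hypothetical helper names (load_rows, split_missing). It also converts each row to a tuple, because csv.reader yields lists and lists cannot be stored in a set.

```python
import csv
import io

# Hypothetical sample data; an empty field stands for a missing value.
SAMPLE_CSV = (
    "0,alpha,x,x,desc a,id1\n"
    "1,,x,x,desc b,id2\n"
    "2,gamma,x,x,,id3\n"
    "3,delta,x,x,desc d,id4\n"
)


def load_rows(text):
    """Parse CSV text into tuples, which are hashable (csv.reader yields lists)."""
    return [tuple(row) for row in csv.reader(io.StringIO(text))]


def split_missing(rows):
    """Return (kept_rows, missing_set); membership is tested against a set."""
    missing = {row for row in rows if not all((row[1], row[4], row[5]))}
    kept = [row for row in rows if row not in missing]
    return kept, missing


rows = load_rows(SAMPLE_CSV)
kept, missing = split_missing(rows)
print(len(kept), "rows kept,", len(missing), "distinct rows removed")
```

Because a set stores each distinct row once, len(missing) counts distinct rows, so it can be smaller than the list-based count when the data contains duplicate rows.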

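If your rows are hashable tuples, not much needs to change (rows produced by csv.reader are lists, so convert them with tuple(row) first):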

def remove_missing_rows(app_list):
    print("########### Missing row removal ###########")
    filterfunc = lambda row: not all([row[1], row[4], row[5]])
    missing_rows = set(filter(filterfunc, app_list))

    print("Number of missing entries: " + str(len(missing_rows)))  # 967 with current method

    # Remove the missing_rows from the original data
    # note: should be a lot faster with a set
    app_list = [row for row in app_list if row not in missing_rows]
    return app_list
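If the filtering itself does turn out to be the bottleneck, the split described in the question could look roughly like the sketch below. The helper names (keep_complete_rows, split_into_chunks, remove_missing_rows_parallel) and the sample rows are hypothetical, and this assumes each row is picklable. Every chunk has to be sent to a worker process and the result sent back, so for a check this cheap the overhead may well outweigh the gain; measure before adopting it.

```python
from multiprocessing import Pool


def keep_complete_rows(chunk):
    """Return the rows of one chunk whose columns 1, 4, and 5 are all non-empty."""
    return [row for row in chunk if all((row[1], row[4], row[5]))]


def split_into_chunks(rows, n_chunks):
    """Split rows into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def remove_missing_rows_parallel(app_list, workers=2):
    """Filter each chunk in a separate worker process and re-join in order."""
    chunks = split_into_chunks(app_list, workers)
    with Pool(processes=workers) as pool:
        filtered = pool.map(keep_complete_rows, chunks)  # preserves chunk order
    return [row for chunk in filtered for row in chunk]


if __name__ == "__main__":
    # Hypothetical rows; only columns 1, 4, and 5 are checked.
    sample = [
        (0, "a", "", "", "desc", "id1"),
        (1, "", "", "", "desc", "id2"),
        (2, "c", "", "", "", "id3"),
        (3, "d", "", "", "desc", "id4"),
    ]
    print(remove_missing_rows_parallel(sample))
```

The worker function is defined at module level so it can be pickled, and the pool is only started under the __main__ guard, which is required on platforms that spawn rather than fork worker processes.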

Answer 2

You can use filter to avoid iterating twice:

def remove_missing_rows(app_list):

    filter_func = lambda row: all((row[1], row[4], row[5]))

    return list(filter(filter_func, app_list))

However, if you are doing data analysis, you should probably take a look at pandas. There you can do something like this:

import pandas as pd

df = pd.read_csv('your/csv/data/file', usecols=(1, 4, 5))
df = df.dropna() # remove missing values
