Iterate through a for loop using multiple cores in Python

I have the following code that is currently running like normal Python code:

def remove_missing_rows(app_list):
    print("########### Missing row removal ###########")
    missing_rows = []

    ''' Remove any row that has missing data in the name, id, or description column '''
    for row in app_list:
        if not row[1]:
            missing_rows.append(row)
            continue  # Continue loop to next row. No need to check more columns
        if not row[5]:
            missing_rows.append(row)
            continue  # Continue loop to next row. No need to check more columns
        if not row[4]:
            missing_rows.append(row)

    print("Number of missing entries: " + str(len(missing_rows)))  # 967 with current method

    # Remove the missing_rows from the original data
    app_list = [row for row in app_list if row not in missing_rows]
    return app_list

Now, after writing this for a smaller sample, I wish to run it on a very large data set. To do this I thought it would be useful to utilise the multiple cores of my computer.

I'm struggling to implement this using the multiprocessing module, though. The idea I have is that core 1 could work through the first half of the data set while core 2 works through the last half, and so on, all in parallel. Is this possible?
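For reference, here is a minimal sketch of the chunking idea the question describes, using multiprocessing.Pool (the function names, the chunk size, and the two-row sample are illustrative, not from the original post). Note that, as the answer below points out, this job is probably not CPU-bound, so extra workers may not help much:

    from multiprocessing import Pool

    def is_complete(row):
        # Keep a row only when the name, description and id columns all hold data
        return all((row[1], row[4], row[5]))

    def remove_missing_rows_parallel(app_list, processes=2):
        # Pool.map slices app_list into chunks and checks the rows
        # on separate worker processes in parallel
        with Pool(processes=processes) as pool:
            keep = pool.map(is_complete, app_list, chunksize=10000)
        return [row for row, ok in zip(app_list, keep) if ok]

    if __name__ == '__main__':  # guard required where workers are spawned
        rows = [('a1', 'App One', 0, 0, 'desc', 'id-1'),
                ('a2', '', 0, 0, 'desc', 'id-2')]
        print(remove_missing_rows_parallel(rows))  # drops the second row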

This is probably not CPU-bound. Try the code below.

I've used a set for very fast (hash-based) containment tests: you rely on containment when you write if row not in missing_rows, and that check is very slow against a long list.
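As a toy illustration of that cost (the sizes and values here are made up), a failed lookup has to scan the whole list but is a single hash probe in a set:

    import timeit

    data = [str(i) for i in range(100000)]
    as_set = set(data)

    # 'missing' is in neither container, the worst case for the list
    print(timeit.timeit(lambda: 'missing' in data, number=100))    # slow
    print(timeit.timeit(lambda: 'missing' in as_set, number=100))  # fast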

If this comes from the csv module, convert each row with tuple(row) so the rows are hashable (csv.reader yields lists); beyond that, not many changes are needed:

def remove_missing_rows(app_list):
    print("########### Missing row removal ###########")
    filterfunc = lambda row: not all([row[1], row[4], row[5]])
    missing_rows = set(filter(filterfunc, app_list))

    print("Number of missing entries: " + str(len(missing_rows)))  # 967 with current method

    # Remove the missing_rows from the original data
    # note: should be a lot faster with a set
    app_list = [row for row in app_list if row not in missing_rows]
    return app_list

You can use filter to avoid iterating twice:

def remove_missing_rows(app_list):

    filter_func = lambda row: all((row[1], row[4], row[5]))

    return list(filter(filter_func, app_list))
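For example, with rows shaped like the question's, where columns 1, 4 and 5 hold the name, description and id (the sample values here are made up):

    rows = [('a1', 'App One', 0, 0, 'a description', 'id-1'),
            ('a2', '', 0, 0, 'a description', 'id-2')]  # missing name
    print(remove_missing_rows(rows))  # only the first row survives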

But if you are doing data analysis, you should probably have a look at pandas. There you could do something like this:

import pandas as pd

df = pd.read_csv('your/csv/data/file', usecols=(1, 4, 5))
df = df.dropna() # remove missing values
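Note that usecols=(1, 4, 5) keeps only those three columns. If you want the full rows back and only want those columns checked for missing values, dropna takes a subset argument; selecting the labels by position below is an assumption about the file's layout:

    import pandas as pd

    df = pd.read_csv('your/csv/data/file')
    # drop rows with NaN in any of the three checked columns, keep every column
    df = df.dropna(subset=df.columns[[1, 4, 5]])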
