Iterate through a for loop using multiple cores in Python

I have the following code that is currently running like normal Python code:

def remove_missing_rows(app_list):
    print("########### Missing row removal ###########")
    missing_rows = []

    ''' Remove any row that has missing data in the name, id, or description column '''
    for row in app_list:
        if not row[1]:
            missing_rows.append(row)
            continue  # Continue loop to next row. No need to check more columns
        if not row[5]:
            missing_rows.append(row)
            continue  # Continue loop to next row. No need to check more columns
        if not row[4]:
            missing_rows.append(row)

    print("Number of missing entries: " + str(len(missing_rows)))  # 967 with current method

    # Remove the missing_rows from the original data
    app_list = [row for row in app_list if row not in missing_rows]
    return app_list

Now, after writing this for a smaller sample, I wish to run it on a very large data set. To do this I thought it would be useful to utilise the multiple cores of my computer.

I'm struggling to implement this using the multiprocessing module, though. The idea I have is that core 1 could work through the first half of the data set while core 2 works through the last half, and so on, all in parallel. Is this possible?
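For reference, here is a minimal sketch of the chunking idea the question describes, using multiprocessing.Pool (the function names, the chunk size, and the two-row sample are illustrative, not from the original post). Note that, as the answer below points out, this job is probably not CPU-bound, so extra workers may not help much:

    from multiprocessing import Pool

    def is_complete(row):
        # Keep a row only when the name, description and id columns all hold data
        return all((row[1], row[4], row[5]))

    def remove_missing_rows_parallel(app_list, processes=2):
        # Pool.map slices app_list into chunks and checks the rows
        # on separate worker processes in parallel
        with Pool(processes=processes) as pool:
            keep = pool.map(is_complete, app_list, chunksize=10000)
        return [row for row, ok in zip(app_list, keep) if ok]

    if __name__ == '__main__':  # guard required where workers are spawned
        rows = [('a1', 'App One', 0, 0, 'desc', 'id-1'),
                ('a2', '', 0, 0, 'desc', 'id-2')]
        print(remove_missing_rows_parallel(rows))  # drops the second row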

This is probably not CPU-bound. Try the code below.

I've used a set for very fast (hash-based) containment tests: you rely on containment when you write if row not in missing_rows, and that check is very slow against a long list.
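As a toy illustration of that cost (the sizes and values here are made up), a failed lookup has to scan the whole list but is a single hash probe in a set:

    import timeit

    data = [str(i) for i in range(100000)]
    as_set = set(data)

    # 'missing' is in neither container, the worst case for the list
    print(timeit.timeit(lambda: 'missing' in data, number=100))    # slow
    print(timeit.timeit(lambda: 'missing' in as_set, number=100))  # fast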

If this comes from the csv module, convert each row with tuple(row) so the rows are hashable (csv.reader yields lists); beyond that, not many changes are needed:

def remove_missing_rows(app_list):
    print("########### Missing row removal ###########")
    filterfunc = lambda row: not all([row[1], row[4], row[5]])
    missing_rows = set(filter(filterfunc, app_list))

    print("Number of missing entries: " + str(len(missing_rows)))  # 967 with current method

    # Remove the missing_rows from the original data
    # note: should be a lot faster with a set
    app_list = [row for row in app_list if row not in missing_rows]
    return app_list

You can use filter to avoid iterating twice:

def remove_missing_rows(app_list):

    filter_func = lambda row: all((row[1], row[4], row[5]))

    return list(filter(filter_func, app_list))
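For example, with rows shaped like the question's, where columns 1, 4 and 5 hold the name, description and id (the sample values here are made up):

    rows = [('a1', 'App One', 0, 0, 'a description', 'id-1'),
            ('a2', '', 0, 0, 'a description', 'id-2')]  # missing name
    print(remove_missing_rows(rows))  # only the first row survives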

But if you are doing data analysis, you should probably have a look at pandas. There you could do something like this:

import pandas as pd

df = pd.read_csv('your/csv/data/file', usecols=(1, 4, 5))
df = df.dropna() # remove missing values
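Note that usecols=(1, 4, 5) keeps only those three columns. If you want the full rows back and only want those columns checked for missing values, dropna takes a subset argument; selecting the labels by position below is an assumption about the file's layout:

    import pandas as pd

    df = pd.read_csv('your/csv/data/file')
    # drop rows with NaN in any of the three checked columns, keep every column
    df = df.dropna(subset=df.columns[[1, 4, 5]])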
