
Parallelizing operations for pandas dataframe

I have a very large pandas dataframe, similar to the following:

╔════════╦═════════════╦════════╗
║ index  ║    Users    ║ Income ║
╠════════╬═════════════╬════════╣
║ 0      ║ user_1      ║    304 ║
║ 1      ║ user_2      ║    299 ║
║ ...    ║             ║        ║
║ 399999 ║ user_400000 ║    542 ║
╚════════╩═════════════╩════════╝

(There are a few more columns needed for some of the calculations.)

For every client I have to apply lots and lots of operations (shifts, sums, subtractions, conditions, etc.), so I believe it's impossible to apply boolean masking for everything; I have already tried. So my question is whether it's possible to divide the pandas dataframe into chunks, e.g.:

# chunk 1
╔════════╦═════════════╦════════╗
║ index  ║    Users    ║ Income ║
╠════════╬═════════════╬════════╣
║ 0      ║ user_1      ║    304 ║
║ 1      ║ user_2      ║    299 ║
║ ...    ║             ║        ║
║ 19999  ║ user_20000  ║    432 ║
╚════════╩═════════════╩════════╝

# chunk 2
╔════════╦═════════════╦════════╗
║ index  ║    Users    ║ Income ║
╠════════╬═════════════╬════════╣
║ 20000  ║ user_20001  ║    199 ║
║ 20001  ║ user_20002  ║    412 ║
║ ...    ║             ║        ║
║ 39999  ║ user_40000  ║    725 ║
╚════════╩═════════════╩════════╝

# chunk K 
╔════════╦═════════════╦════════╗
║ index  ║    Users    ║ Income ║
╠════════╬═════════════╬════════╣
║ ...    ║ user_...    ║    ... ║
║ ...    ║ user_...    ║    ... ║
║ ...    ║             ║        ║
║ ...    ║ user_...    ║    ... ║
╚════════╩═════════════╩════════╝

And then apply all the operations to all of those chunks in parallel, roughly as in the sketch below.
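In other words, something like this (just a sketch of the split I have in mind; np.array_split and the chunk count of 20 are placeholders):

import pandas as pd
import numpy as np

# Toy frame standing in for the real 400000-row one.
df = pd.DataFrame({'Users': ['user_' + str(i) for i in range(1, 400001)],
                   'Income': 300})

# Split into K roughly equal, contiguous chunks (K = 20 -> ~20000 rows each).
chunks = np.array_split(df, 20)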

You can use a multiprocessing pool to accomplish some of those tasks. However, multiprocessing is itself an expensive operation, so you need to test whether parallelizing is actually faster; it depends on the kind of functions you are running and on the data. For example, I create a sample df :

import pandas as pd
import numpy as np
from random import randint
from multiprocessing import Pool, cpu_count
from timeit import timeit


def f(df: pd.DataFrame):
    # Sample workload: a few row-wise .apply operations per chunk.
    df['Something'] = df['Users'].apply(lambda name: len(name))
    df['Other stuff'] = df['Income'].apply(lambda income: 'Senior' if income > 200 else 'Junior')
    df['Some other stuff'] = df['Users'].apply(lambda name: name.count('1'))
    return df


if __name__ == '__main__':
    samples = 5000000  # ~5 million rows
    df = pd.DataFrame(
        [
            ['user_' + str(i), randint(0, 500)] for i in range(1, samples)
        ], columns=['Users', 'Income']
    )

If we time this version of the f function with multiprocessing, I get 38.189200899999996 on my old laptop:

    parallelized = timeit("""
cores = cpu_count()
df_in_chunks = np.array_split(df, cores)  # one chunk per core
pool = Pool(cores)
result_df = pd.concat(pool.map(f, df_in_chunks))  # run f on each chunk, then recombine
pool.close()
pool.join()
    """, 
    "from __main__ import pd, np, df, Pool, cpu_count, f", 
    number=5
    )
    print(parallelized)

Running the same workload in a single core I get 25.0754394, so in this case the overhead of using multiprocessing is bigger than the execution time of running the entire thing on a single core:

    not_parallelized = timeit("""
result_df = f(df)
    """, 
    "from __main__ import pd, df, f", 
    number=5
    )
    print(not_parallelized)

However, if we add more complexity to the f function, there is a point where broadcasting the df to each process is cheaper than running it on a single core.
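As a rough illustration of that point (f_heavy is a hypothetical, deliberately CPU-bound variant, not part of the benchmark above):

def f_heavy(df: pd.DataFrame):
    # Hypothetical workload: the more CPU time each row costs, the more the
    # Pool's process-spawning and pickling overhead pays for itself.
    df['Expensive'] = df['Income'].apply(lambda x: sum(i * i for i in range(x)))
    return df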

To my knowledge, the pandas GroupBy split-apply-combine pattern may address your issue: divide your DataFrame into multiple chunks (groups), then apply a self-defined function to each chunk (group). We can talk code if you have any further issues.
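A minimal sketch of that pattern (my_func and the Rank column are placeholders, assuming one or more rows per user):

def my_func(group: pd.DataFrame) -> pd.DataFrame:
    # Any self-defined per-group computation goes here.
    group['Rank'] = group['Income'].rank()
    return group

result = df.groupby('Users', group_keys=False).apply(my_func)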

Hope this helps!
