Efficiently applying a function to a grouped pandas DataFrame in parallel
I often need to apply a function to the groups of a very large DataFrame (of mixed data types) and would like to take advantage of multiple cores.
I can create an iterator from the groups and use the multiprocessing module, but this is inefficient because every group and every result of the function must be pickled for messaging between processes.
Is there any way to avoid the pickling, or even to avoid copying the DataFrame completely? It looks like the shared memory functions of the multiprocessing module are limited to numpy arrays. Are there any other options?
From the comments above, it seems that this is planned for pandas at some point (there is also an interesting-looking rosetta project, which I just noticed).
However, until all parallel functionality is incorporated into pandas, I noticed that it is very easy to write efficient, non-memory-copying parallel augmentations to pandas directly, using cython + OpenMP and C++.
Here's a short example of writing a parallel groupby-sum, whose use is something like this:
import pandas as pd
import para_group_demo
df = pd.DataFrame({'a': [1, 2, 1, 2, 1, 1, 0], 'b': range(7)})
print(para_group_demo.sum(df.a, df.b))
and the output is:
     sum
key
0      6
1     11
2      4
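As a sanity check on the numbers above, the same grouped sums can be reproduced serially in plain Python. This is only an illustrative sketch: the helper name `grouped_sum` is made up here and is not part of `para_group_demo` or pandas.

```python
from collections import defaultdict


def grouped_sum(keys, values):
    """Serially sum `values` per key (hypothetical helper, for checking only)."""
    totals = defaultdict(int)
    for k, v in zip(keys, values):
        totals[k] += v
    return dict(totals)


# Same data as the demo DataFrame: column a holds the keys, column b the values.
a = [1, 2, 1, 2, 1, 1, 0]
b = list(range(7))
result = grouped_sum(a, b)
print(sorted(result.items()))  # [(0, 6), (1, 11), (2, 4)]
```

Any parallel version should agree with this serial result, whatever order it returns the keys in.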
Note: Doubtlessly, this simple example's functionality will eventually be part of pandas. Some things, however, will be more natural to parallelize in C++ for some time, and it's important to be aware of how easy it is to combine this with pandas.
To do this, I wrote a simple single-source-file extension whose code follows.
It starts with some imports and type definitions:
from libc.stdint cimport int64_t, uint64_t
from libcpp.vector cimport vector
from libcpp.unordered_map cimport unordered_map
cimport cython
from cython.operator cimport dereference as deref, preincrement as inc
from cython.parallel import prange
import pandas as pd
ctypedef unordered_map[int64_t, uint64_t] counts_t
ctypedef unordered_map[int64_t, uint64_t].iterator counts_it_t
ctypedef vector[counts_t] counts_vec_t
The C++ unordered_map type holds the sums computed by a single thread, and the vector holds one such map per thread.
Now to the function sum. It starts off with typed memory views for fast access:
def sum(crit, vals):
    cdef int64_t[:] crit_view = crit.values
    cdef int64_t[:] vals_view = vals.values
The function continues by dividing the work semi-equally among the threads (here hardcoded to 4), and having each thread sum the entries in its range:
    cdef uint64_t num_threads = 4
    cdef uint64_t l = len(crit)
    # Floor division, so the chunk size stays an integer
    cdef uint64_t s = l // num_threads + 1
    cdef uint64_t i, j, e
    cdef counts_vec_t counts
    # One partial-sum map per thread
    counts = counts_vec_t(num_threads)
    counts.resize(num_threads)
    with cython.boundscheck(False):
        for i in prange(num_threads, nogil=True):
            # Thread i sums the entries in the half-open range [j, e)
            j = i * s
            e = j + s
            if e > l:
                e = l
            while j < e:
                counts[i][crit_view[j]] += vals_view[j]
                inc(j)
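To make the index arithmetic above concrete, here is a plain-Python sketch of the same split. The function name `chunk_bounds` is hypothetical; it mirrors `s = l // num_threads + 1` and the clamping of `e` to `l`, and additionally clamps the start so that surplus threads get an empty range rather than a start past the end.

```python
def chunk_bounds(length, num_threads):
    """Return one half-open (start, end) index range per thread."""
    step = length // num_threads + 1
    bounds = []
    for i in range(num_threads):
        # Clamp both ends so no range reaches past the data.
        start = min(i * step, length)
        end = min(start + step, length)
        bounds.append((start, end))
    return bounds


print(chunk_bounds(7, 4))  # [(0, 2), (2, 4), (4, 6), (6, 7)]
```

Because the step is rounded up by one, the last thread can end up with a shorter (or even empty) range, which is why the split is only semi-equal.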
When the threads have completed, the function merges all the results (from the different ranges) into a single unordered_map:
    cdef counts_t total
    cdef counts_it_t it, e_it
    for i in range(num_threads):
        it = counts[i].begin()
        e_it = counts[i].end()
        while it != e_it:
            total[deref(it).first] += deref(it).second
            inc(it)
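The merge step above can be illustrated in plain Python, with dictionaries standing in for the per-thread unordered_map objects. This is a sketch under that assumption; `merge_partials` is a made-up name, and the partial maps below are what four threads would produce for the demo data when the seven rows are split as [0, 2), [2, 4), [4, 6), [6, 7).

```python
def merge_partials(partials):
    """Combine per-thread {key: partial_sum} dicts into one total dict."""
    total = {}
    for partial in partials:
        for key, value in partial.items():
            # A key missing from `total` starts at 0, like a fresh map entry.
            total[key] = total.get(key, 0) + value
    return total


# Partial sums for a = [1, 2, 1, 2, 1, 1, 0], b = range(7), one dict per thread.
partials = [{1: 0, 2: 1}, {1: 2, 2: 3}, {1: 9}, {0: 6}]
print(merge_partials(partials))  # {1: 11, 2: 4, 0: 6}
```

The merge runs serially, which is cheap as long as the number of distinct keys is small compared with the number of rows.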
All that's left is to create a DataFrame and return the results:
    key, sum_ = [], []
    it = total.begin()
    e_it = total.end()
    while it != e_it:
        key.append(deref(it).first)
        sum_.append(deref(it).second)
        inc(it)
    df = pd.DataFrame({'key': key, 'sum': sum_})
    df.set_index('key', inplace=True)
    return df