Improving pandas apply function performance

I have a pandas dataframe whose column contains dictionaries. I also have a query dictionary, and I want to compute the minimum of the sums of the values over the common keys.
For example:

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
common keys = 'a', 'b'
s1 = dicta['a'] + dicta['b']  # 26
s2 = dictb['a'] + dictb['b']  # 2
result = min(s1, s2)          # 2

I am using the following code to compute it.

def compute_common(dict1, dict2):
    common_keys = dict1.keys() & dict2.keys()
    im_count1 = sum(dict1[k] for k in common_keys)
    im_count2 = sum(dict2[k] for k in common_keys)
    return int(min(im_count1, im_count2))

Following are the timings for the operation on my 8-core i7 machine with 8 GB of RAM.

%timeit df['a'].apply(lambda x:compute_common(dictb, x))
55.2 ms ± 702 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

I also found out that I can use swifter to improve the performance of pandas apply (it uses multiprocessing internally):

%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))
66.4 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Using swifter is even slower (maybe because of the overhead of multiprocessing). I wanted to know if there is any way to squeeze more performance out of this operation.

You can use the following to reproduce the setup.

import pandas as pd

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
df = pd.DataFrame({'a': [dicta] * 30000})

%timeit df['a'].apply(lambda x:compute_common(dictb, x))
%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))

Thanks in advance.

Use a list comprehension to find the common keys, then sum the corresponding values in each dictionary and take the minimum of the two sums. For the example data the common keys are ['a', 'b'], the two sums are 26 and 2, and the minimum of 26 and 2 is 2.

def find_common_keys(dicta, dictb):
    '''
    >>> find_common_keys({'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67},
    ...                  {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67})
    2
    '''
    common_keys = [key for key in dicta if key in dictb]

    s1 = sum(dicta[key] for key in common_keys)
    s2 = sum(dictb[key] for key in common_keys)
    return min(s1, s2)

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}

print(find_common_keys(dicta,dictb))

Output

2
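
Applied to the dataframe from the question, the same function can be used with apply (a small sketch; note that here the row dictionary is passed first and the query dictionary second):

result = df['a'].apply(lambda x: find_common_keys(x, dictb))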

You can explode the dictionaries into a dataframe and sum across the common columns:

# one column per key that appears in any of the dictionaries
dict_data = pd.DataFrame(df['a'].tolist())

# keys shared between those columns and the query dictionary
common_keys = dict_data.columns.intersection(dictb.keys())

# sum of the query dictionary over the common keys (a scalar)
dictb_sum = sum(dictb[k] for k in common_keys)

# row-wise sum over the common keys (vectorized)
dicta_sum = dict_data[common_keys].sum(1)

# element-wise minimum of the two sums
output = dicta_sum.clip(upper=dictb_sum)

This is twice as fast as apply on my system. Note that this works best when the union of all keys, union(x.keys() for x in df['a']), is not too big, since those keys become the columns of dict_data, while the dataframe has enough rows that the vectorized .sum(1) pays off.
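
As a quick sanity check (a minimal sketch using the reproduction snippet from the question), the vectorized output should match the original apply-based result:

# Compare the vectorized result with the original compute_common via apply.
expected = df['a'].apply(lambda x: compute_common(dictb, x))
assert (output == expected).all()  # every row is min(26, 2) == 2 for the sample data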

Following are some of my findings; I'm sharing them in case they help someone else. These are the optimizations I was able to achieve while extending @Golden Lion's idea.

  1. Just compiling the function using Cython gives a 10% performance boost.
  2. Since Python is loosely typed, writing the Cython function with explicit types increases the performance further.
  3. Also, since function calls in Python are expensive, converting min(x1, x2) into x1 if x1 < x2 else x2 gives a performance benefit.

The final function I used gave me a 3x performance boost.

cpdef int cython_common(dict_1, dict_2):
    cdef dict dict1 = dict_1[0]
    cdef dict dict2 = dict_2[0]
    cdef list common_keys = [key for key in dict1 if key in dict2]
    cdef int sum1 = 0
    cdef int sum2 = 0
    for i in common_keys:
        sum1 += dict1[i]
        sum2 += dict2[i]
    return sum1 if sum1 < sum2 else sum2
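
For reference, this is roughly how it can be compiled and called (a minimal sketch; the module name is hypothetical, and the [0] indexing above suggests the dictionaries are passed wrapped in single-element containers):

import pyximport; pyximport.install()        # compile .pyx modules on import
from compute_common_cy import cython_common  # hypothetical module holding the cpdef above

# Wrap each dict in a list to match the dict_1[0] / dict_2[0] access in the function.
result = df['a'].apply(lambda x: cython_common([dictb], [x]))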

Also, with some experiments I found that libraries like pandarallel and swifter give a speedup when the dataset has a large number of rows (for a smaller number of rows, I think the overhead of spawning processes and combining the results is much larger than the computation itself).
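
For completeness, a minimal pandarallel sketch (assuming the library is installed; parallel_apply mirrors the plain apply call from the question):

from pandarallel import pandarallel

# Spawns worker processes; only worthwhile when df has many rows.
pandarallel.initialize(progress_bar=False)

result = df['a'].parallel_apply(lambda x: compute_common(dictb, x))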

Also, this is a great read.
