Improving pandas apply function performance
I have a pandas dataframe whose column contains dictionaries. I also have a query dictionary, and I want to compute the minimum of the two sums of the values of the common keys: for each dictionary, I sum the values of the keys it shares with the query dictionary, then take the minimum of the two sums.
For example:
dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
# common keys: 'a', 'b'
s1 = dicta['a'] + dicta['b']   # 26
s2 = dictb['a'] + dictb['b']   # 2
result = min(s1, s2)           # 2
I am using the following code to compute it.
def compute_common(dict1, dict2):
    common_keys = dict1.keys() & dict2.keys()
    im_count1 = sum(dict1[k] for k in common_keys)
    im_count2 = sum(dict2[k] for k in common_keys)
    return int(min(im_count1, im_count2))
Following are the timings for the operations on my i7 8-core machine with 8 GB of RAM.
%timeit df['a'].apply(lambda x:compute_common(dictb, x))
55.2 ms ± 702 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
I also found out that I can use swifter to improve the performance of pandas apply (it uses multiprocessing internally):
%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))
66.4 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Using swifter is even slower (maybe because of the overhead of multiprocessing). I wanted to know if there is any way to squeeze more performance out of this operation.
You can use the following to replicate the timings.
dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
df = pd.DataFrame({'a': [dicta] * 30000})
%timeit df['a'].apply(lambda x:compute_common(dictb, x))
%timeit df['a'].swifter.progress_bar(False).apply(lambda x:compute_common(dictb, x))
Thanks in advance.
Use a list comprehension to find the values for the common keys, then sum the results and take the minimum of the two per-dictionary sums. The common keys are collected into a list, ['a', 'b']; the list comprehensions then look up the values for a and b in each dictionary and sum them, giving 26 and 2. The min of 26 and 2 is 2.
def find_common_keys(dicta, dictb):
    '''
    >>> find_common_keys({'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67},
    ...                  {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67})
    2
    '''
    common_keys = [key for key in dicta if key in dictb]
    s1 = sum(dicta[key] for key in common_keys)
    s2 = sum(dictb[key] for key in common_keys)
    return min(s1, s2)
dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
print(find_common_keys(dicta, dictb))
output:
2
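To run this over the dataframe from the question, a minimal usage sketch (reusing the df and dictb defined there):

# apply the row-wise helper over the dictionary column from the question
result = df['a'].apply(lambda x: find_common_keys(x, dictb))
print(result.head())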
You can explode the dictionaries to a dataframe and sum across it:
dict_data = pd.DataFrame(df['a'].tolist())
common_keys = dict_data.columns.intersection(dictb.keys())
dictb_sum = sum(dictb[k] for k in common_keys)
dicta_sum = dict_data[common_keys].sum(1)
# per-row minimum of the two sums
output = dicta_sum.clip(upper=dictb_sum)
This is twice as fast as apply on my system. Note that this works if union(x.keys() for x in df['a']) is not too big, since those are all the columns of dict_data, but large enough so that you can utilize the vectorized .sum(1).
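For reference, a self-contained sketch of this approach, cross-checked against the apply baseline (compute_common and the reproduction data are the ones given in the question):

import pandas as pd

def compute_common(dict1, dict2):
    # row-wise baseline from the question
    common_keys = dict1.keys() & dict2.keys()
    return int(min(sum(dict1[k] for k in common_keys),
                   sum(dict2[k] for k in common_keys)))

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
df = pd.DataFrame({'a': [dicta] * 30000})

# explode the dict column into a rows-by-keys frame (NaN where a key is absent)
dict_data = pd.DataFrame(df['a'].tolist())
common_keys = dict_data.columns.intersection(dictb.keys())
dictb_sum = sum(dictb[k] for k in common_keys)
dicta_sum = dict_data[common_keys].sum(1)  # NaNs are skipped

# per-row minimum of the two sums
output = dicta_sum.clip(upper=dictb_sum)

# sanity check against the apply baseline
expected = df['a'].apply(lambda x: compute_common(dictb, x))
assert (output.astype(int) == expected).all()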
Following are some of my findings; sharing them so that they help someone else. These are the optimizations I was able to achieve. I tried extending @Golden Lion's idea.
Writing the function in Cython with typed variables, and converting min(x1, x2) into x1 if x1 < x2 else x2, gives a performance benefit. The final function I used gave me a 3x performance boost.
cpdef int cython_common(dict_1, dict_2):
    cdef dict dict1 = dict_1[0]
    cdef dict dict2 = dict_2[0]
    cdef list common_keys = [key for key in dict1 if key in dict2]
    cdef int sum1 = 0
    cdef int sum2 = 0
    for i in common_keys:
        sum1 += dict1[i]
        sum2 += dict2[i]
    return sum1 if sum1 < sum2 else sum2
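As a usage sketch: assuming the function above is compiled in a Jupyter %%cython cell (after %load_ext Cython), it can be driven from pandas like this; the one-element list wrappers match the dict_1[0] / dict_2[0] indexing inside the function:

# hypothetical driver for the compiled function; df and dictb are the
# reproduction objects from the question
result = df['a'].apply(lambda x: cython_common([x], [dictb]))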
Also, with some experiments I found out that libraries like pandarallel and swifter gave a speedup when the dataset has a large number of rows (for a smaller number of rows, I think the overhead of spawning processes and combining the results is much larger than the compute itself).
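A minimal sketch of the pandarallel route, assuming pandarallel is installed (parallel_apply is its drop-in replacement for apply; functools.partial keeps the callable picklable for the worker processes):

from functools import partial

import pandas as pd
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=False)  # one worker per core by default

def compute_common(dict1, dict2):
    # same row-wise function as in the question
    common_keys = dict1.keys() & dict2.keys()
    return int(min(sum(dict1[k] for k in common_keys),
                   sum(dict2[k] for k in common_keys)))

dicta = {'a': 5, 'b': 21, 'c': 34, 'd': 56, 'r': 67}
dictb = {'a': 1, 'b': 1, 't': 34, 'g': 56, 'h': 67}
# a large row count, where the multiprocessing overhead can pay off
df = pd.DataFrame({'a': [dicta] * 1_000_000})

result = df['a'].parallel_apply(partial(compute_common, dictb))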