
Improve execution time of very slow python code

This is my first post, so please bear with me. I need some help optimizing the one-liner below.

pd_df.loc[flag, 'COL_{}'.format(col_number)] = pd_df.loc[flag, 'COL_{}'.format(col_number)].apply(lambda x: x + str(userid) + "@")

pd_df: pandas DataFrame containing 2M rows

flag: NumPy one-dimensional boolean array used to filter/update many rows of pd_df at once

'COL_{}'.format(col_number): column name chosen by the main for loop, e.g. COL_1, COL_5, up to COL_15 (string columns up to 5000 characters long)

In general, what this code does is: it first filters the rows to be updated according to flag, selects the column given by col_number, and appends the user id to that single column in those rows, using @ as a delimiter. For example, a cell ends up looking like @userid1@userid2@userid2 and so on.
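
For reference, here is a minimal, self-contained sketch with made-up data (column names and ids are hypothetical) showing what that line does:

import numpy as np
import pandas as pd

pd_df = pd.DataFrame({'COL_1': ['', '@userid1', ''], 'COUNTRY': ['US', 'DE', 'US']})
flag = np.array([True, False, True])   # rows to update
col_number, userid = 1, 'userid2'

# the original one-liner, applied to the toy frame
pd_df.loc[flag, 'COL_{}'.format(col_number)] = pd_df.loc[
    flag, 'COL_{}'.format(col_number)
].apply(lambda x: x + str(userid) + "@")

print(pd_df['COL_1'].tolist())  # ['userid2@', '@userid1', 'userid2@']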

This line of code consumes 75% of my overall run time due to the slow pandas DataFrame loc function and the large number of rows (2M).

Can someone please help me convert this into something more optimized, e.g. using a dictionary or NumPy data types?

Below is the output the above code creates. Based on the Country and Category the user id relates to, the userid is appended to that column. For example, Col_1 can contain ids up to userid3, Col_2 up to userid7, and so on until Col_15.

[screenshot of the sample output omitted]

Thanks in advance.

Regards, Liva

Agreed that apply() can be slow. You want to take advantage of vectorized operations whenever possible. Try using the concatenation operator (+). Does this work any faster?

pd_df.loc[flag, 'COL_{}'.format(col_number)] = pd_df.loc[flag, 'COL_{}'.format(col_number)] + (str(userid) + "@")

Furthermore, not sure if it would help, but some of these strings could be precalculated (Python is probably caching them already, but just in case):

col_name = 'COL_{}'.format(col_number)
suffix = str(userid) + "@"
pd_df.loc[flag, col_name] = pd_df.loc[flag, col_name] + suffix
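
As a rough, machine-dependent illustration (sizes, ids, and column names below are made up), a sketch like this can be used to compare the two approaches:

import time
import numpy as np
import pandas as pd

n = 2_000_000
pd_df = pd.DataFrame({'COL_1': ['@userid1'] * n})
flag = np.random.rand(n) < 0.5          # filter roughly half the rows
suffix = str('userid2') + "@"

t0 = time.perf_counter()
a = pd_df.loc[flag, 'COL_1'].apply(lambda x: x + suffix)   # apply-based version
t1 = time.perf_counter()
b = pd_df.loc[flag, 'COL_1'] + suffix                      # vectorized concatenation
t2 = time.perf_counter()

print(f"apply: {t1 - t0:.2f}s  vectorized +: {t2 - t1:.2f}s  same result: {a.equals(b)}")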

A couple of points:

  1. f-strings are always faster than str.format; use them whenever possible:

     In [3]: fmt = "{foo}"

     In [4]: %timeit fmt.format(foo=5)
     299 ns ± 21.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

     In [5]: foo = 5

     In [6]: %timeit f"{foo}"
     79.2 ns ± 2.31 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  2. It seems userid is independent of the dataframe, so I'm not sure why you are using apply; just use broadcasting:

     In [8]: userid = "abcdef"

     In [9]: pd.Series('abc def ghi jkl'.split()) + f'@{userid}'
     Out[9]:
     0    abc@abcdef
     1    def@abcdef
     2    ghi@abcdef
     3    jkl@abcdef
     dtype: object

So the final approach could be something like this:

for num in range(5):
    flag = ...  # calculate flag
    df.loc[flag, f"col_{num}"] = df.loc[flag, f"col_{num}"] + f"@{userid}"
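
A self-contained sketch of that loop with toy data (the column names, the flag, and the id are hypothetical):

import numpy as np
import pandas as pd

# toy frame with five empty string columns col_0 .. col_4
df = pd.DataFrame({f"col_{num}": [""] * 6 for num in range(5)})
userid = "userid1"

for num in range(5):
    flag = np.arange(len(df)) % 2 == 0   # calculate flag (here: every other row)
    df.loc[flag, f"col_{num}"] = df.loc[flag, f"col_{num}"] + f"@{userid}"

print(df)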

apply is one of the slower ways to run a function item by item.

pd_df.loc[flag, f'COL_{col_number}'] = pd_df.loc[flag, f'COL_{col_number}'].map(lambda x: f'{x}{userid}@')
