
Improve execution time of very slow Python code

This is my first post so please bear with me. I need some help to optimize the below one liner.

pd_df.loc[flag, 'COL_{}'.format(col_number)] = pd_df.loc[flag, 'COL_{}'.format(col_number)].apply(lambda x: x + str(userid) + "@")

pd_df: pandas DataFrame containing 2M rows

flag: one-dimensional NumPy boolean array used to filter/update many rows of pd_df at once

'COL_{}'.format(col_number): the column picked by the main for loop, such as COL_1, COL_5, up to COL_15 (string data type, 5000 characters long)

In general, this code first selects the rows to update according to flag and the column to update according to the column number, then appends the user id to that single column in all of those rows, with @ as the delimiter. For example: @userid1@userid2@userid2 and so on.

This line consumes 75% of my overall run time, because the pandas DataFrame loc function is slow and the number of rows is large (2M).

Can someone please help me convert this piece into something more optimized, for example using a dictionary or a NumPy data type?

Below is the output the above code creates. Each user id is appended to a column number based on the Country and Category it relates to. For example, COL_1 can contain up to userid3, COL_2 up to userid7, and so on until COL_15.

[image: enter image description here]

Thanks in advance.

Regards, Liva

Agreed that apply() can be slow. You want to take advantage of vectorized operations whenever possible. Try using the concatenation operator (+). Does this work any faster?

pd_df.loc[flag, 'COL_{}'.format(col_number)] = pd_df.loc[flag, 'COL_{}'.format(col_number)] + (str(userid) + "@")

Furthermore, I'm not sure it will help, but some of these strings could be precomputed (Python is probably caching them already, but in case it is not):

col_name = 'COL_{}'.format(col_number)
suffix = str(userid) + "@"
pd_df.loc[flag, col_name] = pd_df.loc[flag, col_name] + suffix
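To make the suggestion above concrete, here is a minimal sketch on a toy frame. The data, the flag values, and the userid are all made up for illustration; only the pattern (precomputed column name and suffix, then Series + string) comes from the answer. It also checks that the vectorized form gives the same values as the original apply() call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the 2M-row frame from the question.
pd_df = pd.DataFrame({"COL_1": ["u1@", "u2@", "", "u3@"]})
flag = np.array([True, False, True, True])  # rows to update
col_number = 1
userid = 42

# Build the column name and the suffix once, outside any per-row work.
col_name = f"COL_{col_number}"
suffix = f"{userid}@"

# Element-wise version from the question: one Python lambda call per row.
via_apply = pd_df.loc[flag, col_name].apply(lambda x: x + suffix)

# Vectorized version: Series + str appends the suffix to every selected row.
via_concat = pd_df.loc[flag, col_name] + suffix

# Both approaches should produce the same values for the selected rows.
same_result = via_apply.tolist() == via_concat.tolist()

# Write the result back; rows where flag is False are left untouched.
pd_df.loc[flag, col_name] = via_concat
updated = pd_df[col_name].tolist()
print(same_result, updated)
```

Whether this is noticeably faster on the real data depends on the string lengths and the share of rows selected, so it is worth timing on a representative sample.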

A couple of points:

  1. f-strings are generally faster than str.format; use them whenever possible:

     In [3]: fmt = "{foo}"

     In [4]: %timeit fmt.format(foo=5)
     299 ns ± 21.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

     In [5]: foo = 5

     In [6]: %timeit f"{foo}"
     79.2 ns ± 2.31 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
  2. It seems userid is independent of the dataframe, so I'm not sure why you are using apply. Just use broadcasting:

     In [8]: userid = "abcdef"

     In [9]: pd.Series('abc def ghi jkl'.split()) + f'@{userid}'
     Out[9]:
     0    abc@abcdef
     1    def@abcdef
     2    ghi@abcdef
     3    jkl@abcdef
     dtype: object

So the final approach could be something like this:

for num in range(5):
    flag = ...  # calculate flag
    df.loc[flag, f"col_{num}"] = df.loc[flag, f"col_{num}"] + f"@{userid}"
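To check whether broadcasting actually helps on your own data, a rough timing harness along these lines may be useful. Everything here is a stand-in: the 20,000-row frame, the every-other-row flag, and the helper names with_apply and with_broadcast are assumptions for illustration, not part of the original code. The numbers it prints will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in data: 20,000 rows with a single string column.
n_rows = 20_000
df = pd.DataFrame({"col_0": ["user@"] * n_rows})
flag = np.arange(n_rows) % 2 == 0  # boolean mask selecting every other row
userid = "abcdef"
suffix = f"{userid}@"


def with_apply():
    # Per-row Python function call, as in the question.
    return df.loc[flag, "col_0"].apply(lambda x: x + suffix)


def with_broadcast():
    # Single Series + str operation, as suggested in this answer.
    return df.loc[flag, "col_0"] + suffix


# Sanity check: both variants should yield identical strings.
results_match = with_apply().tolist() == with_broadcast().tolist()

# Time a few repetitions of each variant; results are machine dependent.
t_apply = timeit.timeit(with_apply, number=5)
t_broadcast = timeit.timeit(with_broadcast, number=5)
print(f"apply: {t_apply:.4f}s  broadcast: {t_broadcast:.4f}s")
```

Scaling n_rows toward the real 2M rows and using strings of realistic length should give a more representative comparison.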

apply is one of the slower ways to run a function item by item.

pd_df.loc[flag, f'COL_{col_number}'] = pd_df.loc[flag, f'COL_{col_number}'].map(lambda x: f'{x}{userid}@')

