
Speed up iteration over DataFrame items

I wrote a function that divides each cell of a DataFrame by a number stored in another DataFrame.

from tqdm import tqdm

def calculate_dfA(df_t, xout):
    df_A = df_t.copy()
    vector_x = xout.T

    # One cell at a time: divide each cell by the divisor of its column
    for i_col, (_, column) in enumerate(tqdm(df_A.items())):
        for i_row in range(len(df_A)):
            df_A.iloc[i_row, i_col] = df_A.iloc[i_row, i_col] / vector_x.iloc[0, i_col]

    return df_A

The DataFrame on which I apply the calculation has a size of 14839 rows x 14839 columns. According to tqdm, the processing speed is roughly 4.5 s/it. At that rate, the calculation would take approximately 50 days, which is not feasible for me. Is there a way to speed up my calculation?

You need to vectorize your division:

result = df_A.values/vector_x

This broadcasts along the row dimension and divides along the column dimension, which is what you seem to be asking for.

Compared to your double for-loop, you are taking advantage of contiguity and homogeneity of the data in memory. This allows for a massive speedup.
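A minimal sketch of the vectorized version, using the variable names from the question (the small shapes here are only illustrative stand-ins for the 14839 x 14839 data):

```python
import numpy as np
import pandas as pd

# Small stand-in DataFrames (names follow the question; sizes are illustrative)
df_t = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4))
xout = pd.DataFrame([[1.0], [2.0], [4.0], [8.0]])  # one divisor per column

# A (3, 4) array divided by a (1, 4) array broadcasts down the rows,
# so every column is divided by its own divisor in a single operation.
df_A = pd.DataFrame(df_t.values / xout.T.values,
                    index=df_t.index, columns=df_t.columns)
```

The single NumPy division replaces the entire double loop, so the runtime drops from one pass per cell to one vectorized pass over the whole array.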

Edit: Coming back to this answer today, I noticed that converting to a numpy array first speeds up the computation. Locally I get a 10x speedup for an array of a size similar to the one in the question above. I have edited my answer accordingly.

I'm on mobile now, but you should try to avoid every for loop in Python - there's always a better way.

For one, I know you can divide a pandas column (Series) by another column to get your desired result. To divide every column by the matching value from another DataFrame, I think you would still need to iterate, but with only one for loop, which is already a performance boost.

I would strongly recommend temporarily converting to a numpy ndarray and working with that.
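A sketch of that single-loop approach, reusing the variable names from the question (the small shapes are illustrative):

```python
import numpy as np
import pandas as pd

df_t = pd.DataFrame(np.arange(6, dtype=float).reshape(2, 3))
xout = pd.DataFrame([[2.0], [4.0], [5.0]])  # one divisor per column

# One loop over columns: each whole Series is divided at once,
# instead of touching every cell individually.
df_A = df_t.copy()
for i, col in enumerate(df_A.columns):
    df_A[col] = df_A[col] / xout.iloc[i, 0]
```

Each iteration performs a vectorized Series division, so only 14839 iterations are needed instead of 14839 squared cell updates.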
