
How to avoid iterrows for this pandas dataframe processing

I need some help in converting the following code to a more efficient one without using iterrows().

for index, row in df.iterrows():
    alist = row['index_vec'].strip("[] ").split(",")
    blist = [int(i) for i in alist]
    for col in blist:
        df.loc[index, str(col)] = df.loc[index, str(col)] + 1

The code above reads the string in the 'index_vec' column, parses it into a list of integers, and then increments the column named after each integer by one. An example of the output is shown below:

[Image: example output]

Take the 0th row as an example. Its string value is "[370, 370, -1]", so the code above increments column "370" by 2 and column "-1" by 1. The displayed output is truncated, so only columns "-10" to "17" are shown.
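To make the per-row step concrete, here is a minimal sketch of parsing one 'index_vec' string and counting its values. The helper name parse_counts is hypothetical and not part of the original post, and the sketch assumes the string is never empty.

```python
from collections import Counter

def parse_counts(index_vec: str) -> Counter:
    """Parse a string such as "[370, 370, -1]" and count each integer."""
    tokens = index_vec.strip("[] ").split(",")
    # int() tolerates the leading space left over from splitting on ","
    return Counter(int(tok) for tok in tokens)

counts = parse_counts("[370, 370, -1]")
print(counts)  # column "370" gets +2, column "-1" gets +1
```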

Using iterrows() is very slow on a large dataframe, and I'd like some help speeding this up. Thank you.

You can do this with explode and crosstab:

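# One row per parsed token; the index keeps the original row label
a = df['index_vec'].str.strip("[] ").str.split(",").explode().str.strip()
cols = df.columns.drop('index_vec')  # counter columns only
# Per-row token counts, aligned to the counter columns (missing -> 0)
s = pd.crosstab(a.index, a).reindex(index=df.index, columns=cols, fill_value=0)
df[cols] = df[cols].add(s)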
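A small worked example of the explode and crosstab idea may help. This is only a sketch on a made-up two-row frame: the sample data and the name counter_cols are assumptions for illustration. Tokens are stripped of whitespace so that they match the string column labels, and only the counter columns are updated, leaving 'index_vec' untouched.

```python
import pandas as pd

# Made-up sample data; column labels are strings, as in the question
df = pd.DataFrame({
    "index_vec": ["[370, 370, -1]", "[1201]"],
    "1201": [0, 0], "370": [0, 1], "-1": [1, 1],
})

# One row per token, indexed by the original row label
tokens = df["index_vec"].str.strip("[] ").str.split(",").explode().str.strip()

# Rows = original row labels, columns = tokens, values = occurrence counts
counts = pd.crosstab(tokens.index, tokens)

# Align the counts to the counter columns and add them in one step
counter_cols = df.columns.drop("index_vec")
counts = counts.reindex(index=df.index, columns=counter_cols, fill_value=0)
df[counter_cols] = df[counter_cols] + counts

print(df)
```

Tokens that have no matching column are dropped by the reindex step, whereas the original loop would raise a KeyError for them.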

You can also use apply with axis=1 to go row-wise: write a custom function and pass it to apply.

Example starting df:

      index_vec  1201  370  -1
0  [370, -1, -1]     0    0   1
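1   [1201, 1201]     0    1   1

import pandas as pd

df = pd.DataFrame({'index_vec': ["[370, -1, -1]", "[1201, 1201]"], '1201': [0, 0], '370': [0, 1], '-1': [1, 1]})

def add_counts(x):
    # Count how often each column label appears in this row's 'index_vec'
    counts = pd.Series(x['index_vec'].strip("[]").split(", ")).value_counts()
    # Add the counts to the matching columns (labels are strings)
    x[counts.index] = x[counts.index] + counts
    return x

# apply returns a new DataFrame, so assign the result back
df = df.apply(add_counts, axis=1)

print(df)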

Outputs:

      index_vec  1201  370  -1
0  [370, -1, -1]     0    1   3
1   [1201, 1201]     2    1   1
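One way to gain confidence in a rewrite like this is to compare it with the original loop on a small frame. The sketch below does that for the apply-based version. The wrapper names loop_version and apply_version are hypothetical, and the row is copied before it is modified as a precaution.

```python
import pandas as pd

def loop_version(df):
    """The iterrows loop from the question, applied to a copy."""
    out = df.copy()
    for index, row in out.iterrows():
        for col in (int(i) for i in row["index_vec"].strip("[] ").split(",")):
            out.loc[index, str(col)] += 1
    return out

def apply_version(df):
    """The row-wise apply approach, returning a new DataFrame."""
    def add_counts(row):
        row = row.copy()
        counts = pd.Series(row["index_vec"].strip("[]").split(", ")).value_counts()
        row[counts.index] = row[counts.index] + counts
        return row
    return df.apply(add_counts, axis=1)

df = pd.DataFrame({
    "index_vec": ["[370, -1, -1]", "[1201, 1201]"],
    "1201": [0, 0], "370": [0, 1], "-1": [1, 1],
})

expected = loop_version(df)
result = apply_version(df)

# Compare only the counter columns, with a common integer dtype
same = (
    expected.drop(columns="index_vec").astype("int64")
    .equals(result.drop(columns="index_vec").astype("int64"))
)
print(same)
```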

