
Compare a value to the value in the previous row and assign a value to another column (Pandas) - how to speed up?

I have a very large DataFrame df (more than 10 million rows and 20 columns). I am comparing each value to the value in the previous row of the same column (df['Name']). If the two values are the same, the value in a second column (df['Run']) stays the same; otherwise it is increased by 1.

Below is an example of what the output should look like.

Name       Run
e679       1
k3333      2
k3333      2
k3333      2
u772       3
u772       3
2000       4
2000       4
2000       4
...        ...

At the moment I am using the following code:

run=1
df['Run'].iloc[0]=run

for i in range(1,len(df)):
    if df['Name'].iloc[i] == df['Name'].iloc[i-1]:
         df['Run'].iloc[i] = run
    else:
         run = run+1
         df['Run'].iloc[i] = run

This code works, but it is very slow. I guess there is a more efficient way to do the same thing. Does anyone have experience with that?

Thank you!

Use pd.factorize() like below:

print(df)
    Name
0   e679
1  k3333
2  k3333
3  k3333
4   u772
5   u772
6   2000
7   2000
8   2000

df['Run']=pd.factorize(df.Name)[0]+1
#alternative: (~df.duplicated('Name')).cumsum()
print(df)

    Name  Run
0   e679    1
1  k3333    2
2  k3333    2
3  k3333    2
4   u772    3
5   u772    3
6   2000    4
7   2000    4
8   2000    4

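Note: pd.factorize() marks NaN as -1, so after adding 1 those rows get Run 0. Also, both factorize() and the duplicated() alternative number names by first appearance, so they match the row-by-row comparison only if a name never comes back after a different name has appeared.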
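To make the NaN behaviour concrete, here is a minimal sketch on a made-up Series (the values and the variable names `names`, `codes`, `run` are only for illustration, not from the question's data):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing name
names = pd.Series(["e679", np.nan, "k3333", "k3333"])

# factorize returns integer codes plus the unique values;
# by default missing values get the sentinel code -1
codes, uniques = pd.factorize(names)

# Shifting the codes by 1 turns the NaN row into 0
run = codes + 1
print(run)
```

If rows with a missing name matter, it is probably safest to fill or drop them before numbering.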

This should work:

df['Run'] = (df['Name'] != df['Name'].shift()).cumsum()
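As a rough illustration of why the shift/cumsum line reproduces the loop, the sketch below splits it into steps on a small made-up frame in which a name reappears later. The `changed` and `Label` names are assumptions added here for comparison only.

```python
import pandas as pd

# Hypothetical sample: 'k3333' comes back after 'u772'
df = pd.DataFrame({"Name": ["e679", "k3333", "k3333", "u772", "k3333"]})

# True wherever the name differs from the previous row.
# shift() puts NaN in row 0, so the first row always counts as a change.
changed = df["Name"] != df["Name"].shift()

# Summing the change flags gives a counter that goes up by 1 at every change
df["Run"] = changed.cumsum()

# For comparison: factorize labels distinct names, not consecutive runs
df["Label"] = pd.factorize(df["Name"])[0] + 1

print(df)
```

Here the last row starts a new run under the shift/cumsum approach, while factorize reuses the label first given to 'k3333'. Because NaN never compares equal to NaN, consecutive missing names would each start a new run with this approach.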
