I have a very large DataFrame df (more than 10 million rows and 20 columns). I am comparing each value with the value in the previous row of the same column (df['Name']). If the value is the same, the value in a second column (df['Run']) stays the same; otherwise, it is increased by 1.
Below is an example of what the output should look like.
Name Run
e679 1
k3333 2
k3333 2
k3333 2
u772 3
u772 3
2000 4
2000 4
2000 4
... ...
At the moment I am using the following code:
run = 1
df['Run'].iloc[0] = run
for i in range(1, len(df)):
    if df['Name'].iloc[i] == df['Name'].iloc[i-1]:
        df['Run'].iloc[i] = run
    else:
        run = run + 1
        df['Run'].iloc[i] = run
This code works, but it is very slow. I guess there is a more efficient way to do the same thing. Does anyone have experience with this?
Thank you!
Use pd.factorize() as shown below:
print(df)
Name
0 e679
1 k3333
2 k3333
3 k3333
4 u772
5 u772
6 2000
7 2000
8 2000
df['Run'] = pd.factorize(df.Name)[0] + 1
# alternative: df['Run'] = (~df.duplicated('Name')).cumsum()
print(df)
Name Run
0 e679 1
1 k3333 2
2 k3333 2
3 k3333 2
4 u772 3
5 u772 3
6 2000 4
7 2000 4
8 2000 4
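Note: pd.factorize() marks NaN as -1, so rows with a missing Name get Run = 0 after adding 1. Also, factorize assigns the same code to every occurrence of a value, so a name that reappears later in the column gets its earlier Run number rather than a new one.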
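To make the behaviour of pd.factorize() concrete, here is a small sketch. The sample data is made up for illustration (it is not from the question): "a" reappears after "b", and the last value is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, not taken from the question:
# "a" comes back after "b", and the last Name is missing.
df = pd.DataFrame({"Name": ["a", "a", "b", "a", np.nan]})

# factorize returns (codes, uniques); codes are 0-based and NaN gets -1.
codes, uniques = pd.factorize(df["Name"])
df["Run"] = codes + 1

print(df)
```

Here the second block of "a" gets Run = 1 again and the NaN row gets Run = 0, so this approach matches the loop in the question only when equal names are always adjacent and there are no missing values.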
This should work:
df['Run'] = (df['Name'] != df['Name'].shift()).cumsum()
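As a rough illustration of why the one-liner above reproduces the loop from the question, the sketch below splits it into two steps. The sample data and the intermediate name `changed` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data: "k3333" reappears after "u772".
df = pd.DataFrame({"Name": ["e679", "k3333", "k3333", "u772", "k3333"]})

# True wherever a row differs from the row directly above it.
# The first row is compared with NaN, so it is always True.
changed = df["Name"] != df["Name"].shift()

# Each True starts a new run; the cumulative sum numbers the runs from 1.
df["Run"] = changed.cumsum()

print(df)
```

Because only adjacent rows are compared, the reappearing "k3333" starts a new run (Run = 4), which is what the original loop does. One caveat: NaN never compares equal to NaN, so consecutive missing names would each start a new run.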