I couldn't find an existing solution and want something faster than what I already have. The idea is to assign an ID to each value in the 'fruit' column, e.g.
df = pd.DataFrame(['apple', 'apple', 'orange', 'orange', 'lemon', 'apple', 'apple', 'lemon', 'lemon'], columns=['fruit'])
However, when a value reappears later, it should get a new ID rather than reusing the old one, so that instead of:
df['id'] = [0, 0, 1, 1, 2, 0, 0, 2, 2]
I will end up with:
df['id'] = [0, 0, 1, 1, 2, 3, 3, 4, 4]
So the IDs keep increasing until the end, even though only a few distinct fruits change positions.
Here is my solution, but it's really slow, and I bet there is something Pandas can do natively:
def create_ids(df):
    id_df = df.copy()
    i = -1  # start at -1 so the first group gets ID 0
    last_row = None
    id_df['id'] = np.nan
    for idx, value in id_df['fruit'].items():  # .iteritems() is deprecated
        if value != last_row:
            i += 1
        id_df.loc[idx, 'id'] = i  # avoid chained assignment
        last_row = value
    return id_df['id']
Any ideas?
You can use .groupby() followed by .ngroup():
df["id"] = df.groupby((df["fruit"] != df["fruit"].shift(1)).cumsum()).ngroup()
print(df)
Prints:
fruit id
0 apple 0
1 apple 0
2 orange 1
3 orange 1
4 lemon 2
5 apple 3
6 apple 3
7 lemon 4
8 lemon 4
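The run key built with shift/cumsum in the line above can also serve as the ID directly, skipping the extra groupby. A minimal sketch of that shortcut (assuming the same sample frame):

```python
import pandas as pd

df = pd.DataFrame(
    ['apple', 'apple', 'orange', 'orange', 'lemon',
     'apple', 'apple', 'lemon', 'lemon'],
    columns=['fruit'],
)

# A new run starts wherever the value differs from the previous row;
# the cumulative sum of those breakpoints numbers the runs from 1,
# so subtract 1 to make the IDs start at 0.
df['id'] = (df['fruit'] != df['fruit'].shift()).cumsum() - 1
print(df['id'].tolist())  # [0, 0, 1, 1, 2, 3, 3, 4, 4]
```

This works because consecutive equal values produce no new breakpoint, while any change of value (including a return to an earlier fruit) does.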
Or if you prefer itertools.groupby:
from itertools import groupby

data, i = [], 0
for _, g in groupby(df["fruit"]):
    data.extend([i] * sum(1 for _ in g))
    i += 1
df["id"] = data
print(df)