Assign unique ID to Pandas group but add one if repeated

I couldn't find an existing solution and want something faster than what I already have. The idea is to assign a unique ID to each value in the 'fruit' column, e.g.:

df = pd.DataFrame(['apple', 'apple', 'orange', 'orange', 'lemon', 'apple', 'apple', 'lemon', 'lemon'], columns=['fruit'])

However, if a fruit reappears after a different one, it should get a new ID (the last ID plus 1), so that instead of:

df['id'] = [0, 0, 1, 1, 2, 0, 0, 2, 2]

I will end up with:

df['id'] = [0, 0, 1, 1, 2, 3, 3, 4, 4]

So the ID keeps increasing to the end, even if only 4 distinct fruits keep changing positions.

Here is my solution, but it's really slow, and I bet Pandas can do this natively:

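import numpy as np
import pandas as pd

def create_ids(df):
    id_df = df.copy()
    i = -1  # start below zero so the first run gets ID 0
    last_value = None
    id_df['id'] = np.nan
    # Walk the rows one by one; bump the counter whenever the value changes.
    # (.items() replaces .iteritems(), which was removed in pandas 2.0.)
    for idx, value in id_df['fruit'].items():
        if value != last_value:
            i += 1
            last_value = value
        # Assign through a single .loc call to avoid chained assignment
        id_df.loc[idx, 'id'] = i
    return id_df['id']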

Any ideas?

You can group on a marker that increments at every change of value, using .groupby() followed by ngroup():

df["id"] = df.groupby((df["fruit"] != df["fruit"].shift(1)).cumsum()).ngroup()
print(df)

Prints:

    fruit  id
0   apple   0
1   apple   0
2  orange   1
3  orange   1
4   lemon   2
5   apple   3
6   apple   3
7   lemon   4
8   lemon   4
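
To see why the one-liner works, it may help to break it into steps. The sketch below rebuilds the question's DataFrame and names the intermediate results; the variable names `starts`, `run_label` and `ids` are only illustrative. It also suggests that, for this particular case, subtracting 1 from the cumulative sum should give the same zero-based IDs without a groupby at all.

```python
import pandas as pd

df = pd.DataFrame(
    ["apple", "apple", "orange", "orange", "lemon", "apple", "apple", "lemon", "lemon"],
    columns=["fruit"],
)

# True wherever the value differs from the previous row. The first row is
# always True, because shift() leaves a missing value there.
starts = df["fruit"] != df["fruit"].shift(1)

# Running count of run starts: every row in the same run shares one label.
run_label = starts.cumsum()

# The labels start at 1, so subtracting 1 gives zero-based IDs.
ids = run_label - 1

df["id"] = ids
print(df)
```

One caveat: missing values never compare equal to each other, so consecutive NaN entries in the column would each start a new run with this approach.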

Or if you prefer itertools.groupby:

from itertools import groupby

data, i = [], 0
for _, g in groupby(df["fruit"]):
    data.extend([i] * sum(1 for _ in g))
    i += 1

df["id"] = data
print(df)
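
The itertools version works because, unlike the pandas method of the same name, itertools.groupby only groups consecutive equal elements, so a fruit that reappears later starts a new group. As a minimal sketch, the same loop can be wrapped in a reusable helper; `run_ids` is a hypothetical name, not part of any library.

```python
from itertools import groupby


def run_ids(values):
    """Return one zero-based ID per element, incremented at every change of value."""
    ids = []
    # enumerate() numbers the consecutive groups 0, 1, 2, ...
    for i, (_, group) in enumerate(groupby(values)):
        # Repeat the group's number once for each element in the group
        ids.extend(i for _ in group)
    return ids


fruits = ["apple", "apple", "orange", "orange", "lemon", "apple", "apple", "lemon", "lemon"]
print(run_ids(fruits))
```

With the question's DataFrame, this would be used as `df["id"] = run_ids(df["fruit"])`.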
