Assign unique ID to Pandas group but add one if repeated

Question

I couldn't find an existing solution and want something faster than what I already have. The idea is to assign a unique ID to each value in the 'fruit' column, e.g.:

df = pd.DataFrame(['apple', 'apple', 'orange', 'orange', 'lemon', 'apple', 'apple', 'lemon', 'lemon'], columns=['fruit'])

However, if a fruit reappears later, it should get a new ID (the last ID plus 1), so that instead of:

df['id'] = [0, 0, 1, 1, 2, 0, 0, 2, 2]

I will end up with:

df['id'] = [0, 0, 1, 1, 2, 3, 3, 4, 4]

So the ID keeps increasing until the end, even if there may only be 4 fruits changing their positions.

Here is my solution, but it's really slow, and I bet there is something Pandas can do natively:

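import numpy as np
import pandas as pd

def create_ids(df):
    id_df = df.copy()
    i = -1  # start at -1 so the first run gets ID 0
    last_value = None
    id_df['id'] = np.nan
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for idx, value in id_df['fruit'].items():
        if value != last_value:
            i += 1  # a new run starts whenever the value changes
        # assign through a single .loc call to avoid chained assignment
        id_df.loc[idx, 'id'] = i
        last_value = value
    return id_df['id']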

Any ideas?

Answer 1

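You can use .groupby() followed by ngroup(). Comparing each value with the previous one (via shift) marks where a new run starts, and cumsum() turns those marks into a run label to group on: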

df["id"] = df.groupby((df["fruit"] != df["fruit"].shift(1)).cumsum()).ngroup()
print(df)

Prints:

    fruit  id
0   apple   0
1   apple   0
2  orange   1
3  orange   1
4   lemon   2
5   apple   3
6   apple   3
7   lemon   4
8   lemon   4
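
To see why this works, it may help to look at the intermediate steps. The sketch below rebuilds the question's sample frame and spells out each stage; the names `changed`, `run`, `ids_ngroup` and `ids_direct` are introduced here for illustration only. It also suggests that, since the cumulative sum already numbers the runs starting from 1, subtracting 1 should give the same IDs without the groupby (an assumed shortcut, not part of the answer above).

```python
import pandas as pd

df = pd.DataFrame(
    ["apple", "apple", "orange", "orange", "lemon", "apple", "apple", "lemon", "lemon"],
    columns=["fruit"],
)

# True wherever the value differs from the previous row.
# The first row is always True because shift() puts a missing value there.
changed = df["fruit"] != df["fruit"].shift(1)

# Running count of the True marks: 1 for the first run, 2 for the second, ...
run = changed.cumsum()

# ngroup() numbers the groups from 0, in sorted order of the run labels
ids_ngroup = df.groupby(run).ngroup()

# Assumed shortcut (not in the answer above): run labels minus 1
ids_direct = run - 1

print(ids_ngroup.tolist())
print(ids_direct.tolist())
```

Note that a missing value never compares equal to anything, so with this comparison each NaN in the column would start its own run.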

Or, if you prefer itertools.groupby:

from itertools import groupby

data, i = [], 0
for _, g in groupby(df["fruit"]):
    data.extend([i] * sum(1 for _ in g))
    i += 1

df["id"] = data
print(df)
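
As a quick sanity check, the itertools loop can be wrapped in a function and compared against the pandas version. This is only a sketch: `run_ids` is a hypothetical helper name, and the use of enumerate() is a small restructuring of the same loop rather than something taken from the answer.

```python
from itertools import groupby

import pandas as pd


def run_ids(values):
    """Return one ID per element, increasing by 1 at each new run of equal values."""
    ids = []
    # itertools.groupby yields one group per run of consecutive equal values
    for i, (_, group) in enumerate(groupby(values)):
        ids.extend([i] * sum(1 for _ in group))
    return ids


df = pd.DataFrame(
    ["apple", "apple", "orange", "orange", "lemon", "apple", "apple", "lemon", "lemon"],
    columns=["fruit"],
)

df["id"] = run_ids(df["fruit"])

# Same IDs computed with the pandas approach, for comparison
pandas_ids = df.groupby((df["fruit"] != df["fruit"].shift(1)).cumsum()).ngroup()

print(df["id"].tolist())
print(df["id"].tolist() == pandas_ids.tolist())
```

Which of the two is faster likely depends on the size and dtype of the column, so it is worth timing both on real data.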

solution1
1 ACCEPTED 2021-03-28 21:54:43