How do I efficiently assign a single value per groupby group in Pandas

Question

I have a Pandas DataFrame with a column of non-unique numbers. I want to generate a different random number for each distinct value in that column, but return the same random number on every row where that value appears, i.e. the output DataFrame of random numbers should have the same shape as the ungrouped DataFrame.

I can do this like: df.groupby('NonUnique').transform(lambda x: np.random.rand())

This returns a different random number for each group in each column of df, as desired.

However, this is slow for large DataFrames, whereas np.random.rand(df.size) is very fast. Is there a more efficient way to achieve what I want? I can't seem to find a way to vectorise the assignment per group...

Answer 1

Create an array with one random number per unique value, then use factorize with NumPy indexing to repeat each number across the rows of its group:

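import numpy as np
import pandas as pd

np.random.seed(123)

df = pd.DataFrame({'A': list('aaabbb')})

# one random number per unique value in A
a = np.random.rand(len(df['A'].unique()))

# factorize gives an integer code per row; indexing with it repeats each number
df['B'] = a[pd.factorize(df['A'])[0]]
print(df)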
   A         B
0  a  0.696469
1  a  0.696469
2  a  0.696469
3  b  0.286139
4  b  0.286139
5  b  0.286139

Detail: factorize returns an integer code for each row, one code per unique value:

print (pd.factorize(df.A)[0])
[0 0 0 1 1 1]
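The question asks for an output with the same shape as the whole DataFrame, with a separate random number per column. The same indexing idea could be extended to that case roughly as follows. This is only a sketch: the column names NonUnique, x and y and the sample data are made up for illustration, and it uses np.random.default_rng rather than the seeded global generator above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(123)

# Hypothetical example data: 'NonUnique' is the grouping key, x and y are value columns
df = pd.DataFrame({
    'NonUnique': [10, 10, 20, 30, 20, 10],
    'x': range(6),
    'y': range(6),
})

# codes[i] is the integer label of row i's group; uniques holds the distinct keys
codes, uniques = pd.factorize(df['NonUnique'])

value_cols = df.columns.drop('NonUnique')

# one random number per (group, column) pair
per_group = rng.random((len(uniques), len(value_cols)))

# indexing with codes repeats each group's row of random numbers for every member row
out = pd.DataFrame(per_group[codes], index=df.index, columns=value_cols)
print(out)
```

Only len(uniques) * len(value_cols) random numbers are drawn, and the repetition is done by a single NumPy indexing operation instead of one Python call per group.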

Answer 2

If you're grouping anyway, you can just use ngroup(), which numbers each group:

df.groupby('column').ngroup()

or

df.groupby('column').transform('ngroup')
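ngroup() on its own returns group numbers rather than random numbers, so to answer the question it would still need to be combined with an array of random numbers. A minimal sketch, assuming the column is called A as in the other answer and contains no missing keys:

```python
import numpy as np
import pandas as pd

np.random.seed(123)

df = pd.DataFrame({'A': list('aaabbb')})

g = df.groupby('A')

# integer group number for every row (groups are numbered in sorted key order by default)
codes = g.ngroup().to_numpy()

# one random number per group
a = np.random.rand(g.ngroups)

# repeat each group's random number across its rows
df['B'] = a[codes]
print(df)
```

Note that groupby numbers the groups in sorted key order by default, while pd.factorize numbers them in order of first appearance, so the two approaches may give a particular group a different random number when the keys are not already sorted.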

solution1
3 ACCEPTED 2019-12-05 15:01:07

solution2
2 2019-12-05 15:07:01