How to create a unique identifier based on multiple columns?

Question

I have a pandas dataframe that looks something like this:

    brand       description     former_price    discounted_price
0   A           icecream        1099.0          855.0   
1   A           cheese          469.0           375.0   
2   B           catfood         179.0           119.0   
3   C           NaN             699.0           399.0   
4   NaN         icecream        769.0           549.0
5   A           icecream        769.0           669.0

I want to create a column that will assign a unique value for each brand & description combination. Note that either the brand or the description can be missing from the dataset (notified by NaN value). Also, note that if the brand and the description is the same (duplicated) I still want the unique value to be the same for the row.

The output should look like this:

    product_key   brand         description     former_price    discounted_price
0   1             A             icecream        1099.0          855.0   
1   2             A             cheese          469.0           375.0   
2   3             B             catfood         179.0           119.0   
3   4             C             NaN             699.0           399.0   
4   5             NaN           icecream        769.0           549.0
5   1             A             icecream        769.0           669.0

The values in product_key can be anything, I just want them to be unique based on brand and description columns. Any help is immensely appreciated!

Thanks a lot!

Answer 1

You could try with pd.Series.factorize :

df.set_index(['brand','description']).index.factorize()[0]+1

Output:

So you could try this, to assign it to be the first column:

df.insert(loc=0, column='product_key', value=df.set_index(['brand','description']).index.factorize()[0]+1)

Output:

df
   product_key brand description  former_price  discounted_price
0            1     A    icecream        1099.0             855.0
1            2     A      cheese         469.0             375.0
2            3     B     catfood         179.0             119.0
3            4     C         NaN         699.0             399.0
4            5   NaN    icecream         769.0             549.0
5            1     A    icecream         769.0             669.0

Answer 2

with groupby+ngroup :

(df.fillna({'brand':'','description':''})
   .groupby(['brand','description'],sort=False).ngroup()+1)

How to create a unique identifier based on multiple columns?

Question

2 answers

solution1
4 ACCPTED 2020-07-15 15:43:43

solution2
3 2020-07-15 15:38:15

How to create a unique identifier based on multiple columns?

Question

2 answers

solution1 4 ACCPTED 2020-07-15 15:43:43

solution2 3 2020-07-15 15:38:15

solution1
4 ACCPTED 2020-07-15 15:43:43

solution2
3 2020-07-15 15:38:15