简体   繁体   中英

Pandas: How to get Unique combinations of two column values in either ways?

I would like to know how we can get the unique combination of two column values if the values are in a similar combination. Below is the dataframe

I tried using below code but my expected output is different

df.groupby(['column1', 'column2'], as_index = False).agg({'expense': 'sum'})

This is a variant of this question , but an important distinction is that it seems you don't care about the order of column1 or column2 . Before I share the solution, here's the pseudocode:

  1. Create an id column which we can use to find rows where the sets of column1 and column2 are the same
  2. Apply the approach from the linked post to id .
  3. Drop duplicates based on id

Here's my manual transcription of the data. In the future, please provide the sample data as text, instead of as a screenshot.

column1,column2,salary
ram,shyam,100
sita,geeta,500
geeta,sita,300
shyam,ram,600
sohan,mohan,200
mohan,sohan,400

And here's the code

>>> import pandas as pd
>>> df = pd.read_csv('data.csv')
>>> hash_func = lambda n: hash("-".join(sorted(n)))
>>> df['id'] = df[['column1','column2']].apply(hash_func, axis=1)
>>> df
  column1 column2  salary                   id
0     ram   shyam     100 -1387604912582040812
1    sita   geeta     500  9030593041392264307
2   geeta    sita     300  9030593041392264307
3   shyam     ram     600 -1387604912582040812
4   sohan   mohan     200  6327789560655124249
5   mohan   sohan     400  6327789560655124249
>>> df['expense'] = df.groupby('id')['salary'].transform('sum')
>>> df
   column1 column2  salary                   id  expense
0     ram   shyam     100  7227562739062788100      700
1    sita   geeta     500  6328366926112663723      800
2   geeta    sita     300  6328366926112663723      800
3   shyam     ram     600  7227562739062788100      700
4   sohan   mohan     200 -3239226935758438599      600
5   mohan   sohan     400 -3239226935758438599      600
>>> df = df.drop_duplicates(subset=['id'])
>>> df
  column1 column2  salary                   id  expense
0     ram   shyam     100  7227562739062788100      700
1    sita   geeta     500  6328366926112663723      800
4   sohan   mohan     200 -3239226935758438599      600
>>> df = df.drop(columns=['id','salary']) # some extra cleanup
>>> df 
  column1 column2  expense
0     ram   shyam      700
1    sita   geeta      800
4   sohan   mohan      600

I followed these steps

df['pairs'] = df['col1'] + '-' + df['col2']

Then apply a foo function to this column The idea of this function is to take the pairs column data and sort it based on the 1st character of each element in the pair.

for example input is ram-shyam or ram-shaym we will get output as ram-shyam

This is the foo function -

def foo(s):
    lst_s = s.split('-')
    temp = {}
    for idx, name in enumerate(lst_s):
        temp[idx]= name[0]

    temp = dict(sorted(temp.items(), key=lambda item: item[1]))
    
    final = []
    for key in temp.keys():
        final.append(lst_s[key])
    return '-'.join(final)

Now apply this function on the pairs column

df['unique-pair'] = df['pairs'].apply(foo)

The output looks now like this -

col1    col2    salary  unique-pair     pairs
0   ram     shyam   100     ram-shyam   ram-shyam
1   sita    gita    500     gita-sita   sita-gita
2   gita    sita    300     gita-sita   gita-sita
3   shyam   ram     600     ram-shyam   shyam-ram
4   sohan   mohan   200     mohan-sohan     sohan-mohan
5   mohan   sohan   400     mohan-sohan     mohan-sohan

Now you can do a group by

df.groupby(['unique-pair']).agg({'salary':sum})

final output is -

salary
unique-pair     
gita-sita   800
mohan-sohan     600
ram-shyam   700

You can sort the first and second columns so that a,b and b,a are treated as the same in groupby.

Now, since the sort() is deprecated, we can use numpy sort and re-create a new dataframe.

Assuming the following csv_file :

 column1,column2,salary
    a,b,1
    c,b,3
    b,a,10
    b,c,30
    d,e,99

We can do it as follows:

import pandas as pd
import numpy as np

df = pd.read_csv("csvfile.csv",)
print("Original:\n ",df.head())

print ("\nGrouped sum:\n")
print ((pd.concat([pd.DataFrame(np.sort(df[df.columns[:2]], axis=1), columns=df.columns[:2]),\
                   df["salary"]], axis=1)).reset_index(drop=True, inplace=False).groupby\
                   (["column1", "column2"]).sum())

The output is shown below:

Original:
    column1 column2  salary
0       a       b       1
1       c       b       3
2       b       a      10
3       b       c      30
4       d       e      99

Grouped sum:

                 salary
column1 column2        
a       b            11
b       c            33
d       e            99

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM