import pandas as pd
data = [[12345,"AAA"],[12345,"BBB"],[12345,"CCC"],[98765,"KKK"],[98765,"MMM"],[56321,"JJJ"],[56321,"SSS"],[56321,"PPP"]]
df = pd.DataFrame(data,columns=['Sales_ID','Company_Name'])
Hi folks, I have the dataframe above and I want to generate every pair of companies within each Sales_ID group. How can I do that in Python?
I tried grouping the df and extracting all companies for each Sales_ID, but I don't know what to do next.
df.groupby('Sales_ID').apply(lambda x:x['Company_Name'].tolist())
Expected results:
Sales_ID Company Company
12345 AAA BBB
12345 AAA CCC
12345 BBB CCC
98765 KKK MMM
56321 JJJ SSS
56321 JJJ PPP
56321 SSS PPP
Thanks for the help.
Edit: @brentertainer points out that a cartesian product within each Sales_ID (a self-merge) followed by a < query is all you need to remove self-merges and duplicates irrespective of order.
df.merge(df, on='Sales_ID').query('Company_Name_x < Company_Name_y')
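For reference, here is a self-contained sketch of that one-liner on the question's sample data. It assumes company names are unique within each Sales_ID (a name repeated in a group would never satisfy the strict <) and that plain string ordering is acceptable for deciding which name ends up in which column. The name pairs is only illustrative.

```python
import pandas as pd

data = [[12345, "AAA"], [12345, "BBB"], [12345, "CCC"],
        [98765, "KKK"], [98765, "MMM"],
        [56321, "JJJ"], [56321, "SSS"], [56321, "PPP"]]
df = pd.DataFrame(data, columns=["Sales_ID", "Company_Name"])

# The self-merge pairs every company with every company sharing its Sales_ID
# (a per-group cartesian product); pandas adds the _x / _y suffixes.
pairs = (
    df.merge(df, on="Sales_ID")
      # Strict "<" drops self-pairs and keeps one ordering of each pair.
      .query("Company_Name_x < Company_Name_y")
      .reset_index(drop=True)
)
print(pairs)
```

Because the comparison is alphabetical, the last pair comes out as PPP, SSS rather than SSS, PPP; the set of pairs is the same as in the expected result.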
Original, more complicated solution: drop the self-pairs, sort the two names within each row, then drop duplicates so that ordering does not matter.
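import pandas as pd
import numpy as np
# Self-merge within each Sales_ID, then drop rows pairing a company with itself
res = df.merge(df, on='Sales_ID').query('Company_Name_x != Company_Name_y')
cols = ['Company_Name_x', 'Company_Name_y']
# Sort the two names within each row so (A, B) and (B, A) become identical
res[cols] = np.sort(res[cols].to_numpy(), axis=1)
# Keep one copy of each now-identical pair
res = res.drop_duplicates()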
Sales_ID Company_Name_x Company_Name_y
1 12345 AAA BBB
2 12345 AAA CCC
5 12345 BBB CCC
10 98765 KKK MMM
14 56321 JJJ SSS
15 56321 JJJ PPP
18 56321 PPP SSS
I would use itertools.combinations:
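import itertools
# One list of company names per Sales_ID, in order of first appearance
s=df.groupby('Sales_ID',sort=False)['Company_Name'].apply(list)
# All 2-combinations of names within each group
l=[list(itertools.combinations(x,2)) for x in s]
# Repeat each Sales_ID once per pair, then attach the flattened pairs as columns
Newdf=pd.DataFrame({'Sales_ID':s.index.repeat(list(map(len,l)))})
Newdf=pd.concat([Newdf,pd.DataFrame(sum(l,[]))],axis=1)
Newdf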
Sales_ID 0 1
0 12345 AAA BBB
1 12345 AAA CCC
2 12345 BBB CCC
3 98765 KKK MMM
4 56321 JJJ SSS
5 56321 JJJ PPP
6 56321 SSS PPP
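A variation on the same idea, sketched under the assumption that you would rather build the rows directly than flatten with sum(l, []) and concat afterwards. The column names Company_1 and Company_2 and the name pairs_df are made up for this example.

```python
import itertools
import pandas as pd

data = [[12345, "AAA"], [12345, "BBB"], [12345, "CCC"],
        [98765, "KKK"], [98765, "MMM"],
        [56321, "JJJ"], [56321, "SSS"], [56321, "PPP"]]
df = pd.DataFrame(data, columns=["Sales_ID", "Company_Name"])

# Iterating the grouped column yields (Sales_ID, Series of names) per group;
# sort=False keeps the groups in order of first appearance.
rows = [
    (sales_id, first, second)
    for sales_id, names in df.groupby("Sales_ID", sort=False)["Company_Name"]
    for first, second in itertools.combinations(names, 2)
]

# Hypothetical column names, chosen only for this illustration.
pairs_df = pd.DataFrame(rows, columns=["Sales_ID", "Company_1", "Company_2"])
print(pairs_df)
```

itertools.combinations keeps the names in their original row order, so this reproduces the expected result from the question, including SSS before PPP.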
It is not always necessary to use pandas*. I prefer using toolz or funcy to get the job done (behind the scenes they use itertools and other native Python modules and methods).
import itertools
import toolz # pip install toolz
import toolz.curried as tc
from operator import itemgetter
grouped_data = toolz.groupby(itemgetter(0), data)
{12345: [[12345, 'AAA'], [12345, 'BBB'], [12345, 'CCC']],
98765: [[98765, 'KKK'], [98765, 'MMM']],
56321: [[56321, 'JJJ'], [56321, 'SSS'], [56321, 'PPP']]}
Now to get the data you'd like you need to apply a series of steps:
result = toolz.thread_first(data, # thread_first pipes the data through a series of functions
    tc.groupby(itemgetter(0)), # group by the first element
    tc.valmap(tc.map(itemgetter(1))), # for each group, extract the second element of every row
    tc.valmap(tc.partial(itertools.combinations, r=2)), # for each group, make pairs
    tc.valmap(list)) # materialize each combinations iterator as a list (optional)
The result:
{12345: [('AAA', 'BBB'), ('AAA', 'CCC'), ('BBB', 'CCC')],
98765: [('KKK', 'MMM')],
56321: [('JJJ', 'SSS'), ('JJJ', 'PPP'), ('SSS', 'PPP')]}
If you want to frame it into pandas you can. Otherwise you can continue with a functional programming approach if this is what you seek.
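One way to frame that result into pandas, as a minimal sketch. The dict is copied from the output shown above so the snippet runs on its own, and the column names Company_1 and Company_2 are invented for the example.

```python
import pandas as pd

# Output of the toolz pipeline, copied here so the snippet is self-contained.
result = {12345: [("AAA", "BBB"), ("AAA", "CCC"), ("BBB", "CCC")],
          98765: [("KKK", "MMM")],
          56321: [("JJJ", "SSS"), ("JJJ", "PPP"), ("SSS", "PPP")]}

# Flatten {Sales_ID: [(a, b), ...]} into one (Sales_ID, a, b) row per pair.
rows = [
    (sales_id, first, second)
    for sales_id, pairs in result.items()
    for first, second in pairs
]

# Hypothetical column names, chosen only for this illustration.
framed = pd.DataFrame(rows, columns=["Sales_ID", "Company_1", "Company_2"])
print(framed)
```

If the final valmap(list) step is left out, each value is a one-shot combinations iterator; the comprehension still works, but only the first time it is run over that dict.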
*From my own experience, especially in cloud environments with serverless applications, but that's beside the point.