How to use itertools to extract groupby values?
import pandas as pd
data = [[12345,"AAA"],[12345,"BBB"],[12345,"CCC"],[98765,"KKK"],[98765,"MMM"],[56321,"JJJ"],[56321,"SSS"],[56321,"PPP"]]
df = pd.DataFrame(data,columns=['Sales_ID','Company_Name'])
Hi folks, I have the above dataframe and I want to pair up the companies within each Sales_ID group. How can I do that in Python?
I tried grouping df and extracting all companies for each Sales_ID, but I don't know what to do next.
df.groupby('Sales_ID').apply(lambda x:x['Company_Name'].tolist())
Expected results:
Sales_ID Company Company
12345 AAA BBB
12345 AAA CCC
12345 BBB CCC
98765 KKK MMM
56321 JJJ SSS
56321 JJJ PPP
56321 SSS PPP
Thanks for the help.
Edit: @brentertainer points out that a cartesian product followed by a < query is all you need to remove self-merges and duplicates irrespective of order.
df.merge(df, on='Sales_ID').query('Company_Name_x < Company_Name_y')
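For anyone who wants to run the one-liner on its own, here is a minimal self-contained sketch using the question's data. The variable name `pairs` and the `reset_index` call are additions for illustration, not part of the original answer.

```python
import pandas as pd

data = [[12345, "AAA"], [12345, "BBB"], [12345, "CCC"],
        [98765, "KKK"], [98765, "MMM"],
        [56321, "JJJ"], [56321, "SSS"], [56321, "PPP"]]
df = pd.DataFrame(data, columns=["Sales_ID", "Company_Name"])

# The self-merge on Sales_ID yields every (x, y) combination inside a group.
# Keeping only rows where x < y drops self-pairs and mirrored duplicates.
pairs = (
    df.merge(df, on="Sales_ID")
      .query("Company_Name_x < Company_Name_y")
      .reset_index(drop=True)  # illustrative: renumber the surviving rows
)
print(pairs)
```

Because `<` compares the strings lexicographically, each pair comes out in alphabetical order, so the last group contains PPP/SSS rather than the SSS/PPP shown in the question's expected output.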
Original, more complicated solution, which sorts each pair so that duplicates can be dropped irrespective of ordering:
import pandas as pd
import numpy as np
res = df.merge(df, on='Sales_ID').query('Company_Name_x != Company_Name_y')
cols = ['Company_Name_x', 'Company_Name_y']
res[cols] = np.sort(res[cols].to_numpy(), axis=1)
res = res.drop_duplicates()
Sales_ID Company_Name_x Company_Name_y
1 12345 AAA BBB
2 12345 AAA CCC
5 12345 BBB CCC
10 98765 KKK MMM
14 56321 JJJ SSS
15 56321 JJJ PPP
18 56321 PPP SSS
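I am using itertools:
import itertools
import pandas as pd
# collect the companies of each Sales_ID into a list, keeping the original group order
s=df.groupby('Sales_ID',sort=False)['Company_Name'].apply(list)
# all 2-element combinations per group
l=[list(itertools.combinations(x,2)) for x in s]
# repeat each Sales_ID once per pair, then attach the flattened pairs as columns
Newdf=pd.DataFrame({'Sales_ID':s.index.repeat(list(map(len,l)))})
Newdf=pd.concat([Newdf,pd.DataFrame(sum(l,[]))],axis=1)
Newdf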
Sales_ID 0 1
0 12345 AAA BBB
1 12345 AAA CCC
2 12345 BBB CCC
3 98765 KKK MMM
4 56321 JJJ SSS
5 56321 JJJ PPP
6 56321 SSS PPP
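The same idea can be written a little more directly by building the rows in one comprehension. This is only a sketch, assuming the question's `df`; the names `rows`, `pairs`, `Company_1` and `Company_2` are made up for illustration.

```python
import itertools
import pandas as pd

data = [[12345, "AAA"], [12345, "BBB"], [12345, "CCC"],
        [98765, "KKK"], [98765, "MMM"],
        [56321, "JJJ"], [56321, "SSS"], [56321, "PPP"]]
df = pd.DataFrame(data, columns=["Sales_ID", "Company_Name"])

# Iterating over the grouped column yields (Sales_ID, Series of company names).
# sort=False keeps the groups in order of first appearance.
rows = [
    (sales_id, first, second)
    for sales_id, names in df.groupby("Sales_ID", sort=False)["Company_Name"]
    for first, second in itertools.combinations(names, 2)
]

# Column names here are hypothetical; pick whatever suits your data.
pairs = pd.DataFrame(rows, columns=["Sales_ID", "Company_1", "Company_2"])
print(pairs)
```

Unlike the merge-based answer, `itertools.combinations` keeps the companies in input order, so the last row is SSS/PPP as in the expected output.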
It is not always necessary to use pandas*. I prefer using toolz or funcy to get the job done (behind the scenes they use itertools and other native Python modules and methods).
import itertools
import toolz # pip install toolz
import toolz.curried as tc
from operator import itemgetter
grouped_data = toolz.groupby(itemgetter(0), data)
{12345: [[12345, 'AAA'], [12345, 'BBB'], [12345, 'CCC']],
98765: [[98765, 'KKK'], [98765, 'MMM']],
56321: [[56321, 'JJJ'], [56321, 'SSS'], [56321, 'PPP']]}
Now, to get the data you'd like, you need to apply a series of steps:
result = toolz.thread_first(
    data,  # thread_first pipes the data through a series of functions
    tc.groupby(itemgetter(0)),  # group by the first element
    tc.valmap(tc.map(itemgetter(1))),  # for each group, extract the second element of every row
    tc.valmap(tc.partial(itertools.combinations, r=2)),  # for each group, make pairs
    tc.valmap(list),  # turn each combinations iterator into a list (not strictly necessary)
)
The result:
{12345: [('AAA', 'BBB'), ('AAA', 'CCC'), ('BBB', 'CCC')],
98765: [('KKK', 'MMM')],
56321: [('JJJ', 'SSS'), ('JJJ', 'PPP'), ('SSS', 'PPP')]}
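If installing toolz is not an option, roughly the same dictionary can be built with the standard library alone. This is a sketch of an equivalent approach, not what toolz does internally; `groups` is an illustrative name.

```python
import itertools
from collections import defaultdict

data = [[12345, "AAA"], [12345, "BBB"], [12345, "CCC"],
        [98765, "KKK"], [98765, "MMM"],
        [56321, "JJJ"], [56321, "SSS"], [56321, "PPP"]]

# Group company names by Sales_ID, preserving input order.
groups = defaultdict(list)
for sales_id, company in data:
    groups[sales_id].append(company)

# For each group, list every unordered pair of companies.
result = {
    sales_id: list(itertools.combinations(companies, 2))
    for sales_id, companies in groups.items()
}
print(result)
```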
If you want to load it into a pandas DataFrame, you can. Otherwise, you can continue with a functional programming approach if that is what you are after.
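One possible way to get from that dictionary to a DataFrame is to flatten it into rows first. In this sketch the dictionary is copied literally from the result printed above so the snippet runs without toolz, and the column names are assumptions chosen for illustration.

```python
import pandas as pd

# The result dictionary from above, written out literally.
result = {
    12345: [("AAA", "BBB"), ("AAA", "CCC"), ("BBB", "CCC")],
    98765: [("KKK", "MMM")],
    56321: [("JJJ", "SSS"), ("JJJ", "PPP"), ("SSS", "PPP")],
}

# One row per pair, with the Sales_ID repeated alongside it.
rows = [
    (sales_id, first, second)
    for sales_id, pairs in result.items()
    for first, second in pairs
]

# Hypothetical column names; rename as needed.
framed = pd.DataFrame(rows, columns=["Sales_ID", "Company_1", "Company_2"])
print(framed)
```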
*From my own experience, especially in cloud environments with serverless applications, but that's beside the point.