Items combination and frequency count in Pandas

I have a dataset like this:

ORDER_CODE  ITEM_ID  ITEM_NAME  TOTALPRICE
123         id1      name1      345
321         id2      name2      678

and a function that calculates which items were sold together, and which of those were the most popular or the most expensive.

Output:

ITEM_ID  sold together
id1      [ id33, id23, id12 ]
id2      [ id56, id663 ]

I am using this function:

import pandas as pd
from tqdm import tqdm

def freq(df):
    
    hit_list = []  # placeholder: the post elides the actual list of item IDs to skip
    
    result = pd.DataFrame(columns = ['ITEM_ID', 'sold together'])
    
    unic_arc  = df['ITEM_ID'].unique()
    unic_num = df['ORDER_CODE'].unique()
    data_arc ={}
    data_num={}
    for i in unic_arc:
        data_arc[i] = {}
        
    # Total turnover per item (the post used an undefined name, response_ur; df is assumed here)
    tturns = df[['ITEM_ID', 'TOTALPRICE']].groupby(by = 'ITEM_ID', as_index = False).sum()
    tturns = tturns.rename(columns = {'ITEM_ID' : 'inum', 'TOTALPRICE' : 'turn'})
    
    for i in tqdm(unic_arc):
        b = df[df['ITEM_ID'] == i]['ORDER_CODE'].values
        for t in b:
            a = df[df['ORDER_CODE'] == t]['ITEM_ID'].values  # the data has no 'ID' column
            if i in a:
                for arc in a:
                    if int(arc) not in hit_list: 
                        if arc != i:
                            if arc in data_arc[i]:
                                data_arc[i][arc]+=1
                            else:
                                data_arc[i][arc] = 1
                            
        dd = data_arc[i]
                
        tmp = pd.DataFrame(columns = ['inum', 'freq'])
        tmp['inum'] = data_arc[i].keys()
        tmp['freq'] = data_arc[i].values()
        
        tmp['inum'] = tmp['inum'].astype(str)
        tturns['inum'] = tturns['inum'].astype(str)
            
        tmp = pd.merge(tmp, tturns, on = 'inum', how = 'inner')

        tmp = tmp.sort_values(by = ['freq', 'turn'], ascending = False)
        
        # Keep at most the 15 top-ranked partner items as a comma-separated string
        inums = str(tmp['inum'].values[:15]).replace("\n", "").replace(' ', ',').replace('\'', '')
            
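        # DataFrame.append was removed in pandas 2.0; use pd.concat and the column names declared for result
        result = pd.concat([result, pd.DataFrame([{'ITEM_ID' : i, 'sold together' : inums}])], ignore_index = True)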
                            
    return result

I tried adding a merge inside the function on every iteration to attach ITEM_NAME, but it takes too long. My dataset has about 10kk (10 million) rows.

I need to add one more column to my output containing the 'ITEM_NAME' of each item in the 'sold together' list. How can I calculate it quickly?

This might do it:

import pandas as pd

df = pd.DataFrame( {
                    'ORDER_CODE':['123','321','123','123','321','555'], 
                    'ITEM_ID':[1,2,5,5,4,6],
                    'ITEM_NAME':['name1','name2','name3','name4','name5','name6'],
                    'TOTALPRICE':[10,20,50,50,40,60]}
                  )

result = df.groupby("ORDER_CODE").agg({"ITEM_ID":list, "ITEM_NAME":list, "TOTALPRICE":"sum"})
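The aggregation above yields one row per order. If the goal is one row per item, as in the question's expected output, a possible extension is to self-merge the order lines on ORDER_CODE and count the resulting pairs. The following is only a sketch under assumptions: the toy data is made up, and the helper name `sold_together`, the `top_n` parameter and the `_OTHER` / `sold_together_names` column names are hypothetical, not from the original posts. Ranking by frequency and then by total price mirrors the sort in the question's function.

```python
import pandas as pd

# Toy data, made up for illustration: one row per item sold in an order.
df = pd.DataFrame({
    "ORDER_CODE": ["123", "321", "123", "123", "321", "555"],
    "ITEM_ID": [1, 2, 5, 3, 4, 6],
    "ITEM_NAME": ["name1", "name2", "name5", "name3", "name4", "name6"],
    "TOTALPRICE": [10, 20, 50, 30, 40, 60],
})

def sold_together(df, top_n=15):
    # Drop repeated items within an order so each pair counts once per order.
    items = df[["ORDER_CODE", "ITEM_ID"]].drop_duplicates()

    # Self-merge on the order: every (item, other item) pair sharing an order.
    pairs = items.merge(items, on="ORDER_CODE", suffixes=("", "_OTHER"))
    pairs = pairs[pairs["ITEM_ID"] != pairs["ITEM_ID_OTHER"]]

    # Number of orders in which each pair appears.
    freq = (pairs.groupby(["ITEM_ID", "ITEM_ID_OTHER"])
                 .size()
                 .reset_index(name="freq"))

    # Turnover and name of each item, computed once instead of per iteration.
    info = (df.groupby("ITEM_ID")
              .agg(turn=("TOTALPRICE", "sum"), OTHER_NAME=("ITEM_NAME", "first"))
              .reset_index()
              .rename(columns={"ITEM_ID": "ITEM_ID_OTHER"}))
    freq = freq.merge(info, on="ITEM_ID_OTHER", how="left")

    # Rank partners by frequency, then turnover, and keep the top_n per item.
    freq = freq.sort_values(["ITEM_ID", "freq", "turn"],
                            ascending=[True, False, False])
    top = freq.groupby("ITEM_ID").head(top_n)

    # Collapse to one row per item with parallel lists of IDs and names.
    return (top.groupby("ITEM_ID")
               .agg(sold_together=("ITEM_ID_OTHER", list),
                    sold_together_names=("OTHER_NAME", list))
               .reset_index())

result = sold_together(df)
```

Items that never share an order with another item (ITEM_ID 6 in the toy data) do not appear in the result. Note that the self-merge creates a row for every pair within each order, so on roughly 10 million rows with large orders it may need a lot of memory; processing the orders in chunks may be necessary in that case.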

Another good answer on how to create a list in a groupby aggregation:
