
How to group by and sum when all elements of one list are in another list

I have a data frame df1. The "transactions" column has an array of ints.

id     transactions
1      [1,2,3]
2      [2,3]

Data frame df2. The "items" column has an array of ints.

items  cost
[1,2]  2.0
[2]    1.0
[2,4]  4.0

I need to check whether all elements of items are in each transaction; if so, sum up the costs.

Expected Result

id    transactions    score
1     [1,2,3]         3.0
2     [2,3]           1.0
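For reference, the two sample frames can be built like this (a minimal sketch, assuming the list columns hold plain Python lists):

import pandas as pd

df1 = pd.DataFrame({
    "id": [1, 2],
    "transactions": [[1, 2, 3], [2, 3]],
})

df2 = pd.DataFrame({
    "items": [[1, 2], [2], [2, 4]],
    "cost": [2.0, 1.0, 4.0],
})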

I did the following:

import numpy as np
import pandas as pd

# cross join
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])

    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()],
                         right.values[ib2.ravel()]]))

out = cartesian_product_simplified(df1, df2)

# assign column names
out.columns = ['id', 'transactions', 'cost', 'items']

# convert pandas Series to lists
t = out["transactions"].tolist()
item = out["items"].tolist()


# check whether one list is present in another list
def check(trans, itm):
    out_list = list()
    for row in trans:
        ret = np.all(np.in1d(itm, row))
        out_list.append(ret)
    return out_list

# if True: group and sum
a = check(t, item)
for i in a:
    if i:
        print(out.groupby(['id', 'transactions']))['cost'].sum()
    else:
        print("no")

This throws TypeError: 'NoneType' object is not subscriptable.
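For what it's worth, the subscript error most likely comes from the misplaced closing parenthesis: print(...) returns None, so ['cost'] is applied to None rather than to the groupby result. The intended call would look like the line below (note that grouping by a list-valued "transactions" column may then raise its own error, since lists are not hashable):

# group by id and transactions, then sum the cost
print(out.groupby(['id', 'transactions'])['cost'].sum())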

I am new to Python and don't know how to put all these together. How do I group by and sum the cost when all items of one list are in another list?

The simplest way is just to check all items for all transactions:

# df1 and df2 are initialized

def sum_score(transaction):
    score = 0
    for _, row in df2.iterrows():
        if all(item in transaction for item in row["items"]):
            score += row["cost"]
    return score

df1["score"] = df1["transactions"].map(sum_score)

It will be extremely slow at a big scale. If this is a problem, we need to iterate not over every item, but preselect only the possible ones. If you have enough memory, it can be done like this: for each item we remember all the row numbers in df2 where it appeared. So for each transaction we get its items, collect all the possibly matching lines and check only those.

import collections

# df1 and df2 are initialized

def get_sum_score_precalculated_func(items_cost_df):

    # create a dict of possible row indexes to search for each item
    items_search_dict = collections.defaultdict(set)
    for i, (_, row) in enumerate(items_cost_df.iterrows()):
        for item in row["items"]:
            items_search_dict[item].add(i)

    def sum_score(transaction):
        # collect the row indexes of df2 that could possibly match
        possible_indexes = set()
        for i in transaction:
            possible_indexes |= items_search_dict[i]

        score = 0
        for i in possible_indexes:
            row = items_cost_df.iloc[i]
            if all(item in transaction for item in row["items"]):
                score += row["cost"]
        return score

    return sum_score

df1["score"] = df1["transactions"].map(get_sum_score_precalculated_func(df2))

Here I use set, which is an unordered collection of unique values (it helps to join the possible line numbers and avoid double counting), and collections.defaultdict, which is a usual dict except that accessing an uninitialized key fills it with the given default (an empty set in my case). It helps to avoid writing if x not in my_dict: my_dict[x] = set(). I also use a so-called "closure", which means the sum_score function keeps access to items_cost_df and items_search_dict, which were available at the level where sum_score was declared, even after get_sum_score_precalculated_func has returned.
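A tiny illustration of the defaultdict(set) behaviour described above (the keys are made up, just for demonstration):

import collections

index = collections.defaultdict(set)
index["a"].add(0)        # no KeyError: the missing key is first created as an empty set
index["a"].add(2)
print(index["a"])        # {0, 2}
print(index["missing"])  # set() -- merely accessing a key creates an empty entry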

That should be much faster in case the items are quite unique and can be found only in a few lines of df2.

If you have quite a few unique items and many identical transactions, you'd better calculate the score for each unique transaction first and then just join the result.

transaction_score = []
for transaction in df1["transactions"].unique():
    score = sum_score(transaction)
    transaction_score.append([transaction, score])
transaction_score = pd.DataFrame(
    transaction_score,
    columns=["transactions", "score"])
df1 = df1.merge(transaction_score, on="transactions", how="left")

Here I use sum_score from the first code example.
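One caveat with this dedup-and-merge approach: pandas cannot hash list values, so both .unique() and .merge() will fail if "transactions" holds Python lists. A sketch that works around this with a tuple-valued helper column (the column name "transactions_key" is just illustrative):

# lists are unhashable, so use a tuple copy of "transactions" as the key
df1["transactions_key"] = df1["transactions"].apply(tuple)

transaction_score = []
for transaction in df1["transactions_key"].unique():
    transaction_score.append([transaction, sum_score(transaction)])

transaction_score = pd.DataFrame(
    transaction_score,
    columns=["transactions_key", "score"])

df1 = df1.merge(transaction_score, on="transactions_key", how="left")
df1 = df1.drop(columns="transactions_key")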

P.S. The Python error message should come with a line number, which helps a lot in understanding the problem.

import pandas as pd

# convert df_1 to a dictionary for iteration
df_1_dict = dict(zip(df_1["id"], df_1["transactions"]))
# convert df_2 to a list for iteration as there is no unique column
df_2_list = df_2.values.tolist()

# iterate through each combination to find a valid one
new_data = []
for rows in df_2_list:
    items = rows[0]
    costs = rows[1]
    for key, value in df_1_dict.items():
        # find common items in both
        common = set(value).intersection(set(items))
        # keep the combination only if every item appears in the transaction
        if len(common) == len(items):
            new_row = {"id": key, "transactions": value, "costs": costs}
            new_data.append(new_row)

merged_df = pd.DataFrame(new_data)
merged_df = merged_df[["id", "transactions", "costs"]]

# group the data by id to get total cost for each id
merged_df = (
    merged_df
    .groupby(["id"])
    .agg({"costs": "sum"})
    .reset_index()
)
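Note that ids whose transactions match nothing in df_2 never make it into new_data, so they are missing from this grouped result. A short sketch (assuming the question's sample frames) that merges it back onto df_1, fills the gaps with 0 and renames the column to match the expected output:

result = (
    df_1[["id", "transactions"]]
    .merge(merged_df, on="id", how="left")
    .rename(columns={"costs": "score"})
    .fillna({"score": 0.0})
)
print(result)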
