How to group by and sum when all elements of one list are in another list
I have a DataFrame df1. Its "transactions" column holds an array of ints.
id transactions
1 [1,2,3]
2 [2,3]
And a DataFrame df2. Its "items" column holds an array of ints.
items cost
[1,2] 2.0
[2] 1.0
[2,4] 4.0
I need to check whether all elements of items are present in each transaction and, if so, sum up the costs.
Expected result
id transaction score
1 [1,2,3] 3.0
2 [2,3] 1.0
This is what I tried:
# cross join
# ----------
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la,:lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()],
                         right.values[ib2.ravel()]]))

out = cartesian_product_simplified(df1, df2)
# assign column names
out.columns = ['id', 'transactions', 'cost', 'items']
# convert pandas Series to lists
t = out["transactions"].tolist()
item = out["items"].tolist()

# check whether a list is present in another list
# -----------------------------------------------
def check(trans, itm):
    out_list = list()
    for row in trans:
        ret = np.all(np.in1d(itm, row))
        out_list.append(ret)
    return out_list

# if true: group and sum
# ----------------------
a = check(t, item)
for i in a:
    if(i):
        print(out.groupby(['id','transactions']))['cost'].sum()
    else:
        print("no")
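This raises TypeError: 'NoneType' object is not subscriptable.
I am new to Python and don't know how to put all of this together. How do I group and sum the costs when all items of one list are in another list?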
The simple way is to check every items row against every transaction:
# df1 and df2 are initialized
def sum_score(transaction):
    score = 0
    for _, row in df2.iterrows():
        if all(item in transaction for item in row["items"]):
            score += row["cost"]
    return score

df1["score"] = df1["transactions"].map(sum_score)
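As a quick check, the function above can be run against the sample data from the question. The frame construction below is an assumption (the question only shows the printed frames), and the subset test on sets is a small variation on the `all(...)` check, not a different method.

```python
import pandas as pd

# Sample frames reconstructed from the question (assumed construction).
df1 = pd.DataFrame({"id": [1, 2], "transactions": [[1, 2, 3], [2, 3]]})
df2 = pd.DataFrame({"items": [[1, 2], [2], [2, 4]], "cost": [2.0, 1.0, 4.0]})


def sum_score(transaction):
    """Sum the cost of every df2 row whose items all occur in the transaction."""
    present = set(transaction)  # set lookup instead of scanning the list
    score = 0.0
    for _, row in df2.iterrows():
        if set(row["items"]) <= present:  # subset test
            score += row["cost"]
    return score


df1["score"] = df1["transactions"].map(sum_score)
print(df1)
```

For the sample data this gives a score of 3.0 for id 1 and 1.0 for id 2, which matches the expected result in the question.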
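At scale this will be extremely slow. If that is a problem, we don't need to iterate over every items row; we can preselect only the rows that could possibly match. If you have enough memory, you can do it like this: for each item, we remember all row numbers in df2 where it occurs. Then, for each transaction, we take its items, collect all candidate rows, and check only those.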
import collections

# df1 and df2 are initialized
def get_sum_score_precalculated_func(items_cost_df):
    # map each item to the positions of the rows that contain it
    items_search_dict = collections.defaultdict(set)
    for i, (_, row) in enumerate(items_cost_df.iterrows()):
        for item in row["items"]:
            items_search_dict[item].add(i)

    def sum_score(transaction):
        # union of all rows that share at least one item with the transaction
        possible_indexes = set()
        for i in transaction:
            possible_indexes |= items_search_dict[i]
        score = 0
        for i in possible_indexes:
            row = items_cost_df.iloc[i]
            if all(item in transaction for item in row["items"]):
                score += row["cost"]
        return score

    return sum_score

df1["score"] = df1["transactions"].map(get_sum_score_precalculated_func(df2))
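Here I use set, which is an unordered store of unique values (it helps to union the candidate row numbers and avoids counting a row twice). collections.defaultdict is like a regular dict, but if you access a key that has not been initialized it fills it with the given default (an empty set in my case). This helps avoid writing if x not in my_dict: my_dict[x] = set(). I also use a so-called "closure", which means the sum_score function keeps access to items_cost_df and items_search_dict, which were visible at the level where sum_score was declared, even after get_sum_score_precalculated_func has returned sum_score and finished.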
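The two ideas above, a defaultdict of sets and a closure, can be seen in isolation in the following minimal sketch. The names make_lookup and rows_for are hypothetical and used only for illustration; the input is the items column from the question written as a plain list.

```python
import collections


def make_lookup(rows):
    """Return a function that maps an item to the row numbers containing it."""
    # defaultdict(set) creates an empty set the first time a key is touched,
    # so no "if x not in my_dict" check is needed.
    index = collections.defaultdict(set)
    for row_number, items in enumerate(rows):
        for item in items:
            index[item].add(row_number)

    def rows_for(item):
        # 'index' is captured from the enclosing scope (a closure) and stays
        # available after make_lookup has returned.
        return index.get(item, set())

    return rows_for


rows_for = make_lookup([[1, 2], [2], [2, 4]])
print(rows_for(2))  # item 2 occurs in rows 0, 1 and 2
print(rows_for(3))  # empty set: item 3 occurs in no row
```

Using `index.get(item, set())` for lookups, rather than `index[item]`, keeps the dictionary from growing with an empty entry for every unknown item.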
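If the items are fairly unique and each one appears in only a few rows of df2, this should be faster.
If you have many unique items and many identical transactions, it is better to compute the score once for each unique transaction first, and then join the results.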
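# lists are unhashable, so unique() and merge() need a tuple copy of each
# transaction; "transactions_key" is a helper column added for that purpose
df1["transactions_key"] = df1["transactions"].map(tuple)
transaction_score = []
for transaction in df1["transactions_key"].unique():
    score = sum_score(transaction)
    transaction_score.append([transaction, score])
transaction_score = pd.DataFrame(
    transaction_score,
    columns=["transactions_key", "score"])
df1 = df1.merge(transaction_score, on="transactions_key", how="left")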
Here I use sum_score from the first code example.
P.S. A Python error message comes with a line number, which helps a lot in understanding the problem.
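For the error in the question, the traceback should point at the print line: print() returns None, and the ['cost'] subscript is applied to that return value rather than to the groupby object. A minimal sketch reproducing this, using a small stand-in frame (assumed data, not the asker's actual cross join):

```python
import pandas as pd

# Small stand-in for the cross-joined frame in the question (assumed data).
out = pd.DataFrame({"id": [1, 1, 2], "cost": [2.0, 1.0, 1.0]})

try:
    # print() returns None, so the ['cost'] lookup is applied to None.
    print(out.groupby("id"))["cost"].sum()
except TypeError as err:
    message = str(err)
    print(message)

# Moving the subscript inside the expression avoids the error.
totals = out.groupby("id")["cost"].sum()
print(totals)
```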
# convert df_1 to a dictionary for iteration
df_1_dict = dict(zip(df_1["id"], df_1["transactions"]))
# convert df_2 to a list for iteration, as it has no unique column
df_2_list = df_2.values.tolist()
# iterate through each combination to find the valid ones
new_data = []
for rows in df_2_list:
    items = rows[0]
    costs = rows[1]
    for key, value in df_1_dict.items():
        # find the items common to both
        common = set(value).intersection(set(items))
        # proceed only if every item of the df_2 row is in the transaction
        if len(common) == len(items):
            new_row = {"id": key, "transactions": value, "costs": costs}
            new_data.append(new_row)
merged_df = pd.DataFrame(new_data)
merged_df = merged_df[["id", "transactions", "costs"]]
# group the data by id to get the total cost for each id
merged_df = (
    merged_df
    .groupby(["id"])
    .agg({"costs": "sum"})
    .reset_index()
)
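One property of the loop above is worth noting: an id whose transaction contains no complete items list never reaches new_data, so it is missing from the grouped result. The following is a minimal sketch of keeping such ids with a total of 0.0, assuming the sample frames from the question plus a hypothetical id 3 that matches nothing.

```python
import pandas as pd

# Sample frames from the question plus a hypothetical id 3 that matches nothing.
df_1 = pd.DataFrame({"id": [1, 2, 3], "transactions": [[1, 2, 3], [2, 3], [5]]})
df_2 = pd.DataFrame({"items": [[1, 2], [2], [2, 4]], "cost": [2.0, 1.0, 4.0]})

matches = []
for items, cost in zip(df_2["items"], df_2["cost"]):
    for key, value in zip(df_1["id"], df_1["transactions"]):
        if set(items).issubset(value):  # every item occurs in the transaction
            matches.append({"id": key, "costs": cost})

totals = pd.DataFrame(matches).groupby("id")["costs"].sum()

# Reindex on all ids so that transactions without any match get 0.0.
all_ids = pd.Index(df_1["id"], name="id")
result = totals.reindex(all_ids, fill_value=0.0).reset_index()
print(result)
```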