pandas rolling window aggregation on a string column

Question

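I am struggling to compute a string aggregation over a rolling window in pandas.

I have the current df, where t_dat is the purchase date, and customer_id and article_id are self-explanatory.

t_dat	customer_id	article_id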
2020-04-24	486230	781570001
2020-04-24	486230	598755030
2020-04-27	486230	836997001
2020-05-02	486230	687707005
2020-06-03	486230	741356002

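I would like to group by customer_id and concatenate the article ids over a weekly rolling window (see the article_ids column in the table below). pandas does not seem to support rolling window aggregation on string columns, so I tried resampling, but it does not do what I expect (see the table below for my expected result).

t_dat	customer_id	article_id	article_ids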
2020-04-24	486230	781570001	598755030 836997001
2020-04-24	486230	598755030	781570001 836997001
2020-04-27	486230	836997001	836997001 687707005
2020-05-02	486230	687707005	687707005
2020-06-03	486230	741356002	741356002

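My goal is to understand whether there are purchase patterns between different article_ids (i.e., are some articles bought shortly after a customer buys another article?).

To make this more explicit, I have framed the problem in two steps:

1. Which items did a customer purchase within 7 days after purchasing any other item? I want to repeat this exercise for every customer and every purchased product.
2. Once this is done, I would like to identify the articles that are bought together (within a week) more frequently than other products, so that I can build a basic recommendation system.

Here, I am looking for a solution to step 1.

I have tried both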

df.groupby('customer_id').rolling('7D', on = 't_dat', min_periods = 1)['article_id'].agg(' '.join).reset_index()

or

df.groupby('customer_id').rolling('7D', on = 't_dat', min_periods = 1)['article_id'].apply(lambda x: ' '.join(x.astype(str))).reset_index()

and, using resample,

df.groupby('customer_id').resample('7D', on = 't_dat')['article_id'].agg(' '.join).reset_index()

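without success. The rolling attempts fail with TypeError: sequence item 0: expected str instance, float found, and when I cast article_id to a string type I get TypeError: must be real number, not str. The resample attempt fails because it does not return the offsets I need: it takes a one-week interval starting at the first occurrence in the dataset and then keeps laying out fixed weekly intervals instead of rolling the offset.

I have written an alternative, but it looks very slow, and I would like to take advantage of pandas vectorized operations to speed it up: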
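The errors above are consistent with pandas rolling aggregations operating on numeric data only. One possible workaround, shown here as a minimal sketch rather than an established pandas API, is to compute the window bounds per customer with `numpy.searchsorted` on the sorted dates and join the slice by hand. The helper name `forward_window_join` is hypothetical, and the sketch assumes a forward-looking window of 7 days, inclusive on both ends, that also contains the row's own article.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "t_dat": pd.to_datetime(
        ["2020-04-24", "2020-04-24", "2020-04-27", "2020-05-02", "2020-06-03"]
    ),
    "customer_id": [486230] * 5,
    "article_id": [781570001, 598755030, 836997001, 687707005, 741356002],
})

def forward_window_join(group, days=7):
    """Hypothetical helper: for each row, join the article ids bought by the
    same customer from that row's date up to `days` days later (inclusive)."""
    group = group.sort_values("t_dat", kind="stable")
    dates = group["t_dat"].to_numpy()
    articles = group["article_id"].astype(str).to_numpy()
    # Window [date, date + days]: index of the first row with t_dat >= date,
    # and one past the last row with t_dat <= date + days.
    left = np.searchsorted(dates, dates, side="left")
    right = np.searchsorted(dates, dates + np.timedelta64(days, "D"), side="right")
    joined = [" ".join(articles[lo:hi]) for lo, hi in zip(left, right)]
    return pd.Series(joined, index=group.index)

# Process each customer separately, then align back on the original index.
parts = [forward_window_join(g) for _, g in df.groupby("customer_id")]
df["article_ids"] = pd.concat(parts)
print(df)
```

To exclude the row's own article, as the expected table seems to do in places, the joined slice could be filtered before calling `" ".join`.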

from collections import Counter
from datetime import timedelta

# For each article_id in every purchase, check which other articles were bought within the following week.

# One row per (customer, date), with the articles bought that day as a list.
articles_list = df.groupby(['customer_id', 't_dat'])['article_id'].apply(list).reset_index()

def get_recommendations():

    dict_recs = {}

    # Iterate over the per-day lists (not the raw df), so that 'article_id' is a list.
    for n, row in articles_list.iterrows():
        customer = row['customer_id']
        date_purchase = row['t_dat']
        articles_purchase = row['article_id']
        # Purchases by the same customer from this date through the next 7 days.
        df_clean = articles_list[(articles_list['customer_id'] == customer) & (articles_list['t_dat'] <= date_purchase + timedelta(days=7)) & (articles_list['t_dat'] >= date_purchase)]
        articles_to_recommend = df_clean['article_id']
        
        print("Iterating over {} row".format(n))
        # print("Articles in scope are {} \n".format(articles_to_recommend))
        
        for article in articles_purchase:
            # Flatten the lists in the window, skipping the article itself.
            articles_list_to_iter = [a for lst in articles_to_recommend for a in lst if a != article]
            # print("Articles preprocessed are {} \n".format(articles_list_to_iter))
            if article not in dict_recs:
                dict_recs[article] = articles_list_to_iter
            else:
                dict_recs[article].extend(articles_list_to_iter)
    
    # Keep the 12 most frequent co-purchases for each article.
    recs_list = {k: Counter(v).most_common(12) for k, v in dict_recs.items()}

    return recs_list

Can you suggest any alternative approach I could use to accomplish what I am looking for?
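As one possible vectorized alternative to the loop, the co-purchases can be found with a self-merge on customer_id followed by a date filter and a grouped count. This is only a sketch under assumptions: the column suffix `_other` and the names `pairs`, `counts` and `top` are made up for illustration, the window is taken as 0 to 7 days after the purchase (inclusive), and the self-merge builds every pair of purchases per customer, so memory use grows quadratically with the number of purchases per customer.

```python
import pandas as pd

df = pd.DataFrame({
    "t_dat": pd.to_datetime(
        ["2020-04-24", "2020-04-24", "2020-04-27", "2020-05-02", "2020-06-03"]
    ),
    "customer_id": [486230] * 5,
    "article_id": [781570001, 598755030, 836997001, 687707005, 741356002],
})

# Pair every purchase with every purchase made by the same customer.
pairs = df.merge(df, on="customer_id", suffixes=("", "_other"))

# Keep pairs where the other article was bought 0 to 7 days later
# and is a different article.
delta = pairs["t_dat_other"] - pairs["t_dat"]
within_week = (delta >= pd.Timedelta(0)) & (delta <= pd.Timedelta(days=7))
different = pairs["article_id"] != pairs["article_id_other"]
pairs = pairs[within_week & different]

# Count how often each (article, other article) pair occurs, then keep
# the 12 most frequent companions per article.
counts = (
    pairs.groupby(["article_id", "article_id_other"])
         .size()
         .reset_index(name="n")
         .sort_values(["article_id", "n"], ascending=[True, False])
)
top = counts.groupby("article_id").head(12)
print(top)
```

On the sample data every pair occurs once, so the counts only become informative on a dataset with many customers.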

Answer 1

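I was able to aggregate by day. Create a second dataframe and accumulate all articles per customer per day. Use pd.Grouper with freq='7D' to create the 7-day windows. Note that these are fixed, consecutive 7-day bins starting at the first date, not a window that rolls with each row.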

from io import StringIO

import pandas as pd

data="""
t_dat   customer_id article_id
2020-04-24  486230  781570001
2020-04-24  486230  598755030
2020-04-27  486230  836997001
2020-05-02  486230  687707005
2020-06-03  486230  741356002
"""

# The sample data is whitespace-separated, so split on any run of whitespace.
df = pd.read_csv(StringIO(data), sep=r'\s+')
df['t_dat'] = pd.to_datetime(df['t_dat'])
df = df.sort_values(by=['t_dat'])
df = df.set_index('t_dat')
print(df)
# Collect each customer's articles into consecutive 7-day bins.
df = df.groupby(['customer_id', pd.Grouper(level='t_dat', freq='7D')])['article_id'].apply(list).reset_index()
print(df)

Output:

   customer_id      t_dat                         article_id
0       486230 2020-04-24  [781570001, 598755030, 836997001]
1       486230 2020-05-01                        [687707005]
2       486230 2020-05-29                        [741356002]
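The output above can be used to check how the bins behave. The sketch below reruns the same grouping on the sample data and shows that two purchases only 5 days apart (2020-04-27 and 2020-05-02) end up in different bins, because the bins are fixed 7-day intervals rather than a window that moves with each purchase. The variable names `binned`, `gap`, `first_bin` and `second_bin` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "t_dat": pd.to_datetime(
        ["2020-04-24", "2020-04-24", "2020-04-27", "2020-05-02", "2020-06-03"]
    ),
    "customer_id": [486230] * 5,
    "article_id": [781570001, 598755030, 836997001, 687707005, 741356002],
})

# Same grouping as in the answer: fixed 7-day bins per customer.
binned = (
    df.set_index("t_dat")
      .groupby(["customer_id", pd.Grouper(level="t_dat", freq="7D")])["article_id"]
      .apply(list)
      .reset_index()
)
print(binned)

# These two purchases are 5 days apart, yet they fall into different bins.
gap = pd.Timestamp("2020-05-02") - pd.Timestamp("2020-04-27")
first_bin = binned.loc[0, "article_id"]
second_bin = binned.loc[1, "article_id"]
print(gap.days, 836997001 in first_bin, 687707005 in second_bin)
```

If purchases within 7 days of each other must always be seen together, a per-row window (for example, one computed from the dates of each customer's purchases) is needed instead of fixed bins.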

Solution 1
0 2022-03-18 21:17:04
