How to create Pandas pivot table that counts common variables?
I created the following DataFrame:
import pandas as pd

df = pd.DataFrame({
    'Product ID': ['shirt', 'dress', 'shirt', 'pants', 'jacket', 'jacket', 'dress', 'hat'],
    'Discount Group': [1, 2, 3, 2, 1, 3, 4, 5]
})
  Product ID  Discount Group
0      shirt               1
1      dress               2
2      shirt               3
3      pants               2
4     jacket               1
5     jacket               3
6      dress               4
7        hat               5
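I want to create a pivot table in which both the rows and the columns are "Discount Group", and the table values are the count of shared items from "Product ID". For example, 1 (column) and 3 (row) both have "shirt" as a common item, so their value should be 1.
It should look like this: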
   1  2  3  4  5
1  1  0  1  0  0
2  0  1  0  1  0
3  1  0  1  1  0
4  0  1  0  1  0
5  0  0  0  0  1
I tried:
df.pivot_table(values='Product ID', index=['Discount Group'], columns='Discount Group', aggfunc='count')
This returns:
   1  2  3  4  5
1  1  0  0  0  0
2  0  1  0  0  0
3  0  0  1  0  0
4  0  0  0  1  0
5  0  0  0  0  1
I am not sure pivot_table will help here, but here is what you can do.
First, groupby on "Discount Group" and collect all the "Product ID" values of each group into a list:
df2 = df.groupby('Discount Group')['Product ID'].apply(list).reset_index()
df2
We get:
Discount Group Product ID
-- ---------------- -------------------
0 1 ['shirt', 'jacket']
1 2 ['dress', 'pants']
2 3 ['shirt', 'jacket']
3 4 ['dress']
4 5 ['hat']
Next, we want the "Cartesian product" of this df with itself. To get it, we do an outer merge on a constant key:
df2['key'] = 0
df3 = df2.merge(df2, on = 'key', how = 'outer').drop(columns=['key'])
df3
We get this:
Discount Group_x Product ID_x Discount Group_y Product ID_y
-- ------------------ ------------------- ------------------ -------------------
0 1 ['shirt', 'jacket'] 1 ['shirt', 'jacket']
1 1 ['shirt', 'jacket'] 2 ['dress', 'pants']
2 1 ['shirt', 'jacket'] 3 ['shirt', 'jacket']
3 1 ['shirt', 'jacket'] 4 ['dress']
4 1 ['shirt', 'jacket'] 5 ['hat']
5 2 ['dress', 'pants'] 1 ['shirt', 'jacket']
6 2 ['dress', 'pants'] 2 ['dress', 'pants']
7 2 ['dress', 'pants'] 3 ['shirt', 'jacket']
8 2 ['dress', 'pants'] 4 ['dress']
9 2 ['dress', 'pants'] 5 ['hat']
10 3 ['shirt', 'jacket'] 1 ['shirt', 'jacket']
11 3 ['shirt', 'jacket'] 2 ['dress', 'pants']
12 3 ['shirt', 'jacket'] 3 ['shirt', 'jacket']
13 3 ['shirt', 'jacket'] 4 ['dress']
14 3 ['shirt', 'jacket'] 5 ['hat']
15 4 ['dress'] 1 ['shirt', 'jacket']
16 4 ['dress'] 2 ['dress', 'pants']
17 4 ['dress'] 3 ['shirt', 'jacket']
18 4 ['dress'] 4 ['dress']
19 4 ['dress'] 5 ['hat']
20 5 ['hat'] 1 ['shirt', 'jacket']
21 5 ['hat'] 2 ['dress', 'pants']
22 5 ['hat'] 3 ['shirt', 'jacket']
23 5 ['hat'] 4 ['dress']
24 5 ['hat'] 5 ['hat']
Note how every pair of "Discount Group" values, together with the corresponding "Product ID" lists, now sits in its own row.
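As a side note, on pandas 1.2 or later the constant-key trick can be skipped, because `merge` accepts `how='cross'`. The following is a minimal sketch of the same pairing step under that version assumption; the names `groups` and `pairs` are chosen here for illustration and are not from the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Product ID': ['shirt', 'dress', 'shirt', 'pants', 'jacket', 'jacket', 'dress', 'hat'],
    'Discount Group': [1, 2, 3, 2, 1, 3, 4, 5],
})

# One row per discount group, with its products collected into a list.
groups = df.groupby('Discount Group')['Product ID'].apply(list).reset_index()

# how='cross' (assumes pandas >= 1.2) pairs every row with every row,
# so no helper 'key' column has to be added and dropped again.
pairs = groups.merge(groups, how='cross')
print(pairs.head())
```

The resulting columns carry the same `_x` / `_y` suffixes as the constant-key merge, so the later steps should work unchanged.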
Next, for each row, we count the number of products that are present in both the "Product ID_x" and "Product ID_y" lists, and put that into a "count" column:
df3['count'] = df3.apply(lambda row: len(set(row['Product ID_x']) & set(row['Product ID_y'])), axis=1)
df3
So it looks like this:
Discount Group_x Product ID_x Discount Group_y Product ID_y count
-- ------------------ ------------------- ------------------ ------------------- -------
0 1 ['shirt', 'jacket'] 1 ['shirt', 'jacket'] 2
1 1 ['shirt', 'jacket'] 2 ['dress', 'pants'] 0
2 1 ['shirt', 'jacket'] 3 ['shirt', 'jacket'] 2
3 1 ['shirt', 'jacket'] 4 ['dress'] 0
4 1 ['shirt', 'jacket'] 5 ['hat'] 0
5 2 ['dress', 'pants'] 1 ['shirt', 'jacket'] 0
6 2 ['dress', 'pants'] 2 ['dress', 'pants'] 2
7 2 ['dress', 'pants'] 3 ['shirt', 'jacket'] 0
8 2 ['dress', 'pants'] 4 ['dress'] 1
9 2 ['dress', 'pants'] 5 ['hat'] 0
10 3 ['shirt', 'jacket'] 1 ['shirt', 'jacket'] 2
11 3 ['shirt', 'jacket'] 2 ['dress', 'pants'] 0
12 3 ['shirt', 'jacket'] 3 ['shirt', 'jacket'] 2
13 3 ['shirt', 'jacket'] 4 ['dress'] 0
14 3 ['shirt', 'jacket'] 5 ['hat'] 0
15 4 ['dress'] 1 ['shirt', 'jacket'] 0
16 4 ['dress'] 2 ['dress', 'pants'] 1
17 4 ['dress'] 3 ['shirt', 'jacket'] 0
18 4 ['dress'] 4 ['dress'] 1
19 4 ['dress'] 5 ['hat'] 0
20 5 ['hat'] 1 ['shirt', 'jacket'] 0
21 5 ['hat'] 2 ['dress', 'pants'] 0
22 5 ['hat'] 3 ['shirt', 'jacket'] 0
23 5 ['hat'] 4 ['dress'] 0
24 5 ['hat'] 5 ['hat'] 1
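The counting step is just a set intersection per row. Here is a small sketch of that idea on a hand-made three-row frame, using a list comprehension over `zip` instead of a row-wise `apply`; the helper name `shared_count` and the sample rows are made up for illustration.

```python
import pandas as pd

# A tiny stand-in for the merged frame: two list columns to compare row by row.
pairs = pd.DataFrame({
    'Product ID_x': [['shirt', 'jacket'], ['dress', 'pants'], ['hat']],
    'Product ID_y': [['shirt', 'jacket'], ['dress'], ['dress']],
})

def shared_count(left, right):
    """Number of distinct items that appear in both lists."""
    return len(set(left) & set(right))

# zip walks the two columns in parallel, one pair of lists per row.
pairs['count'] = [
    shared_count(x, y)
    for x, y in zip(pairs['Product ID_x'], pairs['Product ID_y'])
]
print(pairs)
```

Because the lists are converted to sets, a product that is listed twice within one group is only counted once.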
We are almost done. Set the index and unstack:
df3.set_index(['Discount Group_x','Discount Group_y'])['count'].unstack(level = 1)
to get:
Discount Group_y 1 2 3 4 5
Discount Group_x
1 2 0 2 0 0
2 0 2 0 1 0
3 2 0 2 0 0
4 0 1 0 1 0
5 0 0 0 0 1
...but a bit ugly:
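from itertools import product
# One list of products per discount group, indexed by group
s = df.groupby('Discount Group')['Product ID'].apply(list)
# For every ordered pair of groups: [(group_a, group_b), (products_a, products_b)]
pairs = [[(p[0][0], p[1][0]), (p[0][1], p[1][1])] for p in product(s.items(), repeat=2)]
# [group_a, group_b, number of shared products]
count = [[p[0][0], p[0][1], len(set(p[1][0]) & set(p[1][1]))] for p in pairs]
count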
This produces a list of lists with the discount group IDs in the first and second columns and the count of overlapping items in the third:
[[1, 1, 2],
[1, 2, 0],
[1, 3, 2],
[1, 4, 0],
[1, 5, 0],
[2, 1, 0],
[2, 2, 2],
[2, 3, 0],
[2, 4, 1],
[2, 5, 0],
[3, 1, 2],
[3, 2, 0],
[3, 3, 2],
[3, 4, 0],
[3, 5, 0],
[4, 1, 0],
[4, 2, 1],
[4, 3, 0],
[4, 4, 1],
[4, 5, 0],
[5, 1, 0],
[5, 2, 0],
[5, 3, 0],
[5, 4, 0],
[5, 5, 1]]
Now we put it into a df and unstack:
pd.DataFrame(count).set_index([0,1]).unstack(level = 1)
producing:
   2
1  1  2  3  4  5
0
1  2  0  2  0  0
2  0  2  0  1  0
3  2  0  2  0  0
4  0  1  0  1  0
5  0  0  0  0  1
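For comparison, the same table can probably be reached without building all pairs explicitly. This is a different technique from the two answers above, sketched here as an assumption-laden alternative: build a 0/1 group-by-product membership matrix with `pd.crosstab` and multiply it by its transpose. The names `membership` and `shared` are made up for this sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'Product ID': ['shirt', 'dress', 'shirt', 'pants', 'jacket', 'jacket', 'dress', 'hat'],
    'Discount Group': [1, 2, 3, 2, 1, 3, 4, 5],
})

# Rows: discount groups, columns: products, values: 1 if the group contains the product.
# clip(upper=1) makes a product that is repeated inside one group count only once.
membership = pd.crosstab(df['Discount Group'], df['Product ID']).clip(upper=1)

# Entry (i, j) of the product sums, over all products, "in group i AND in group j",
# which is the number of products the two groups share.
shared = membership.dot(membership.T)
print(shared)
```

On the sample data this yields the same numbers as the unstacked tables above, for example 2 for groups 1 and 3 and 1 for groups 2 and 4.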