Populating a "count matrix" with combinations of pandas DataFrame rows

Question

Suppose I have the following pandas DataFrame in Python 3.x:

import pandas as pd

dict1 = {'name':['dog', 'dog', 'cat', 'cat', 'cat', 'bird', 'bird', 'bird', 'bird'], 'number':[42, 42, 42, 42, 42, 42, 42, 42, 42], 'count':[1, 2, 4, 5, 7, 1, 2, 5, 8]} 
df = pd.DataFrame(dict1)

print(df)
##    name  number  count
## 0   dog      42      1
## 1   dog      42      2
## 2   cat      42      4
## 3   cat      42      5
## 4   cat      42      7
## 5  bird      42      1
## 6  bird      42      2
## 7  bird      42      5
## 8  bird      42      8

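The column count contains integers from 1 to 8. My goal is to populate an 8-by-8 zero matrix with the number of times each combination "pair" occurs, where pairs are formed within each unique category of the column name.

So the combination pairs for dog, cat and bird are: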

dog: (1, 2)
cat: (4, 5), (4, 7), (5, 7)
bird: (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)
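The pairs listed above can be reproduced with a few lines of code. This is only a minimal sketch: it rebuilds the df from the question, and the name pairs_by_name is made up for illustration.

```python
from itertools import combinations

import pandas as pd

# Same data as the DataFrame in the question
df = pd.DataFrame({
    'name': ['dog', 'dog', 'cat', 'cat', 'cat', 'bird', 'bird', 'bird', 'bird'],
    'number': [42] * 9,
    'count': [1, 2, 4, 5, 7, 1, 2, 5, 8],
})

# Map each category in `name` to the unordered pairs of its `count` values.
# `pairs_by_name` is an illustrative name, not something from the question.
pairs_by_name = {
    name: list(combinations(group['count'].tolist(), 2))
    for name, group in df.groupby('name')
}

for name, pairs in pairs_by_name.items():
    print(name, pairs)
```

Note that groupby sorts the keys, so the groups are printed in the order bird, cat, dog rather than in order of appearance.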

For each pair, I add +1 to the corresponding entry of the zero matrix.

The matrix is symmetric, i.e. (n, m) = (m, n). For the df above, the matrix would be:

   1 2 3 4 5 6 7 8
1: 0 2 0 0 1 0 0 1
2: 2 0 0 0 1 0 0 1
3: 0 0 0 0 0 0 0 0
4: 0 0 0 0 1 0 1 0
5: 1 1 0 1 0 0 1 1
6: 0 0 0 0 0 0 0 0
7: 0 0 0 1 1 0 0 0
8: 1 1 0 0 1 0 0 0

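Note that (1,2)=(2,1) has a count of 2: one from the dog combinations and one from the bird combinations.

(1) To do this, I think it would be best to build a list of "combination tuples" from the pandas DataFrame.

Something like this: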

list_combos = [(1, 2), (2, 1), (4, 5), (4, 7), (5, 7), (5, 4), (7, 4), (7, 5),
    (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8), (2, 1), (5, 1),
    (8, 1), (5, 2), (8, 2), (8, 5)]

Given that the matrix is symmetric, perhaps it would be better to use:

list_combos2 = [(1, 2), (4, 5), (4, 7), (5, 7), (1, 2), (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)]

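Given the categorical values in name, how can I compute these permutations (or combinations) of entries in a pandas DataFrame?

(2) Given the list of tuples, what is the most efficient algorithm (in terms of RAM) for populating this matrix?

I should be able to feed a list of tuples into a numpy array, but how do I fill in the zeros?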
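As a rough illustration of part (2), one way to go from a list of 1-based tuples to the filled matrix might look like the following. It is a sketch under the assumption that the matrix size n = 8 is known in advance; the variable names rows, cols and mat are illustrative.

```python
import numpy as np

# Pairs copied from list_combos2 (1-based values, each unordered pair listed once)
list_combos2 = [(1, 2), (4, 5), (4, 7), (5, 7), (1, 2),
                (1, 5), (1, 8), (2, 5), (2, 8), (5, 8)]

n = 8  # matrix size, assumed known in advance
pairs = np.asarray(list_combos2) - 1  # shape (10, 2), shifted to 0-based indices
rows, cols = pairs[:, 0], pairs[:, 1]

# Start from zeros, so entries never touched by a pair stay 0
mat = np.zeros((n, n), dtype=np.int64)

# np.add.at accumulates over repeated indices; `mat[rows, cols] += 1`
# would count the duplicated (1, 2) pair only once.
np.add.at(mat, (rows, cols), 1)

# Mirror the filled entries; no pair has the form (k, k), so the diagonal stays 0
mat = mat + mat.T
print(mat)
```

Only the n-by-n matrix and the pair array are held in memory, so for an 8-by-8 result the footprint is negligible.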

Answer 1

You can use groupby, iterate over the combinations, and build the matrix like this:

import numpy as np
from itertools import combinations

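# Square zero matrix sized by the largest value in `count`
mat = np.zeros((df['count'].max(), ) * 2)
idx = []
for _, g in df.groupby('name'):
    # 0-based index pairs for every 2-combination within the group
    idx.extend(combinations(g['count'] - 1, r=2))

# Index with a tuple of (rows, cols); recent NumPy reads a plain list as a single array index
np.add.at(mat, tuple(zip(*idx)), 1)
mat += mat.T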

array([[0., 2., 0., 0., 1., 0., 0., 1.],
       [2., 0., 0., 0., 1., 0., 0., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 1., 0.],
       [1., 1., 0., 1., 0., 0., 1., 1.],
       [0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 1., 0., 0., 0.],
       [1., 1., 0., 0., 1., 0., 0., 0.]])

There may be a faster solution, but this is the cleanest one I could come up with.
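For comparison, here is one loop-free alternative. It is not part of the answer above, just a sketch of a different technique: build a name-by-value incidence table with pd.crosstab and multiply it by its own transpose. It assumes that no value of count is repeated within a single name, which holds for the df in the question.

```python
import numpy as np
import pandas as pd

# Same data as the DataFrame in the question
df = pd.DataFrame({
    'name': ['dog', 'dog', 'cat', 'cat', 'cat', 'bird', 'bird', 'bird', 'bird'],
    'number': [42] * 9,
    'count': [1, 2, 4, 5, 7, 1, 2, 5, 8],
})

n = int(df['count'].max())

# Incidence table: one row per name, one column per value 1..n
# (columns for values that never occur, here 3 and 6, are filled with 0).
inc = (
    pd.crosstab(df['name'], df['count'])
    .reindex(columns=list(range(1, n + 1)), fill_value=0)
    .to_numpy()
)

# Entry (a, b) of inc.T @ inc is the number of names containing both a and b
mat = inc.T @ inc

# The diagonal counts how many names contain each value; pairs never use it
np.fill_diagonal(mat, 0)
print(mat)
```

The trade-off is that the incidence table has one row per distinct name, so for many categories it can use more memory than accumulating the pairs directly.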

Answer 2

Using NumPy's bincount:

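import numpy as np
from itertools import combinations, chain
from collections import defaultdict

# Collect the `count` values of each `name` in a dict of lists
d = defaultdict(list)
for tup in df.itertuples():
    d[tup.name].append(tup.count)

# i, j hold the first and second element of every within-group pair
i, j = zip(*chain(*(combinations(v, 2) for v in d.values())))
# Append the mirrored pairs (tuple concatenation) and shift to 0-based indices
i, j = np.array(i + j) - 1, np.array(j + i) - 1

# Flatten (i, j) to i * 8 + j, count occurrences, and reshape to 8 x 8
np.bincount(i * 8 + j, minlength=64).reshape(8, 8)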

array([[0, 2, 0, 0, 1, 0, 0, 1],
       [2, 0, 0, 0, 1, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 1, 0, 1, 0],
       [1, 1, 0, 1, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 1, 1, 0, 0, 0],
       [1, 1, 0, 0, 1, 0, 0, 0]])
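The bincount trick works because each cell (i, j) of an n-by-n matrix maps to the unique flat position i * n + j. A sketch of the same idea without the hard-coded 8 and 64 could look like this; it rebuilds the df from the question, and the names n, pairs and flat are illustrative.

```python
from itertools import combinations

import numpy as np
import pandas as pd

# Same data as the DataFrame in the question
df = pd.DataFrame({
    'name': ['dog', 'dog', 'cat', 'cat', 'cat', 'bird', 'bird', 'bird', 'bird'],
    'number': [42] * 9,
    'count': [1, 2, 4, 5, 7, 1, 2, 5, 8],
})

n = int(df['count'].max())  # matrix size derived from the data

# 0-based (i, j) pairs for every 2-combination within each name
pairs = [
    p
    for _, g in df.groupby('name')
    for p in combinations(g['count'].to_numpy() - 1, 2)
]
i, j = np.array(pairs).T

# Flat position of (i, j) is i * n + j; add the mirrored (j, i) positions as well
flat = np.concatenate([i * n + j, j * n + i])

# Count how often each flat position occurs, then fold back into n x n
mat = np.bincount(flat, minlength=n * n).reshape(n, n)
print(mat)
```

Passing minlength=n * n matters: without it, bincount stops at the largest flat index seen and the reshape would fail whenever the last cell of the matrix is empty.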

Answer 1
6 accepted 2018-08-12 18:34:16

Answer 2
3 2018-08-12 20:08:38