Is there a way to vectorize counting the co-occurrence of items in pandas/numpy?

Question

I often need to generate network graphs based on the co-occurrence of items in a column. I start with something like this:

           letters
0  [b, a, e, f, c]
1        [a, c, d]
2        [c, b, j]

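In the example below, I want a table containing every pair of letters, plus a "weight" column that records how many times the two letters of each pair appear together in the same row (see the output at the bottom for an example).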

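I currently do most of the work with for loops, and I would like to know whether it can be vectorized, because I often work with very large datasets that take a long time to process this way. I am also concerned about staying within memory limits. This is my current code: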

import pandas as pd

# Make some data
df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})

# I make a list of sets, which contain pairs of all the elements
# that co-occur in the data in the same list
sets = []
for lst in df['letters']:
    for i, a in enumerate(lst):
        for b in lst[i:]:
            if not a == b:
                sets.append({a, b})

# Sets now looks like:
# [{'a', 'b'},
#  {'b', 'e'},
#  {'b', 'f'},...

# Dataframe with one column containing the sets
df = pd.DataFrame({'weight': sets})

# We count how many times each pair occurs together
df = df['weight'].value_counts().reset_index()

# Split the sets into two separate columns
split = pd.DataFrame(df['index'].values.tolist()) \
          .rename(columns = lambda x: f'Node{x+1}') \
          .fillna('-')

# Merge the 'weight' column back onto the dataframe
df = pd.concat([df['weight'], split], axis = 1)

print(df.head())

# Output:
   weight Node1 Node2
0       2     c     b
1       2     a     c
2       1     f     e
3       1     d     c
4       1     j     b

Answer 1

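Notes:

As suggested in the other answers, use collections.Counter for the counting. Since it behaves like a dict, it requires hashable types. {a,b} is not hashable because it is a set. Replacing it with a tuple solves the hashing problem but introduces possible duplicates (for example ('a', 'b') and ('b', 'a')). To fix this, simply sort the tuple.

Since sorted returns a list, we need to turn it back into a tuple: tuple(sorted((a,b))). A bit cumbersome, but convenient in combination with Counter.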
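The two points above (sets are unhashable, and unsorted tuples split the count) can be seen in a small sketch. The `pairs` list below is made-up illustration data, not taken from the question.

```python
from collections import Counter

# Hypothetical pairs: the first two describe the same unordered pair.
pairs = [("a", "b"), ("b", "a"), ("a", "c")]

# A set cannot serve as a Counter key because it is unhashable.
try:
    Counter([{"a", "b"}])
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

# Without sorting, ('a', 'b') and ('b', 'a') are counted separately.
unsorted_counts = Counter(pairs)

# Sorting each pair first gives one canonical key per unordered pair.
sorted_counts = Counter(tuple(sorted(p)) for p in pairs)
```

Here `unsorted_counts` keeps two separate keys for the a/b pair, while `sorted_counts` merges them under ('a', 'b').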

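Quick and simple speedup: a comprehension instead of a loop

After rearranging, your nested loops can be replaced with the following comprehension: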

sets = [ sorted((a,b)) for lst in df['letters'] for i,a in enumerate(lst) for b in lst[i:] if not a == b ]

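Python optimizes the execution of comprehensions, so this already brings some speedup.

Bonus: if you combine it with Counter, you do not even need the result as a list. You can use a generator expression instead, which uses almost no extra memory because the pairs are never all stored at once: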

Counter( tuple(sorted((a, b))) for lst in lists for i,a in enumerate(lst) for b in lst[i:] if not a == b ) # note the lack of [ ] around the comprehension

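Evaluation: which approach is faster?

As usual when dealing with performance, the final answer has to come from testing the different approaches and picking the best one. Here I compare @yatu's (IMO very elegant and readable) itertools-based approach, the original nested for loop, and the comprehension. All tests run on the same sample data, randomly generated to look like the given example.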

from timeit import timeit

setup = '''
import numpy as np
import random
from collections import Counter
from itertools import combinations, chain
random.seed(42)
np.random.seed(42)

DF_SIZE = 50000 # make it big
MAX_LEN = 6
list_lengths = np.random.randint(1, MAX_LEN + 1, DF_SIZE)

letters = 'abcdefghijklmnopqrstuvwxyz'

lists = [ random.sample(letters, ln) for ln in list_lengths ] # roughly equivalent to df.letters.tolist()
'''

#################

comprehension = '''Counter( tuple(sorted((a, b))) for lst in lists for i,a in enumerate(lst) for b in lst[i:] if not a == b )'''
itertools = '''Counter(chain.from_iterable(combinations(sorted(i), r=2) for i in lists))'''
original_for_loop = '''
sets = []
for lst in lists:
    for i, a in enumerate(lst):
        for b in lst[i:]:
            if not a == b:
                sets.append(tuple(sorted((a, b))))
Counter(sets)
'''

print(f'Comprehension: {timeit(setup=setup, stmt=comprehension, number=10)}')
print(f'itertools: {timeit(setup=setup, stmt=itertools, number=10)}')
print(f'nested for: {timeit(setup=setup, stmt=original_for_loop, number=10)}')

Running the above code on my machine (Python 3.7) prints:

Comprehension: 1.6664735930098686
itertools: 0.5829475829959847
nested for: 1.751666523006861

So both suggested approaches improve on the nested for loop, but itertools is indeed faster in this case.

Answer 2

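For better performance, you can use itertools.combinations to get all length-2 combinations from the inner lists, and Counter to count the pairs in the flattened list.

Note that, in addition to taking all combinations from each sublist, sorting is a necessary step, because it ensures that the two elements of every tuple appear in the same order: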

from itertools import combinations, chain
from collections import Counter

l = df.letters.tolist()
t = chain.from_iterable(combinations(sorted(i), r=2) for i in l)

print(Counter(t))

Counter({('a', 'b'): 1,
         ('a', 'c'): 2,
         ('a', 'e'): 1,
         ('a', 'f'): 1,
         ('b', 'c'): 2,
         ('b', 'e'): 1,
         ('b', 'f'): 1,
         ('c', 'e'): 1,
         ('c', 'f'): 1,
         ('e', 'f'): 1,
         ('a', 'd'): 1,
         ('c', 'd'): 1,
         ('b', 'j'): 1,
         ('c', 'j'): 1})
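One difference from the loop in the question is worth a sketch: if a row contains the same item twice, combinations also emits a self-pair such as ('a', 'a'), which the original `if not a == b` check skipped. Assuming each unordered pair should count at most once per row (an assumption, not something the question states), de-duplicating each row with set() before sorting avoids this. The `rows` data below is hypothetical.

```python
from collections import Counter
from itertools import chain, combinations

# Hypothetical data: the first row contains 'a' twice.
rows = [["b", "a", "a"], ["a", "b", "c"]]

# Plain version: the repeated 'a' yields a self-pair and a doubled ('a', 'b').
plain = Counter(chain.from_iterable(combinations(sorted(r), 2) for r in rows))

# De-duplicated version: each unordered pair counts at most once per row.
dedup = Counter(chain.from_iterable(combinations(sorted(set(r)), 2) for r in rows))
```

If repeated items within a row are meant to add weight, the plain version with a filter on equal elements would be the closer match to the original loop.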

Answer 3

A numpy/scipy solution using a sparse incidence matrix:

from itertools import chain
import numpy as np
from scipy import sparse
from simple_benchmark import BenchmarkBuilder, MultiArgument

B = BenchmarkBuilder()

@B.add_function()
def pp(L):
    SZS = np.fromiter(chain((0,),map(len,L)),int,len(L)+1).cumsum()
    unq,idx = np.unique(np.concatenate(L),return_inverse=True)
    S = sparse.csr_matrix((np.ones(idx.size,int),idx,SZS),(len(L),len(unq)))
    SS = (S.T@S).tocoo()
    idx = (SS.col>SS.row).nonzero()
    return unq[SS.row[idx]],unq[SS.col[idx]],SS.data[idx] # left, right, count


from collections import Counter
from itertools import combinations

@B.add_function()
def yatu(L):
    return Counter(chain.from_iterable(combinations(sorted(i),r=2) for i in L))

@B.add_function()
def feature_engineer(L):
    # Return the counts so the function yields a result like the others.
    return Counter((min(nodes), max(nodes))
                   for row in L for nodes in combinations(row, 2))

from string import ascii_lowercase as ltrs

ltrs = np.array([*ltrs])

@B.add_arguments('array size')
def argument_provider():
    for exp in range(4, 30):
        n = int(1.4**exp)
        L = [ltrs[np.maximum(0,np.random.randint(-2,2,26)).astype(bool).tolist()] for _ in range(n)]
        yield n,L

r = B.run()
r.plot()

We see that the approach presented here (pp) comes with the typical numpy constant overhead, but from around 100 sublists onward it starts to win.

OP's example:

import pandas as pd

df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})
pd.DataFrame(dict(zip(["left", "right", "count"],pp(df['letters']))))

Prints:

   left right  count
0     a     b      1
1     a     c      2
2     b     c      2
3     c     d      1
4     a     d      1
5     c     e      1
6     a     e      1
7     b     e      1
8     c     f      1
9     e     f      1
10    a     f      1
11    b     f      1
12    b     j      1
13    c     j      1
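To see why the matrix product counts co-occurrences, here is a small dense sketch of the same idea on the OP's data. It is an illustration only, not the pp function above: it builds the incidence matrix with plain loops and a dense array, which would not scale the way the sparse version does. The names `items`, `col`, and `pairs` are introduced here for illustration.

```python
import numpy as np

rows = [["b", "a", "e", "f", "c"], ["a", "c", "d"], ["c", "b", "j"]]

# Map each distinct item to a column index.
items = sorted({x for row in rows for x in row})
col = {item: j for j, item in enumerate(items)}

# Dense incidence matrix: S[i, j] == 1 when row i contains item j.
S = np.zeros((len(rows), len(items)), dtype=int)
for i, row in enumerate(rows):
    for x in row:
        S[i, col[x]] = 1

# (S.T @ S)[j, k] counts the rows that contain both item j and item k.
C = S.T @ S

# Keep only the strict upper triangle so each unordered pair appears once.
left, right = np.nonzero(np.triu(C, k=1))
pairs = {(items[j], items[k]): int(C[j, k]) for j, k in zip(left, right)}
```

The diagonal of `C` holds the number of rows each item appears in, which is why only entries above the diagonal are kept, matching the `SS.col>SS.row` filter in pp.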

Answer 4

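Notes for improving efficiency:

Do not store the pairs in sets, which are memory hogs and require expensive computation to add elements; instead use tuples with the smallest element first.
To generate the combinations quickly, use itertools.combinations.
To count the combinations, use collections.Counter.
Optionally, convert the counts to a DataFrame.

Here is an example implementation: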

from collections import Counter
from itertools import combinations

data = df.letters.tolist()

#    data = [['b', 'a', 'e', 'f', 'c'],
#            ['a', 'c', 'd'],
#            ['c', 'b', 'j']]

counts = Counter((min(nodes), max(nodes)) for row in data for nodes in combinations(row, 2))
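The optional last step, converting the counts to a DataFrame, could look like the sketch below. The column names Node1, Node2 and weight are taken from the output shown in the question; the variable name `edges` and the sort by weight are choices made here for illustration.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

data = [["b", "a", "e", "f", "c"], ["a", "c", "d"], ["c", "b", "j"]]
counts = Counter((min(nodes), max(nodes)) for row in data for nodes in combinations(row, 2))

# One row per pair; column names follow the question's desired output.
edges = pd.DataFrame(
    [(a, b, w) for (a, b), w in counts.items()],
    columns=["Node1", "Node2", "weight"],
)

# Put the heaviest pairs first and renumber the index.
edges = edges.sort_values("weight", ascending=False).reset_index(drop=True)
```

Building the frame from a list of (Node1, Node2, weight) tuples avoids the separate split-and-concat steps used in the question.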
