
Is there a way to vectorize counting items' co-occurrences in pandas/numpy?

I frequently need to generate network graphs based on the co-occurrences of items in a column. I start off with something like this:

           letters
0  [b, a, e, f, c]
1        [a, c, d]
2        [c, b, j]

In the following example, I want to make a table of all pairs of letters, and then have a "weight" column, which describes how many times each two-letter pair appeared in the same row together (see the bottom for an example).

I am currently doing large parts of it using a for loop, and I was wondering if there is a way to vectorize it, as I often deal with enormous datasets that take an extremely long time to process this way. I am also concerned about keeping things within memory limits. This is my code right now:

import pandas as pd

# Make some data
df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})

# I make a list of sets, which contain pairs of all the elements
# that co-occur in the data in the same list
sets = []
for lst in df['letters']:
    for i, a in enumerate(lst):
        for b in lst[i:]:
            if not a == b:
                sets.append({a, b})

# Sets now looks like:
# [{'a', 'b'},
#  {'b', 'e'},
#  {'b', 'f'},...

# Dataframe with one column containing the sets
df = pd.DataFrame({'weight': sets})

# We count how many times each pair occurs together
df = df['weight'].value_counts().reset_index()

# Split the sets into two separate columns
split = pd.DataFrame(df['index'].values.tolist()) \
          .rename(columns = lambda x: f'Node{x+1}') \
          .fillna('-')

# Merge the 'weight' column back onto the dataframe
df = pd.concat([df['weight'], split], axis = 1)

print(df.head())

# Output:
   weight Node1 Node2
0       2     c     b
1       2     a     c
2       1     f     e
3       1     d     c
4       1     j     b

Notes:

As suggested in the other answers, use collections.Counter for the counting. Since it behaves like a dict, though, it needs hashable types. {a, b} is not hashable, because it's a set. Replacing it with a tuple fixes the hashability problem, but introduces possible duplicates (e.g., ('a', 'b') and ('b', 'a')). To fix this issue, just sort the tuple.

Since sorted returns a list, we need to turn that back into a tuple: tuple(sorted((a, b))). A bit cumbersome, but convenient in combination with Counter.
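
A minimal sketch of the idea (the sample pairs are illustrative):

from collections import Counter

# Sorted tuples are hashable and order-independent, so ('b', 'a')
# and ('a', 'b') collapse into the same Counter key.
pairs = [tuple(sorted(p)) for p in [('b', 'a'), ('a', 'b'), ('c', 'b')]]
print(Counter(pairs))  # Counter({('a', 'b'): 2, ('b', 'c'): 1})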

Quick and easy speedup: Comprehensions instead of loops

When rearranged, your nested loops can be replaced with the following comprehension:

sets = [ sorted((a,b)) for lst in df['letters'] for i,a in enumerate(lst) for b in lst[i:] if not a == b ]

Python has optimizations in place for comprehension execution, so this will already bring some speedup.

Bonus: If you combine it with Counter, you don't even need the result as a list, but can instead use a generator expression (almost no extra memory is used, since the pairs are never all stored at once):

Counter( tuple(sorted((a, b))) for lst in lists for i,a in enumerate(lst) for b in lst[i:] if not a == b ) # note the lack of [ ] around the comprehension
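
To get back to the tabular layout from the question, here is a small sketch (assuming df is the original letters frame from the question; the column names and the sort are illustrative):

import pandas as pd
from collections import Counter

# Rebuild the Counter from the generator expression above
counts = Counter(
    tuple(sorted((a, b)))
    for lst in df['letters']
    for i, a in enumerate(lst)
    for b in lst[i:]
    if not a == b
)

# Lay it out as the weight/Node1/Node2 table from the question
edges = pd.DataFrame([(w, a, b) for (a, b), w in counts.items()],
                     columns=['weight', 'Node1', 'Node2'])
print(edges.sort_values('weight', ascending=False).reset_index(drop=True))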

Evaluation: What is the faster approach?

As usual when dealing with performance, the final answer must come from testing different approaches and choosing the best one. Here I compare the (IMO very elegant and readable) itertools-based approach by @yatu, the original nested for loops, and the comprehension. All tests run on the same sample data, randomly generated to look like the given example.

from timeit import timeit

setup = '''
import numpy as np
import random
from collections import Counter
from itertools import combinations, chain
random.seed(42)
np.random.seed(42)

DF_SIZE = 50000 # make it big
MAX_LEN = 6
list_lengths = np.random.randint(1, MAX_LEN + 1, DF_SIZE)

letters = 'abcdefghijklmnopqrstuvwxyz'

lists = [ random.sample(letters, ln) for ln in list_lengths ] # roughly equivalent to df.letters.tolist()
'''

#################

comprehension = '''Counter( tuple(sorted((a, b))) for lst in lists for i,a in enumerate(lst) for b in lst[i:] if not a == b )'''
itertools = '''Counter(chain.from_iterable(combinations(sorted(i), r=2) for i in lists))'''
original_for_loop = '''
sets = []
for lst in lists:
    for i, a in enumerate(lst):
        for b in lst[i:]:
            if not a == b:
                sets.append(tuple(sorted((a, b))))
Counter(sets)
'''

print(f'Comprehension: {timeit(setup=setup, stmt=comprehension, number=10)}')
print(f'itertools: {timeit(setup=setup, stmt=itertools, number=10)}')
print(f'nested for: {timeit(setup=setup, stmt=original_for_loop, number=10)}')

Running the code above on my machine (Python 3.7) prints:

Comprehension: 1.6664735930098686
itertools: 0.5829475829959847
nested for: 1.751666523006861

So, both suggested approaches improve over the nested for loops, but itertools is indeed faster in this case.

For a performance improvement you could use itertools.combinations to get all length-2 combinations from the inner lists, and Counter to count the pairs in a flattened list.

Note that in addition to obtaining all combinations from each sublist, sorting is a necessary step, since it ensures that each pair of letters always appears in the same order:

from itertools import combinations, chain
from collections import Counter

l = df.letters.tolist()
t = chain.from_iterable(combinations(sorted(i), r=2) for i in l)

print(Counter(t))

Counter({('a', 'b'): 1,
         ('a', 'c'): 2,
         ('a', 'e'): 1,
         ('a', 'f'): 1,
         ('b', 'c'): 2,
         ('b', 'e'): 1,
         ('b', 'f'): 1,
         ('c', 'e'): 1,
         ('c', 'f'): 1,
         ('e', 'f'): 1,
         ('a', 'd'): 1,
         ('c', 'd'): 1,
         ('b', 'j'): 1,
         ('c', 'j'): 1})

A numpy/scipy solution using sparse incidence matrices:

from itertools import chain
import numpy as np
from scipy import sparse
from simple_benchmark import BenchmarkBuilder, MultiArgument

B = BenchmarkBuilder()

@B.add_function()
def pp(L):
    # CSR row pointers: cumulative sublist lengths, prefixed with 0
    SZS = np.fromiter(chain((0,), map(len, L)), int, len(L) + 1).cumsum()
    # Unique letters, plus the column index of every single occurrence
    unq, idx = np.unique(np.concatenate(L), return_inverse=True)
    # Sparse incidence matrix: S[i, j] == 1 iff letter j is in sublist i
    S = sparse.csr_matrix((np.ones(idx.size, int), idx, SZS), (len(L), len(unq)))
    # The Gram matrix S.T @ S counts co-occurrences of every letter pair
    SS = (S.T @ S).tocoo()
    # Keep the strict upper triangle so each unordered pair appears once
    idx = (SS.col > SS.row).nonzero()
    return unq[SS.row[idx]], unq[SS.col[idx]], SS.data[idx]  # left, right, count


from collections import Counter
from itertools import combinations

@B.add_function()
def yatu(L):
    return Counter(chain.from_iterable(combinations(sorted(i),r=2) for i in L))

@B.add_function()
def feature_engineer(L):
    return Counter((min(nodes), max(nodes))
                   for row in L for nodes in combinations(row, 2))

from string import ascii_lowercase as ltrs

ltrs = np.array([*ltrs])

@B.add_arguments('array size')
def argument_provider():
    for exp in range(4, 30):
        n = int(1.4**exp)
        L = [ltrs[np.maximum(0,np.random.randint(-2,2,26)).astype(bool).tolist()] for _ in range(n)]
        yield n,L

r = B.run()
r.plot()

[Benchmark plot: runtime versus number of sublists for pp, yatu and feature_engineer]

We see that the method presented here (pp) comes with the typical numpy constant overhead, but from roughly 100 sublists onward it starts winning.

OP's example:

import pandas as pd

df = pd.DataFrame({'letters': [['b','a','e','f','c'],['a','c','d'],['c','b','j']]})
pd.DataFrame(dict(zip(["left", "right", "count"],pp(df['letters']))))

Prints:

   left right  count
0     a     b      1
1     a     c      2
2     b     c      2
3     c     d      1
4     a     d      1
5     c     e      1
6     a     e      1
7     b     e      1
8     c     f      1
9     e     f      1
10    a     f      1
11    b     f      1
12    b     j      1
13    c     j      1

Notes to improve efficiency:

  1. Instead of storing the pairs in sets, which are memory hogs and require expensive computation when adding elements, use a tuple whose first element is the smallest.

  2. To calculate the combinations quickly, use itertools.combinations.

  3. To count the combinations, use collections.Counter.

  4. Optionally, convert the counts to a DataFrame (a sketch follows the example below).

Here's an example implementation:

from collections import Counter
from itertools import combinations

data = df.letters.tolist()

#    data = [['b', 'a', 'e', 'f', 'c'],
#            ['a', 'c', 'd'],
#            ['c', 'b', 'j']]

counts = Counter((min(nodes), max(nodes)) for row in data for nodes in combinations(row, 2))
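
And for step 4, a minimal sketch of the optional DataFrame conversion (the column names are illustrative):

import pandas as pd

# Optional step 4: lay the Counter out as a table of pairs and counts
result = pd.DataFrame([(a, b, w) for (a, b), w in counts.items()],
                      columns=['Node1', 'Node2', 'weight'])
print(result)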
