简体   繁体   中英

Any alternate algorithms/packages to calculate co-occurrence matrix in Python?


I'm trying to calculate the co-occurrence matrix for a large corpus but it takes a very long time(+6hours). is there any faster ways or packages to calculate it faster?

My approach

consider this array as the corpus and each element of the corpus as context :

 corpus = [ 'where python is used', 'what is python used in', 'why python is best', 'what companies use python' ]

Algorithm

words = list(set(' '.join(corpus).split(' '))) c_matrix = np.zeros((len(words), len(words)), dtype='int') for context in corpus: context = context.split(' ') for i in range(len(context)): for j in range(i + 1, len(context)): row = words.index(context[i]) column = words.index(context[j]) c_matrix[row][column] += 1

The provided algorithm is not efficient because it recompute words.index(...) a lot of time. You can pre-compute the indices first and then build the matrix. Here is a significantly better solution:

words = list(set(' '.join(corpus).split(' ')))
c_matrix = np.zeros((len(words), len(words)), dtype='int')

for context in corpus:
    context = context.split(' ')
    index = [words.index(item) for item in context]
    for i in range(len(context)):
        for j in range(i + 1, len(context)):
            c_matrix[index[i]][index[j]] += 1

Moreover, you can transform index to a Numpy array and use Numba (or Cython) to build the c_matrix very quickly from index .

Finally, you can transform words to a dictionary (with the string in the current list as the dictionary keys and the index in the current list as the dictionary values) so that indexing will be much faster (constant-time fetch).

The resulting algorithm should be several order of magnitude faster . If this is not enough, then you probably need to replace the matrix c_matrix with a more advanced (but also much more complex) sparse data-structure regarding your needs.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM