
Python Co-occurrence matrix of words and phrases

I'm working with two text files. One contains a list of 58 words (L1), and the other contains 1173 phrases (L2). For every pair of words L1[i], L1[j], I want to check their co-occurrence in the phrases of L2.

For example:

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']

for i in range(len(L1)):
    for j in range(len(L1)):
        for s in range(len(L2)):
            if L1[i] in L2[s] and L1[j] in L2[s]:
                output = L1[i], L1[j], L2[s]
                print(output)

Output (for example, 'be your self' from L2):

('b', 'b', 'be your self')
('b', 'e', 'be your self')
('b', 'y', 'be your self')
('e', 'b', 'be your self')
('e', 'e', 'be your self')
('e', 'y', 'be your self')
('y', 'b', 'be your self')
('y', 'e', 'be your self')
('y', 'y', 'be your self')

The output shows what I want, but in order to visualize the data, I also need to return the number of times L1[j] co-occurs with L1[i].

For example:

  b e y
b 1 1 1
e 1 2 1
y 1 1 1

Should I use pandas or numpy to produce this result?

I found this question about co-occurrence matrices, but it has no specific answer: efficient algorithm for finding co occurrence matrix of phrases

Thanks!

Here's a solution that uses itertools.product. It should time significantly better than the accepted solution (if that's a concern).

from functools import reduce  # in Python 2, reduce is a builtin and this import is unnecessary
from itertools import product
from operator import mul

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']

phrase_map = {}

for phrase in L2:
    # How many times each word from L1 occurs in this phrase.
    word_count = {word: phrase.count(word) for word in L1 if word in phrase}

    occurrence_map = {}
    for perm in product(word_count, repeat=2):
        # The co-occurrence count of a pair is the product of the individual counts.
        occurrence_map[perm] = reduce(mul, (word_count[key] for key in perm), 1)

    phrase_map[phrase] = occurrence_map

From my timings, this is 2-4 times faster in Python 3 (the improvement is probably smaller in Python 2). Also, in Python 3 you need to import reduce from functools, as done above.
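The timing claim can be checked with timeit by wrapping the loop in a function (a sketch; the absolute numbers will vary by machine, and any alternative implementation can be timed the same way for comparison):

```python
from functools import reduce
from itertools import product
from operator import mul
from timeit import timeit

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']

def build_phrase_map():
    """Build the per-phrase co-occurrence maps as in the answer above."""
    phrase_map = {}
    for phrase in L2:
        word_count = {word: phrase.count(word) for word in L1 if word in phrase}
        occurrence_map = {}
        for perm in product(word_count, repeat=2):
            occurrence_map[perm] = reduce(mul, (word_count[key] for key in perm), 1)
        phrase_map[phrase] = occurrence_map
    return phrase_map

# Time 2000 runs of the whole construction.
print(timeit(build_phrase_map, number=2000))
```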

Edit: Note that, while this implementation is relatively simple, it has obvious inefficiencies. For example, we know that the corresponding output will be symmetric, and this solution does not exploit that. Using combinations_with_replacement instead of product generates only the entries in the upper triangular part of your output matrix. Thus, we can improve on the above solution by doing:

from itertools import combinations_with_replacement

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']

phrase_map = {}

for phrase in L2:
    word_count = {word: phrase.count(word) for word in L1 if word in phrase}

    occurrence_map = {}
    # Generate only pairs (x, y) with x <= y, then mirror them into (y, x).
    for x, y in combinations_with_replacement(word_count, 2):
        occurrence_map[(x, y)] = occurrence_map[(y, x)] = \
            word_count[x] * word_count[y]

    phrase_map[phrase] = occurrence_map

As expected, this version takes half as long as the previous one. Note that this version relies on restricting yourself to pairs of two elements, while the previous version did not.

Note that around 15-20% of the running time can be cut if the line

 occurrence_map[(x,y)] = occurrence_map[(y,x)] = ...

is changed to

occurrence_map[(x,y)] = ...

but this may be less than ideal depending on how you use the mapping later.
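To get from these per-phrase dictionaries to the single matrix the question asks about, the pair products can be summed across all phrases and reshaped with pandas (a sketch using this answer's product-based counting; the labeled DataFrame layout is an assumption about the desired output):

```python
from collections import Counter
from itertools import product

import pandas as pd

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day']

# Sum the co-occurrence products over all phrases into one Counter.
totals = Counter()
for phrase in L2:
    word_count = {word: phrase.count(word) for word in L1 if word in phrase}
    for x, y in product(word_count, repeat=2):
        totals[(x, y)] += word_count[x] * word_count[y]

# Reshape the flat (row, col) -> count mapping into a square DataFrame,
# keeping only the words that actually occur somewhere.
present = sorted({word for pair in totals for word in pair})
matrix = pd.DataFrame(0, index=present, columns=present)
for (x, y), n in totals.items():
    matrix.loc[x, y] = n

print(matrix)
```

Because the product of two counts is symmetric in x and y, the resulting DataFrame is symmetric, and `matrix.loc['b', 'e']` gives the aggregated co-occurrence count for that pair.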

OK, why don't you try this?

from collections import defaultdict

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day', 'yes be your self']

d = dict.fromkeys(L2)

for phrase in L2:
    d[phrase] = defaultdict(int)
    for letter1 in phrase:
        for letter2 in phrase:
            if letter1 in L1 and letter2 in L1:
                output = letter1, letter2, phrase
                print(output)
                key = (letter1, letter2)
                d[phrase][key] += 1

print(d)

To catch the duplicate values you need to traverse the phrase, not the list L1, and then check whether each letter of the phrase is in L1 (in other words, swap the operands of the in expression around).

Output:

{
'x men': defaultdict(<class 'int'>, {('e', 'e'): 1, ('e', 'x'): 1, ('x', 'x'): 1, ('x', 'e'): 1}),
'great zoo': defaultdict(<class 'int'>, {('t', 't'): 1, ('t', 'z'): 1, ('e', 'e'): 1, ('e', 'z'): 1, ('t', 'e'): 1, ('z', 'e'): 1, ('z', 't'): 1, ('e', 't'): 1, ('z', 'z'): 1}),
'the onion': defaultdict(<class 'int'>, {('e', 't'): 1, ('t', 'e'): 1, ('e', 'e'): 1, ('t', 't'): 1}),
'be your self': defaultdict(<class 'int'>, {('b', 'y'): 1, ('b', 'b'): 1, ('e', 'e'): 4, ('y', 'e'): 2, ('y', 'b'): 1, ('y', 'y'): 1, ('e', 'b'): 2, ('e', 'y'): 2, ('b', 'e'): 2}),
'corn day': defaultdict(<class 'int'>, {('d', 'd'): 1, ('y', 'd'): 1, ('d', 'y'): 1, ('y', 'y'): 1, ('y', 'c'): 1, ('c', 'c'): 1, ('c', 'y'): 1, ('c', 'd'): 1, ('d', 'c'): 1}),
'yes be your self': defaultdict(<class 'int'>, {('b', 'y'): 2, ('b', 'b'): 1, ('e', 'e'): 9, ('y', 'e'): 6, ('y', 'b'): 2, ('y', 'y'): 4, ('e', 'b'): 3, ('e', 'y'): 6, ('b', 'e'): 3})
}
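If one aggregate table over all phrases is wanted rather than per-phrase dictionaries, the defaultdicts built by this answer can be merged with collections.Counter (a sketch reusing the same counting loop):

```python
from collections import Counter, defaultdict

L1 = ['b', 'c', 'd', 'e', 't', 'w', 'x', 'y', 'z']
L2 = ['the onion', 'be your self', 'great zoo', 'x men', 'corn day', 'yes be your self']

# Build the same per-phrase pair counts as in the answer above.
d = {}
for phrase in L2:
    d[phrase] = defaultdict(int)
    for letter1 in phrase:
        for letter2 in phrase:
            if letter1 in L1 and letter2 in L1:
                d[phrase][(letter1, letter2)] += 1

# Merge the per-phrase maps into one overall co-occurrence counter.
totals = Counter()
for phrase_counts in d.values():
    totals.update(phrase_counts)

print(totals[('e', 'e')])
```

Counter.update adds the counts key by key, so `totals[('e', 'e')]` is the sum of the ('e', 'e') entries over all six phrases.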

You can try the code below.

import collections
import numpy as np

tokens = ['He', 'is', 'not', 'lazy', 'intelligent', 'smart']
sentences = [['He', 'is', 'not', 'lazy', 'He', 'is', 'intelligent', 'He', 'is', 'smart']]

j = 0
a = np.zeros((len(tokens), len(tokens)))
for pos, token in enumerate(tokens):
    j += pos + 1
    for token1 in tokens[pos + 1:]:
        count = 0
        for sentence in sentences:
            # Positions of each of the two tokens in the sentence.
            occurrences1 = [i for i, e in enumerate(sentence) if e == token1]
            occurrences2 = [i for i, e in enumerate(sentence) if e == token]
            # Pairwise absolute distances between the two position lists.
            new1 = np.repeat(occurrences1, len(occurrences2))
            new2 = np.asarray(occurrences2 * len(occurrences1))
            final_abs_diff = np.absolute(np.subtract(new1, new2))
            final_counts = collections.Counter(final_abs_diff)
            # Count the pairs that are at most two positions apart.
            count = final_counts[0] + final_counts[1] + final_counts[2]
        a[pos][j] = count
        j += 1
    j = 0

# Mirror the upper triangle to get a symmetric matrix.
final_mat = a.T + a
print(final_mat)

Output:

[[0. 4. 2. 1. 2. 1.]
 [4. 0. 1. 2. 2. 1.]
 [2. 1. 0. 1. 0. 0.]
 [1. 2. 1. 0. 0. 0.]
 [2. 2. 0. 0. 0. 0.]
 [1. 1. 0. 0. 0. 0.]]
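For reference, the nested loops above count a pair of tokens as co-occurring whenever they appear at most two positions apart in the sentence. Under that assumption, the same counts can be sketched more compactly by comparing positions pairwise (pairs are stored as sorted tuples, matching one symmetric entry of the matrix):

```python
from collections import Counter
from itertools import combinations

sentence = ['He', 'is', 'not', 'lazy', 'He', 'is', 'intelligent', 'He', 'is', 'smart']
WINDOW = 2  # two tokens co-occur when at most 2 positions apart

counts = Counter()
for i, j in combinations(range(len(sentence)), 2):
    # combinations guarantees i < j, so j - i is the distance.
    if j - i <= WINDOW and sentence[i] != sentence[j]:
        pair = tuple(sorted((sentence[i], sentence[j])))
        counts[pair] += 1

print(counts[('He', 'is')])
```

Each off-diagonal entry of `final_mat` above should match the corresponding `counts` entry, e.g. `counts[('He', 'is')]` matches the 4 in the first row.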
