Counting bigram frequencies in Python

Assume that I have data that looks like this:

['<s>', 'I' , '<s>', 'I', 'UNK', '</s>']

I would like to get the number of bigrams that occur only once, so

n1 == ('I', '<s>'), ('I', 'UNK'), ('UNK', '</s>')
len(n1) == 3 

and the number of bigrams that occur twice:

n2 == ('<s>', 'I')
len(n2) == 1

I am thinking of storing the first word as sen[i] and the next word as sen[i + 1], but I am not sure if this is the right approach.

Considering your list:

lis = ['<s>', 'I' , '<s>', 'I', 'UNK', '</s>']

Loop over the list to build each bigram as a tuple of adjacent items, and accumulate the frequencies in a dictionary like this:

bigram_freq = {}
length = len(lis)
for i in range(length-1):
    bigram = (lis[i], lis[i+1])
    if bigram not in bigram_freq:
        bigram_freq[bigram] = 0
    bigram_freq[bigram] += 1

Now count how many bigrams have frequency 1 and how many have frequency 2, like this:

bigrams_with_frequency_one = 0
bigrams_with_frequency_two = 0
for bigram in bigram_freq:
    if bigram_freq[bigram] == 1:
        bigrams_with_frequency_one += 1
    elif bigram_freq[bigram] == 2:
        bigrams_with_frequency_two += 1

After the loop, bigrams_with_frequency_one and bigrams_with_frequency_two hold your results. I hope it helps!
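The same counts can be obtained more compactly with `collections.Counter` from the standard library. The following is a minimal sketch, not part of the original answer; the names `pair_counts` and `count_of_counts` are chosen here purely for illustration.

```python
from collections import Counter

lis = ['<s>', 'I', '<s>', 'I', 'UNK', '</s>']

# zip(lis, lis[1:]) pairs each token with its successor, yielding the bigrams.
pair_counts = Counter(zip(lis, lis[1:]))

# Count how many distinct bigrams share each frequency.
count_of_counts = Counter(pair_counts.values())

bigrams_with_frequency_one = count_of_counts[1]
bigrams_with_frequency_two = count_of_counts[2]
print(bigrams_with_frequency_one, bigrams_with_frequency_two)  # 3 1
```

A `Counter` returns 0 for a missing key, so `count_of_counts[2]` is safe even when no bigram occurs twice.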

You can try this:

my_list = ['<s>', 'I' , '<s>', 'I', 'UNK', '</s>']

bigrams = [(my_list[i-1], my_list[i]) for i in range(1, len(my_list))]
print(bigrams)
# [('<s>', 'I'), ('I', '<s>'), ('<s>', 'I'), ('I', 'UNK'), ('UNK', '</s>')]

d = {}

for c in set(bigrams):
    count = bigrams.count(c)
    d.setdefault(count, []).append(c)

print(d)
# e.g. {1: [('I', '<s>'), ('UNK', '</s>'), ('I', 'UNK')], 2: [('<s>', 'I')]}
# (set iteration order is arbitrary, so the ordering may differ between runs)
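Note that `bigrams.count(c)` rescans the whole list once per distinct bigram, which can become quadratic on long inputs. A single-pass variant may be preferable there. The sketch below is one possible way to do it, using `collections.Counter` and `collections.defaultdict`; the name `by_count` is illustrative and does not appear in the original answer.

```python
from collections import Counter, defaultdict

my_list = ['<s>', 'I', '<s>', 'I', 'UNK', '</s>']

# Count every adjacent pair in one pass over the list.
counts = Counter(zip(my_list, my_list[1:]))

# Group the distinct bigrams by how often they occur.
by_count = defaultdict(list)
for bigram, count in counts.items():
    by_count[count].append(bigram)

print(len(by_count[1]), len(by_count[2]))  # 3 1
```

Here `len(by_count[1])` and `len(by_count[2])` correspond to the `len(n1)` and `len(n2)` values asked for in the question.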
