
Counting the number of occurrences of a word in an array of arrays without knowing the word in Python

I'm a newbie in Python programming and I hope someone can help me.

I have to print the first ten bigrams of a corpus in this form:

((token),(POS_tag),(token),(POS_tag))

where the number of occurrences of each token must be greater than 2.

So I've built a list of POS-tagged tokens and paired them into bigrams with bigrams().

How can I check whether the number of occurrences of each word (the token in each pair) is greater than 2?
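For the occurrence check itself, here is a minimal sketch of the kind of filtering being asked about, assuming NLTK is available (the sample sentence is made up): count each token's frequency with nltk.FreqDist, then keep only the tagged bigrams whose two tokens each appear more than twice.

import nltk

text = "the cat sat on the mat , the cat saw the dog , and the cat ran ."
tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)                # [(token, POS_tag), ...]
tagged_bigrams = list(nltk.bigrams(tagged))  # pairs of (token, POS_tag) tuples

freq = nltk.FreqDist(tokens)                 # occurrence count per token

# Keep only bigrams whose two tokens each occur more than twice,
# flattened into the (token, POS_tag, token, POS_tag) form
frequent = [(t1, p1, t2, p2)
            for (t1, p1), (t2, p2) in tagged_bigrams
            if freq[t1] > 2 and freq[t2] > 2]
print(frequent[:10])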

Your question is vague for various reasons. For one, the title could really be worded much better, and you don't explain very well what you want to do. Also, by "first ten bigrams", do you literally mean the first ten bigrams in the text, or the ten most frequent ones? I assumed the latter, but if not, just remove the sorting and limit your text to the first eleven words.

from nltk.util import bigrams
from nltk import tokenize, pos_tag
from collections import defaultdict

counts = defaultdict(int)
counts_pos = defaultdict(int)

with open('twocities.txt') as f:
    txt = f.read().lower()
    txt = tokenize.word_tokenize(txt)

    # Generate the lexical bigrams
    bg = bigrams(txt)

    # Do part-of-speech tagging and generate 
    # lexical+pos bigrams
    pos = pos_tag(txt)
    bg_pos = bigrams(pos)

    # Count the number of occurrences of each unique bigram
    for bigram in bg:
        counts[bigram] += 1

    for bigram in bg_pos:
        counts_pos[bigram] += 1

# Make a list of bigrams sorted on number of occurrences
sortedbigrams = sorted(counts, key=lambda x: counts[x], reverse=True)
sortedbigrams_pos = sorted(counts_pos, key=lambda x: counts_pos[x], reverse=True)

# Remove bigrams that do not occur more than the given threshold
print 'Number of bigrams before thresholding: %i, %i' % \
       (len(sortedbigrams), len(sortedbigrams_pos))

min_occurrence = 2

sortedbigrams = [x for x in sortedbigrams if counts[x] > min_occurrence]
sortedbigrams_pos = [x for x in sortedbigrams_pos if
                     counts_pos[x] > min_occurrence]
print 'Number of bigrams after thresholding: %i, %i\n' % \
       (len(sortedbigrams), len(sortedbigrams_pos))

# print results
print 'Top 10 lexical bigrams:'
for i in range(10):
    print sortedbigrams[i], counts[sortedbigrams[i]]

print '\nTop 10 lexical+pos bigrams:'
for i in range(10):
    print sortedbigrams_pos[i], counts_pos[sortedbigrams_pos[i]]

My nltk installation is only for Python 2.6; if I had it installed on 2.7, I'd use a Counter instead of a defaultdict.
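For reference, a minimal sketch of what that counting step would look like with collections.Counter on Python 2.7+, reusing the twocities.txt file from the script above:

from collections import Counter
from nltk import pos_tag, tokenize
from nltk.util import bigrams

with open('twocities.txt') as f:
    txt = tokenize.word_tokenize(f.read().lower())

# Counter consumes the bigram sequences directly, replacing the
# defaultdict(int) counting loops above
counts = Counter(bigrams(txt))
counts_pos = Counter(bigrams(pos_tag(txt)))

# most_common(10) returns the ten most frequent bigrams,
# sorted in descending order of count
print(counts.most_common(10))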

Using this script on the first page of A Tale of Two Cities, I get the following output:

Top 10 lexical bigrams:
(',', 'and') 17
('it', 'was') 12
('of', 'the') 11
('in', 'the') 11
('was', 'the') 11
(',', 'it') 9
('and', 'the') 6
('with', 'a') 6
('on', 'the') 5
(',', 'we') 4

Top 10 lexical+pos bigrams:
((',', ','), ('and', 'CC')) 17
(('it', 'PRP'), ('was', 'VBD')) 12
(('in', 'IN'), ('the', 'DT')) 11
(('was', 'VBD'), ('the', 'DT')) 11
(('of', 'IN'), ('the', 'DT')) 11
((',', ','), ('it', 'PRP')) 9
(('and', 'CC'), ('the', 'DT')) 6
(('with', 'IN'), ('a', 'DT')) 6
(('on', 'IN'), ('the', 'DT')) 5
(('and', 'CC'), ('a', 'DT')) 4

I assumed you meant the first ten bigrams, and I excluded bigrams where one of the tokens is punctuation.

import nltk, collections, string
import nltk.book

def bigrams_by_word_freq(tokens, min_freq=3):
    # Order-preserving de-duplication
    # (http://www.peterbe.com/plog/uniqifiers-benchmark)
    def unique(seq):
        seen = set()
        seen_add = seen.add
        return [x for x in seq if x not in seen and not seen_add(x)]

    punct = set(string.punctuation)
    bigrams = unique(nltk.bigrams(tokens))
    pos = dict(nltk.pos_tag(tokens))      # token -> POS tag
    count = collections.Counter(tokens)   # token -> frequency

    # Drop bigrams containing punctuation or tokens rarer than min_freq
    bigrams = filter(lambda (a, b): not punct.intersection({a, b})
                     and count[a] >= min_freq and count[b] >= min_freq,
                     bigrams)

    return tuple((a, pos[a], b, pos[b]) for a, b in bigrams)



text = """Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again."""

print bigrams_by_word_freq(nltk.wordpunct_tokenize(text), min_freq=2)

print bigrams_by_word_freq(nltk.book.text6)[:10]

Output:

(('Humpty', 'NNP', 'Dumpty', 'NNP'), ('the', 'DT', 'king', 'NN'))
(('SCENE', 'NNP', '1', 'CD'), ('clop', 'NN', 'clop', 'NN'), ('It', 'PRP', 'is', 'VBZ'), ('is', 'VBZ', 'I', 'PRP'), ('son', 'NN', 'of', 'IN'), ('from', 'IN', 'the', 'DT'), ('the', 'DT', 'castle', 'NN'), ('castle', 'NN', 'of', 'IN'), ('of', 'IN', 'Camelot', 'NNP'), ('King', 'NNP', 'of', 'IN'))
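One caveat: the tuple-unpacking lambda (lambda (a,b): ...) is Python 2-only syntax, and print needs parentheses on Python 3. Purely as a sketch, the same function in a form that runs on both versions:

import collections
import string

import nltk

def bigrams_by_word_freq(tokens, min_freq=3):
    punct = set(string.punctuation)
    seen = set()
    # Order-preserving de-duplication of the bigrams
    uniq = [bg for bg in nltk.bigrams(tokens)
            if not (bg in seen or seen.add(bg))]
    pos = dict(nltk.pos_tag(tokens))      # token -> POS tag
    count = collections.Counter(tokens)   # token -> frequency
    # Drop bigrams containing punctuation or tokens rarer than min_freq
    return tuple((a, pos[a], b, pos[b]) for a, b in uniq
                 if not punct.intersection({a, b})
                 and count[a] >= min_freq and count[b] >= min_freq)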
