Counting the number of occurrences of a word in an array of arrays without knowing the word in Python
I'm new to Python programming and hope you can help me.
I have to print the top ten bigrams of a corpus in this form:
((token),(POS_tag),(token),(POS_tag))
Each token must occur more than 2 times.
So far I have built the list of POS-tagged tokens and paired them up with bigrams().
How can I check whether the number of occurrences of each word (the word corresponding to the tag in each word pair) is greater than 2?
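The frequency check the question asks for can be done with `collections.Counter`: count each individual token once, then keep only those bigrams whose two tokens are both frequent enough. A minimal sketch with a hypothetical toy token list standing in for the tagged corpus:

```python
from collections import Counter

# Hypothetical toy token list standing in for the real corpus
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the',
          'cat', 'saw', 'the', 'cat']

# Adjacent pairs of tokens form the lexical bigrams
bigrams = list(zip(tokens, tokens[1:]))

# Count how often each individual token occurs
counts = Counter(tokens)

# Keep only bigrams whose tokens both occur more than twice
frequent = [(a, b) for a, b in bigrams if counts[a] > 2 and counts[b] > 2]
print(frequent)  # only ('the', 'cat') survives the threshold
```

The same filter works unchanged on POS-tagged pairs: index `counts` by the token part of each `(token, tag)` tuple.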
Your question is vague for several reasons. For a start, the title could be worded much better; you don't explain very well what you want to do. And by "top ten bigrams", do you mean the first ten bigrams in the text, or the ten most frequent ones? I assumed the latter, but if not, simply remove the sorting and restrict the text to the first 11 words.
from nltk.util import bigrams
from nltk import tokenize, pos_tag
from collections import defaultdict

counts = defaultdict(int)
counts_pos = defaultdict(int)

with open('twocities.txt') as f:
    txt = f.read().lower()
txt = tokenize.word_tokenize(txt)

# Generate the lexical bigrams
bg = bigrams(txt)

# Do part-of-speech tagging and generate
# lexical+pos bigrams
pos = pos_tag(txt)
bg_pos = bigrams(pos)

# Count the number of occurrences of each unique bigram
for bigram in bg:
    counts[bigram] += 1
for bigram in bg_pos:
    counts_pos[bigram] += 1

# Make a list of bigrams sorted on number of occurrences
sortedbigrams = sorted(counts, key=lambda x: counts[x], reverse=True)
sortedbigrams_pos = sorted(counts_pos, key=lambda x: counts_pos[x],
                           reverse=True)

# Remove bigrams that occur less than the given threshold
print 'Number of bigrams before thresholding: %i, %i' % \
    (len(sortedbigrams), len(sortedbigrams_pos))
min_occurence = 2
sortedbigrams = [x for x in sortedbigrams if counts[x] > min_occurence]
sortedbigrams_pos = [x for x in sortedbigrams_pos if
                     counts_pos[x] > min_occurence]
print 'Number of bigrams after thresholding: %i, %i\n' % \
    (len(sortedbigrams), len(sortedbigrams_pos))

# Print results
print 'Top 10 lexical bigrams:'
for i in range(10):
    print sortedbigrams[i], counts[sortedbigrams[i]]
print '\nTop 10 lexical+pos bigrams:'
for i in range(10):
    print sortedbigrams_pos[i], counts_pos[sortedbigrams_pos[i]]
My nltk installation only works with Python 2.6; if I had it installed on 2.7, I would use Counter instead of defaultdict.
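The Counter variant mentioned above would replace both counting loops and the sort: `Counter` takes the bigram iterable directly and provides `most_common()`. A sketch on Python 3, with a hypothetical token list in place of the tokenized file:

```python
from collections import Counter

# Hypothetical token stream; in the script above this would be
# the tokenize.word_tokenize(...) output
tokens = ['it', 'was', 'the', 'best', 'of', 'times',
          'it', 'was', 'the', 'worst', 'of', 'times']
bigrams = list(zip(tokens, tokens[1:]))

# Counter replaces the defaultdict loop, and most_common()
# replaces the explicit sorted(...) call
counts = Counter(bigrams)
for bigram, n in counts.most_common(3):
    print(bigram, n)
```

Thresholding then becomes a comprehension over `counts.items()`, e.g. `[b for b, n in counts.items() if n > 2]`.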
Using this script on the first page of A Tale of Two Cities, I get the following output:
Top 10 lexical bigrams:
(',', 'and') 17
('it', 'was') 12
('of', 'the') 11
('in', 'the') 11
('was', 'the') 11
(',', 'it') 9
('and', 'the') 6
('with', 'a') 6
('on', 'the') 5
(',', 'we') 4
Top 10 lexical+pos bigrams:
((',', ','), ('and', 'CC')) 17
(('it', 'PRP'), ('was', 'VBD')) 12
(('in', 'IN'), ('the', 'DT')) 11
(('was', 'VBD'), ('the', 'DT')) 11
(('of', 'IN'), ('the', 'DT')) 11
((',', ','), ('it', 'PRP')) 9
(('and', 'CC'), ('the', 'DT')) 6
(('with', 'IN'), ('a', 'DT')) 6
(('on', 'IN'), ('the', 'DT')) 5
(('and', 'CC'), ('a', 'DT')) 4
I assume you mean the ten most frequent bigrams, and I have excluded bigrams in which one of the tokens is punctuation.
import nltk, collections, string
import nltk.book

def bigrams_by_word_freq(tokens, min_freq=3):
    def unique(seq):  # http://www.peterbe.com/plog/uniqifiers-benchmark
        seen = set()
        seen_add = seen.add
        return [x for x in seq if x not in seen and not seen_add(x)]

    punct = set(string.punctuation)
    bigrams = unique(nltk.bigrams(tokens))
    pos = dict(nltk.pos_tag(tokens))
    count = collections.Counter(tokens)
    bigrams = filter(lambda (a, b): not punct.intersection({a, b})
                     and count[a] >= min_freq and count[b] >= min_freq,
                     bigrams)
    return tuple((a, pos[a], b, pos[b]) for a, b in bigrams)

text = """Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again."""

print bigrams_by_word_freq(nltk.wordpunct_tokenize(text), min_freq=2)
print bigrams_by_word_freq(nltk.book.text6)[:10]
Output:
(('Humpty', 'NNP', 'Dumpty', 'NNP'), ('the', 'DT', 'king', 'NN'))
(('SCENE', 'NNP', '1', 'CD'), ('clop', 'NN', 'clop', 'NN'), ('It', 'PRP', 'is', 'VBZ'), ('is', 'VBZ', 'I', 'PRP'), ('son', 'NN', 'of', 'IN'), ('from', 'IN', 'the', 'DT'), ('the', 'DT', 'castle', 'NN'), ('castle', 'NN', 'of', 'IN'), ('of', 'IN', 'Camelot', 'NNP'), ('King', 'NNP', 'of', 'IN'))
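Note that the `lambda (a, b): ...` on the filter line is Python 2 only; tuple parameter unpacking in function signatures was removed in Python 3 (PEP 3113). A sketch of the same logic in Python 3, with a hard-coded stand-in for `nltk.pos_tag` so it runs without NLTK (the token list and tag dict below are hypothetical illustration data):

```python
import collections
import string

def bigrams_by_word_freq(tokens, pos, min_freq=3):
    punct = set(string.punctuation)
    count = collections.Counter(tokens)
    seen = set()
    result = []
    for a, b in zip(tokens, tokens[1:]):   # lexical bigrams, in order
        if (a, b) in seen:                 # keep first occurrence only
            continue
        seen.add((a, b))
        if punct & {a, b}:                 # drop punctuation bigrams
            continue
        if count[a] >= min_freq and count[b] >= min_freq:
            result.append((a, pos[a], b, pos[b]))
    return tuple(result)

tokens = ['Humpty', 'Dumpty', 'sat', 'on', 'a', 'wall', ',',
          'Humpty', 'Dumpty', 'had', 'a', 'great', 'fall']
# Stand-in for dict(nltk.pos_tag(tokens)); tags chosen by hand
pos = {'Humpty': 'NNP', 'Dumpty': 'NNP', 'sat': 'VBD', 'on': 'IN',
       'a': 'DT', 'wall': 'NN', ',': ',', 'had': 'VBD',
       'great': 'JJ', 'fall': 'NN'}
print(bigrams_by_word_freq(tokens, pos, min_freq=2))
```

With real NLTK available, `pos` would simply be `dict(nltk.pos_tag(tokens))` as in the answer above.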