I'm new to Python programming and hope someone can help me.
I have to print the first ten bigrams of a corpus in this form:
((token),(POS_tag),(token),(POS_tag))
where each token must occur more than twice.
So far I have built a list of POS-tagged tokens and paired them up with bigrams().
How can I check that the number of occurrences of each word (the token in each pair) is greater than 2?
Your question is vague for various reasons. For one, the title could really be worded much better. You don't explain very well what you want to do. And with "first ten bigrams", do you literally mean the first bigrams in the text, or the ten most frequent ones? I assumed it was the latter, but if it isn't, just remove the sorting and limit your text to the first eleven words.
from nltk.util import bigrams
from nltk import tokenize, pos_tag
from collections import defaultdict

counts = defaultdict(int)
counts_pos = defaultdict(int)

with open('twocities.txt') as f:
    txt = f.read().lower()
txt = tokenize.word_tokenize(txt)

# Generate the lexical bigrams
bg = bigrams(txt)

# Do part-of-speech tagging and generate
# lexical+pos bigrams
pos = pos_tag(txt)
bg_pos = bigrams(pos)

# Count the number of occurrences of each unique bigram
for bigram in bg:
    counts[bigram] += 1
for bigram in bg_pos:
    counts_pos[bigram] += 1

# Make a list of bigrams sorted on number of occurrences
sortedbigrams = sorted(counts, key=lambda x: counts[x], reverse=True)
sortedbigrams_pos = sorted(counts_pos, key=lambda x: counts_pos[x],
                           reverse=True)

# Keep only bigrams that occur more than min_occurence times
print('Number of bigrams before thresholding: %i, %i' %
      (len(sortedbigrams), len(sortedbigrams_pos)))
min_occurence = 2
sortedbigrams = [x for x in sortedbigrams if counts[x] > min_occurence]
sortedbigrams_pos = [x for x in sortedbigrams_pos if
                     counts_pos[x] > min_occurence]
print('Number of bigrams after thresholding: %i, %i\n' %
      (len(sortedbigrams), len(sortedbigrams_pos)))

# Print results (slicing avoids an IndexError if fewer than 10 bigrams remain)
print('Top 10 lexical bigrams:')
for bigram in sortedbigrams[:10]:
    print('%s %i' % (bigram, counts[bigram]))
print('\nTop 10 lexical+pos bigrams:')
for bigram in sortedbigrams_pos[:10]:
    print('%s %i' % (bigram, counts_pos[bigram]))
My nltk installation is only for Python 2.6; if I had it installed on 2.7, I'd use a Counter instead of a defaultdict.
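For reference, the Counter variant mentioned here might look roughly like the sketch below. It is not the code that produced the output in this answer: the function name `top_bigrams` and its parameters are made up for illustration, bigrams are built with `zip` rather than nltk so the snippet has no dependencies, and `tokens` is assumed to be a plain list.

```python
from collections import Counter

def top_bigrams(tokens, min_count=3, n=10):
    """Return up to n (bigram, count) pairs that occur at least min_count times.

    Hypothetical helper; `tokens` is assumed to be a list so it can be sliced.
    """
    # zip(tokens, tokens[1:]) pairs every token with its successor
    counts = Counter(zip(tokens, tokens[1:]))
    # most_common() returns (item, count) pairs sorted by descending count
    frequent = [(bg, c) for bg, c in counts.most_common() if c >= min_count]
    return frequent[:n]

tokens = "it was the best of times , it was the worst of times".split()
for bigram, count in top_bigrams(tokens, min_count=2):
    print('%s %i' % (bigram, count))
```

With a Counter, the separate counting loop and the explicit `sorted` call collapse into the constructor and `most_common()`.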
Running this script on the first page of A Tale of Two Cities, I get the following output:
Top 10 lexical bigrams:
(',', 'and') 17
('it', 'was') 12
('of', 'the') 11
('in', 'the') 11
('was', 'the') 11
(',', 'it') 9
('and', 'the') 6
('with', 'a') 6
('on', 'the') 5
(',', 'we') 4
Top 10 lexical+pos bigrams:
((',', ','), ('and', 'CC')) 17
(('it', 'PRP'), ('was', 'VBD')) 12
(('in', 'IN'), ('the', 'DT')) 11
(('was', 'VBD'), ('the', 'DT')) 11
(('of', 'IN'), ('the', 'DT')) 11
((',', ','), ('it', 'PRP')) 9
(('and', 'CC'), ('the', 'DT')) 6
(('with', 'IN'), ('a', 'DT')) 6
(('on', 'IN'), ('the', 'DT')) 5
(('and', 'CC'), ('a', 'DT')) 4
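Note that the script above thresholds on how often each bigram occurs. If the other reading of the question is the intended one (the first ten bigrams in text order, where each token on its own occurs more than twice), a rough sketch could look like this. The name `first_frequent_bigrams` and its parameters are hypothetical, the input is assumed to be an already tokenized list, and repeated bigrams are deliberately not removed.

```python
from collections import Counter

def first_frequent_bigrams(tokens, min_count=3, n=10):
    """First n bigrams, in text order, whose two tokens each occur
    at least min_count times. Hypothetical helper for illustration."""
    freq = Counter(tokens)  # per-token counts, not per-bigram counts
    result = []
    for a, b in zip(tokens, tokens[1:]):
        if freq[a] >= min_count and freq[b] >= min_count:
            result.append((a, b))
            if len(result) == n:
                break  # stop as soon as enough bigrams are collected
    return result

tokens = "the cat saw the dog and the cat saw the bird".split()
print(first_frequent_bigrams(tokens, min_count=2, n=3))
```

The default `min_count=3` corresponds to "more than 2" in the question.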
I assumed you meant the first ten bigrams, and I excluded bigrams where one of the tokens is punctuation.
import nltk, collections, string
import nltk.book

def bigrams_by_word_freq(tokens, min_freq=3):
    def unique(seq):  # http://www.peterbe.com/plog/uniqifiers-benchmark
        # Order-preserving de-duplication
        seen = set()
        seen_add = seen.add
        return [x for x in seq if x not in seen and not seen_add(x)]

    punct = set(string.punctuation)
    bigrams = unique(nltk.bigrams(tokens))
    # A dict keeps a single tag per word (the last one the tagger assigned)
    pos = dict(nltk.pos_tag(tokens))
    count = collections.Counter(tokens)
    # Keep bigrams with no punctuation token and both tokens frequent enough
    bigrams = [(a, b) for a, b in bigrams
               if not punct.intersection((a, b))
               and count[a] >= min_freq and count[b] >= min_freq]
    return tuple((a, pos[a], b, pos[b]) for a, b in bigrams)

text = """Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses and all the king's men
Couldn't put Humpty together again."""
print(bigrams_by_word_freq(nltk.wordpunct_tokenize(text), min_freq=2))
print(bigrams_by_word_freq(nltk.book.text6)[:10])
Output:
(('Humpty', 'NNP', 'Dumpty', 'NNP'), ('the', 'DT', 'king', 'NN'))
(('SCENE', 'NNP', '1', 'CD'), ('clop', 'NN', 'clop', 'NN'), ('It', 'PRP', 'is', 'VBZ'), ('is', 'VBZ', 'I', 'PRP'), ('son', 'NN', 'of', 'IN'), ('from', 'IN', 'the', 'DT'), ('the', 'DT', 'castle', 'NN'), ('castle', 'NN', 'of', 'IN'), ('of', 'IN', 'Camelot', 'NNP'), ('King', 'NNP', 'of', 'IN'))
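One caveat with the function above: building `pos` as a dict means every word ends up with a single tag, so a word tagged differently in different contexts is reported with only one of them. A possible variant that keeps the tag from each position is sketched below. It assumes the input is already a list of (token, tag) pairs, such as `nltk.pos_tag` returns. The function name is hypothetical, the punctuation filter is left out for brevity, and the toy input is tagged by hand rather than by a tagger.

```python
from collections import Counter

def tagged_bigrams_by_word_freq(tagged, min_freq=3):
    """tagged: list of (token, tag) pairs. Returns unique
    (token, tag, token, tag) tuples in text order, keeping only bigrams
    whose tokens each occur at least min_freq times. Hypothetical helper."""
    freq = Counter(token for token, _ in tagged)
    seen = set()
    result = []
    for (a, tag_a), (b, tag_b) in zip(tagged, tagged[1:]):
        item = (a, tag_a, b, tag_b)
        if item in seen:
            continue  # order-preserving de-duplication
        if freq[a] >= min_freq and freq[b] >= min_freq:
            seen.add(item)
            result.append(item)
    return result

# Hand-tagged toy input; the tags are illustrative, not tagger output
tagged = [('I', 'PRP'), ('can', 'MD'), ('open', 'VB'), ('the', 'DT'), ('can', 'NN'),
          ('I', 'PRP'), ('can', 'MD'), ('open', 'VB'), ('the', 'DT'), ('can', 'NN')]
print(dict(tagged)['can'])  # the dict only remembers the last tag seen for 'can'
print(tagged_bigrams_by_word_freq(tagged, min_freq=2))
```

Here 'can' appears both as a modal and as a noun, and both tags survive in the result because each bigram carries the tags from its own position.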