I am new to Python. I used collections.Counter to count the most frequent bigrams in a text:
import sys, codecs
import nltk, collections
from nltk.util import ngrams
InputFile = codecs.open("testin.txt", 'r', 'utf-8')
text=InputFile.read().lower()
tokens = text.split()
bi_tokens = ngrams(tokens, 2)
bi_freq = collections.Counter(bi_tokens)
If I use:
for row in bi_freq.most_common(100):
    print(row)
The result appears as:
(('star', 'wars'), 29)
(('blu', 'ray'), 21)
If I use:
for row in bi_freq.most_common(1000):
    print(row[0], "\t", row[1])
The result appears a bit cleaner as:
('star', 'wars') 29
('blu', 'ray') 21
I would like to get to:
star wars 29
blu ray 21
which I would import into a spreadsheet in two columns with tab as a separator.
So my question is: how do I access each tuple value, when the tuple is a key in a dictionary, so that I can concatenate them into a string? Thanks in advance.
Edit: I did this:
for row in bi_freq.most_common(100):
    wordlist_in_bigram = row[0]
    print(wordlist_in_bigram[0], wordlist_in_bigram[1], "\t", row[1])
And the result seems to be what I wanted:
star wars 29
blu ray 21
Is this a good solution? Thanks
Use join() to create a delimited string from a sequence.
for bigram, c in bi_freq.most_common(1000):
    print(" ".join(bigram), c)
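Since the question asks for two tab-separated columns for a spreadsheet, here is a minimal sketch of the same join() idea with an explicit tab between the bigram and its count. The helper name bigram_rows and the sample text are made up for illustration, and zip(tokens, tokens[1:]) stands in for nltk.util.ngrams(tokens, 2) only so that the snippet runs without NLTK installed.

```python
import collections


def bigram_rows(text, n=100):
    """Return the n most common bigrams as 'word1 word2<TAB>count' strings."""
    tokens = text.lower().split()
    # Pair each token with its successor (assumed stand-in for ngrams(tokens, 2)).
    bi_freq = collections.Counter(zip(tokens, tokens[1:]))
    # most_common() yields ((word1, word2), count), so each item unpacks directly.
    return ["{}\t{}".format(" ".join(bigram), count)
            for bigram, count in bi_freq.most_common(n)]


# Hypothetical sample text, not the contents of testin.txt.
sample = "star wars blu ray star wars blu ray star wars"
for line in bigram_rows(sample):
    print(line)
```

Unpacking the key and the count in the for statement avoids the row[0][0] style of indexing, and join() keeps working unchanged if you later switch from bigrams to trigrams.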