Counting bi-gram frequencies
I've written a piece of code that essentially counts word frequencies and inserts them into an ARFF file for use with Weka. I'd like to alter it so that it can count bi-gram frequencies, i.e. pairs of words instead of single words, although my attempts have proved unsuccessful at best.
I realise there's a lot to look at, but any help on this is greatly appreciated. Here's my code:
import re
import nltk
# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
# create list of lower case words
word_list = re.split(r'\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
word_list = [punctuation.sub("", word) for word in word_list]
word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]
# create dictionary of word:frequency pairs
freq_dic = {}
for word in word_list2:
    # form dictionary
    try:
        freq_dic[word] += 1
    except KeyError:
        freq_dic[word] = 1
print '-' * 30
print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
# display result as top 10 most frequent words
freq_list3 = freq_list2[:10]
words = []
for item in freq_list3:
    words.append(str(item[1]).lower())
f = open(filename)
newlist = []
for line in f:
    line = punctuation.sub("", line)
    newlist.append(line.lower())
f2 = open('Lines.txt', 'w')
newlist2 = []
for line in newlist:
    line = line.split()
    newlist2.append(line)
    f2.write(str(line))
    f2.write("\n")
print newlist2
# ARFF creation
arff = open('output.arff', 'w')
arff.write('@RELATION wordfrequency\n\n')
for word in words:
    arff.write('@ATTRIBUTE ')
    arff.write(str(word))
    arff.write(' numeric\n')
arff.write('@ATTRIBUTE class {endofworld, notendofworld}\n\n')
arff.write('@DATA\n')
# Counting word frequencies for each verse
for line in newlist2:
    word_occurrences = ""
    for word in words:
        matches = 0
        for item in line:
            if item == word:
                matches += 1
        word_occurrences = word_occurrences + str(matches) + ","
    word_occurrences = word_occurrences + "endofworld"
    arff.write(word_occurrences)
    arff.write("\n")
print words
This should get you started:
def bigrams(words):
    wprev = None
    for w in words:
        yield (wprev, w)
        wprev = w
Note that the first bigram is (None, w1), where w1 is the first word, so you have a special bigram that marks start-of-text. If you also want an end-of-text bigram, add yield (wprev, None) after the loop.
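For instance, the generator can feed a frequency dictionary directly; a minimal sketch (the sample word list is made up, and the print() form runs under both Python 2 and 3):

```python
from collections import defaultdict

def bigrams(words):
    # same generator as above: yields a start-of-text (None, w1) pair first
    wprev = None
    for w in words:
        yield (wprev, w)
        wprev = w

counts = defaultdict(int)
for bg in bigrams(['the', 'cat', 'sat', 'on', 'the', 'cat']):
    counts[bg] += 1

print(counts[('the', 'cat')])  # 2
print(counts[(None, 'the')])   # 1 (the start-of-text bigram)
```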
Generalized to n-grams with optional padding; this also uses defaultdict(int) for the frequencies, so it works in 2.6:
from collections import defaultdict

def ngrams(words, n=2, padding=False):
    "Compute n-grams with optional padding"
    pad = [] if not padding else [None] * (n - 1)
    grams = pad + words + pad
    return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))

# grab n-grams
words = ['the', 'cat', 'sat', 'on', 'the', 'dog', 'on', 'the', 'cat']
for size, padding in ((3, 0), (4, 0), (2, 1)):
    print '\n%d-grams padding=%d' % (size, padding)
    print list(ngrams(words, size, padding))

# show frequency
counts = defaultdict(int)
for ng in ngrams(words, 2, False):
    counts[ng] += 1
print '\nfrequencies of bigrams:'
for c, ng in sorted(((c, ng) for ng, c in counts.iteritems()), reverse=True):
    print c, ng
Output:
3-grams padding=0
[('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ('sat', 'on', 'the'),
('on', 'the', 'dog'), ('the', 'dog', 'on'), ('dog', 'on', 'the'),
('on', 'the', 'cat')]
4-grams padding=0
[('the', 'cat', 'sat', 'on'), ('cat', 'sat', 'on', 'the'),
('sat', 'on', 'the', 'dog'), ('on', 'the', 'dog', 'on'),
('the', 'dog', 'on', 'the'), ('dog', 'on', 'the', 'cat')]
2-grams padding=1
[(None, 'the'), ('the', 'cat'), ('cat', 'sat'), ('sat', 'on'),
('on', 'the'), ('the', 'dog'), ('dog', 'on'), ('on', 'the'),
('the', 'cat'), ('cat', None)]
frequencies of bigrams:
2 ('the', 'cat')
2 ('on', 'the')
1 ('the', 'dog')
1 ('sat', 'on')
1 ('dog', 'on')
1 ('cat', 'sat')
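To plug this into the ARFF step of the question, each verse's attribute values would be the number of times each chosen top bigram occurs in that line. A hypothetical sketch (the attribute bigrams and the line are made up; the ngrams helper is repeated so the snippet is self-contained):

```python
def ngrams(words, n=2, padding=False):
    # n-gram helper as above, without padding by default
    pad = [] if not padding else [None] * (n - 1)
    grams = pad + words + pad
    return (tuple(grams[i:i+n]) for i in range(len(grams) - (n - 1)))

# hypothetical top bigrams chosen as ARFF attributes
top_bigrams = [('the', 'cat'), ('on', 'the')]
line = ['the', 'cat', 'sat', 'on', 'the', 'cat']

# one count per attribute, in ARFF column order
row = [sum(1 for ng in ngrams(line) if ng == bg) for bg in top_bigrams]
print(row)  # [2, 1]
```

The generator is re-created for each attribute, so each count scans the line's bigrams fresh.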
I've rewritten the first bit for you, because it's icky. Points to note:
collections.Counter is great!
OK, code:
import re
import nltk
import collections
# Quran subset
filename = raw_input('Enter name of file to convert to ARFF with extension, eg. name.txt: ')
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]')
# create list of lower case words
word_list = re.split(r'\s+', open(filename).read().lower())
print 'Words in text:', len(word_list)
words = (punctuation.sub("", word).strip() for word in word_list)
words = (word for word in words if word not in nltk.corpus.stopwords.words('english'))
# create dictionary of word:frequency pairs
frequencies = collections.Counter(words)
print '-'*30
print "sorted by highest frequency first:"
# show all word:frequency pairs
print frequencies
# display result as top 10 most frequent words
print frequencies.most_common(10)
words = [word for word, frequency in frequencies.most_common(10)]
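The same Counter approach extends to bigrams by zipping the word sequence with itself shifted by one, so adjacent pairs become the keys. A small sketch with a made-up word list:

```python
import collections

sample = ['the', 'cat', 'sat', 'on', 'the', 'cat']
# zip the list against itself offset by one to form adjacent pairs
bigram_freqs = collections.Counter(zip(sample, sample[1:]))

print(bigram_freqs[('the', 'cat')])  # 2
print(bigram_freqs.most_common(1))   # [(('the', 'cat'), 2)]
```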
Life is much easier if you start using NLTK's FreqDist function to do the counting. NLTK also has a bigram feature. Examples of both are on the following page:
http://nltk.googlecode.com/svn/trunk/doc/book/ch01.html