Python - Ignore numbers and symbols in bigram frequency
I'm trying to find the bigram frequencies in text from a txt file. So far it works, but it also counts numbers and symbols. Here is the code I have:
import nltk
from nltk.collocations import *
import prettytable

# Read and tokenize the tweets
file = open('tweets.txt').read()
tokens = nltk.word_tokenize(file)

# Table for displaying the results
pt = prettytable.PrettyTable(['Words', 'Counts'])
pt.align['Words'] = 'l'
pt.align['Counts'] = 'r'

# Build bigrams and count their frequencies
bgs = nltk.bigrams(tokens)
fdist = nltk.FreqDist(bgs)
for row in fdist.most_common(100):
    pt.add_row(row)
print(pt)
Below is the code output:
+------------------------------------+--------+
| Words | Counts |
+------------------------------------+--------+
| ('https', ':') | 1615 |
| ('!', '#') | 445 |
| ('Thank', 'you') | 386 |
| ('.', '``') | 358 |
| ('.', 'I') | 354 |
| ('.', 'Thank') | 337 |
| ('``', '@') | 320 |
| ('&', 'amp') | 290 |
Is there a way to ignore numbers and symbols (like !, ., ?, :)? Since the texts are tweets, I want to ignore numbers and symbols, except for the #'s and @'s.
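One simple option is to filter the token list before building the bigrams at all, keeping only alphabetic tokens plus the single-character # and @ symbols. A minimal sketch (the token list below is a hypothetical sample standing in for what nltk.word_tokenize would produce from tweets.txt):

```python
# Tokens as nltk.word_tokenize would produce them for a sample tweet
# (hypothetical sample, standing in for the real tweets.txt contents)
tokens = ['Thank', 'you', '!', '#', '123', 'https', ':', '@', 'twitter']

# Keep alphabetic tokens plus the single-character '#' and '@' symbols;
# numbers and other punctuation are dropped before any bigrams are built
kept = [t for t in tokens if t.isalpha() or t in '#@']
print(kept)  # ['Thank', 'you', '#', 'https', '@', 'twitter']
```

Feeding `kept` instead of `tokens` into nltk.bigrams would then produce a FreqDist free of number- and punctuation-only entries. Note this changes which words become adjacent, since dropped tokens no longer separate their neighbours.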
An fdist for bigrams is a tuple of tuples, each containing a bigram tuple and a count integer, so we need to access each bigram tuple and keep only the ones we want, together with that bigram's count. Try this:
import nltk
from nltk.probability import FreqDist
from nltk.util import ngrams
from pprint import pprint

def filter_most_common_bigrams(mc_bigrams_counts):
    filtered_mc_bigrams_counts = []
    for mc_bigram_count in mc_bigrams_counts:
        bigram, count = mc_bigram_count
        # Keep all-letter bigrams, or bigrams whose first element is # or @
        # and whose second element consists of letters
        if all(gram.isalpha() for gram in bigram) or (bigram[0] in "#@" and bigram[1].isalpha()):
            filtered_mc_bigrams_counts.append((bigram, count))
    return tuple(filtered_mc_bigrams_counts)

text = """Is there a way to ignore numbers and symbols ( like !,.,?,:)?
Since the text are tweets, I want to ignore numbers and symbols, except for the #'s and @'s
https: !# . Thank you . `` 12 hi . 1st place 1 love 13 in @twitter # twitter"""

tokenized_text = nltk.word_tokenize(text)
bigrams = ngrams(tokenized_text, 2)
fdist = FreqDist(bigrams)
mc_bigrams_counts = fdist.most_common(100)
pprint(filter_most_common_bigrams(mc_bigrams_counts))
The key piece of code is:
if all(gram.isalpha() for gram in bigram) or (bigram[0] in "#@" and bigram[1].isalpha()):
    filtered_mc_bigrams_counts.append((bigram, count))
This checks that all the unigrams in the bigram consist of letters, or, alternatively, that the first element is a # or @ symbol and the second consists of letters (`and` binds tighter than `or`, so the condition groups as intended). Only bigrams satisfying one of these conditions are appended, each in a tuple with the bigram's fdist count.
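The filter condition above can be checked in isolation against a few representative bigrams (the helper name `keep` is just for illustration):

```python
def keep(bigram):
    # 'and' binds tighter than 'or', so this reads:
    # "all elements alphabetic, OR (first is #/@ AND second is alphabetic)"
    return all(gram.isalpha() for gram in bigram) or (bigram[0] in "#@" and bigram[1].isalpha())

print(keep(('Thank', 'you')))  # True:  both elements are alphabetic
print(keep(('#', 'twitter')))  # True:  leading hashtag, alphabetic second element
print(keep(('https', ':')))    # False: punctuation in second position
print(keep(('12', 'hi')))      # False: number in first position
```

One caveat: `bigram[0] in "#@"` is a substring test, so it only matches single-character tokens such as '#' or '@'. That is fine here because nltk.word_tokenize splits "@twitter" into the two tokens '@' and 'twitter'.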
Results:
((('to', 'ignore'), 2),
(('and', 'symbols'), 2),
(('ignore', 'numbers'), 2),
(('numbers', 'and'), 2),
(('for', 'the'), 1),
(('@', 'twitter'), 1),
(('Is', 'there'), 1),
(('text', 'are'), 1),
(('a', 'way'), 1),
(('Thank', 'you'), 1),
(('want', 'to'), 1),
(('Since', 'the'), 1),
(('I', 'want'), 1),
(('#', 'twitter'), 1),
(('the', 'text'), 1),
(('are', 'tweets'), 1),
(('way', 'to'), 1),
(('except', 'for'), 1),
(('there', 'a'), 1))