How to find the total number of positive and negative words in a text?
I want to find the total number of positive and negative words matched in a given text. I have a list of positive words in the file positive.txt and a list of negative words in the file negative.txt. If a word matches the positive word list, I want a simple integer variable to be incremented by 1, and likewise for a word that matches the negative list. With the code below I get the text that sits under @class=[story-hed]. This is the text I want to compare against the positive and negative word lists, and I also want its total word count. My code is:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dawn.items import DawnItem

class dawnSpider(BaseSpider):
    name = "dawn"
    allowed_domains = ["dawn.com"]
    start_urls = [
        "http://dawn.com/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//h3[@class="story-hed"]//a/text()').extract()
        items = []
        for site in sites:
            item = DawnItem()
            item['title'] = site
            items.append(item)
        return items
The standalone code below (written for Python 2) could do the trick:
from collections import Counter

def readwords(filename):
    # One word per line; the with-block closes the file afterwards
    with open(filename) as f:
        return [line.rstrip() for line in f]

positive = readwords('positive.txt')
negative = readwords('negative.txt')
paragraph = 'this is really bad and in fact awesome. really awesome.'
count = Counter(paragraph.split())
pos = 0
neg = 0
for key, val in count.iteritems():
    key = key.rstrip('.,?!\n') # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val
print pos, neg
Here is what I have in the two input files:
positive.txt:
good
awesome
negative.txt:
bad
ugly
and the output is: 2 1
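The question also asks for the total word count, and stripping only trailing punctuation misses cases such as capitalized or quoted words. A minimal sketch of one way to cover both, assuming the word lists are already loaded and lowercase; the function name `count_sentiment` and the regular expression are illustrative choices, not part of the original answer:

```python
import re

def count_sentiment(text, positive, negative):
    """Return (positive matches, negative matches, total words) for text."""
    # Lowercase the text and keep runs of letters/apostrophes,
    # so "Awesome." is matched as "awesome".
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for word in words if word in positive)
    neg = sum(1 for word in words if word in negative)
    return pos, neg, len(words)

# Sets give constant-time membership tests, unlike lists.
positive = {"good", "awesome"}
negative = {"bad", "ugly"}
paragraph = "this is really bad and in fact awesome. really awesome."
result = count_sentiment(paragraph, positive, negative)
print(result)  # (2, 1, 10)
```

This pattern only recognizes ASCII letters; text in other scripts would need a different tokenizer.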
To implement this in Scrapy, you might want to use an item pipeline: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
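As a rough sketch of what such a pipeline could look like: an item pipeline is a plain Python class with a `process_item(self, item, spider)` method, optionally with an `open_spider` hook. The class name `SentimentCountPipeline`, the `pos` and `neg` fields, and the file names are assumptions for illustration. `DawnItem` would need to declare those two fields, and the class would have to be enabled through the `ITEM_PIPELINES` setting.

```python
import string
from collections import Counter


def load_words(path):
    """Read one word per line into a set for fast membership tests."""
    with open(path) as fd:
        return {line.strip().lower() for line in fd if line.strip()}


class SentimentCountPipeline:
    """Hypothetical pipeline that adds 'pos' and 'neg' counts to each item."""

    def __init__(self, positive=None, negative=None):
        # Word sets can be injected directly (handy for testing).
        self.positive = positive
        self.negative = negative

    def open_spider(self, spider):
        # Called when the spider starts; file names are assumptions.
        if self.positive is None:
            self.positive = load_words('positive.txt')
        if self.negative is None:
            self.negative = load_words('negative.txt')

    def process_item(self, item, spider):
        # Strip surrounding punctuation and lowercase each word of the title.
        words = [w.strip(string.punctuation).lower()
                 for w in item['title'].split()]
        counts = Counter(words)
        item['pos'] = sum(n for w, n in counts.items() if w in self.positive)
        item['neg'] = sum(n for w, n in counts.items() if w in self.negative)
        return item
```

Because the pipeline does not import Scrapy itself, it can be tried outside a crawl by passing it a plain dict with a `title` key.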
First you may want to read the files. Assuming you have one word per line, you can read all the words with the following code:
positive = [l.strip() for l in open("positive.txt")]
Once done, you can create a dict which will hold the word as key and the count as value. To initialize the dict to zero you can use:
positive_count = dict.fromkeys(positive, 0)
Finally you just iterate over all the items (the words of your text) and increment the count if the word is found:
for item in items:
    if item in positive_count:
        positive_count[item] += 1
And finally you can print the results with:
for item, value in positive_count.items():
    print("Word %s count %d" % (item, value))
The negative list is handled the same way; it is omitted here to keep the answer simple.
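Putting those steps together for both lists might look like the sketch below. It uses `collections.Counter` in place of the manual loop, and the helper name `per_word_counts` and the sample data are made up for illustration:

```python
from collections import Counter

def per_word_counts(words, vocabulary):
    """Return {word: occurrences} for every vocabulary word (0 if absent)."""
    vocab = set(vocabulary)
    seen = Counter(w for w in words if w in vocab)
    # A Counter returns 0 for missing keys, so unseen words get a zero entry.
    return {word: seen[word] for word in vocabulary}

text_words = "good bad good ugly awesome".split()
positive_count = per_word_counts(text_words, ["good", "awesome"])
negative_count = per_word_counts(text_words, ["bad", "ugly"])

for word, value in positive_count.items():
    print("Word %s count %d" % (word, value))
```

Summing the values of each dict gives the overall positive and negative totals the question asks for.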
This depends on the size of the word lists. If they are smallish (less than a few kB), then read them into a list:
with open(positive_wordlist_file_name) as fd:
    positive_words = [line.strip() for line in fd]
Once you have two word lists, you can then go through the text with them, line by line if you can. Split the lines into words, and then use the "in" operator to check them against the lists. I'd use a couple of co-routines in a class for it:
import re

class WordCounter:
    def __init__(self, positive_words, negative_words):
        # Word lists read from the files, stored as sets for fast "in" checks
        self.positive_words = set(positive_words)
        self.negative_words = set(negative_words)
        self.positive_count = 0
        self.negative_count = 0

    def positive_word_counter(self):
        """Co-routine that will count positive words."""
        while True:
            words = yield
            self.positive_count += sum(1 for word in words if word in self.positive_words)

    def negative_word_counter(self):
        """Co-routine that will count negative words (mirror of the one above)."""
        while True:
            words = yield
            self.negative_count += sum(1 for word in words if word in self.negative_words)

    def read_text(self, text):
        """Text - some iterable of lines - a file handle, or list or whatever."""
        # Split on runs of non-word characters; refine the pattern as needed
        line_words = (re.split(r'\W+', line.strip()) for line in text)
        # Create and prime coroutines
        positive_counter = self.positive_word_counter()
        next(positive_counter)
        negative_counter = self.negative_word_counter()
        next(negative_counter)
        # Now fire it in
        for words in line_words:
            positive_counter.send(words)
            negative_counter.send(words)
        # You can now read positive_count / negative_count from this object
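The key mechanics of this approach are that a generator-based coroutine has to be advanced once with `next()` before it can accept values, and that values are delivered with `send()`. A small standalone sketch of just that mechanism, with a hypothetical `matching_counter` helper and a shared `totals` dict chosen for illustration:

```python
def matching_counter(vocabulary, totals, key):
    """Coroutine: receives lists of words via send() and tallies matches."""
    totals[key] = 0
    while True:
        words = yield  # pauses here until send() delivers a list of words
        totals[key] += sum(1 for word in words if word in vocabulary)

totals = {}
pos = matching_counter({"good", "awesome"}, totals, "positive")
neg = matching_counter({"bad", "ugly"}, totals, "negative")
next(pos)  # prime: run each coroutine up to its first yield
next(neg)

for line in ["this is really bad", "and in fact awesome", "really awesome"]:
    words = line.split()
    pos.send(words)
    neg.send(words)

print(totals)  # {'positive': 2, 'negative': 1}
```

Calling `send()` on a coroutine that has not been primed raises a `TypeError`, which is why the two `next()` calls come first.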
for key, val in count.iteritems():
This line only works on Python versions below 3. If you are using Python 3 or above, use count.items() instead:
for key, val in count.items():
    key = key.rstrip('.,?!\n') # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val