
Searching from a list of words to words in a text file

I am trying to write a program that reads a text file and sorts the comments in it into positive, negative, or neutral. I have tried all sorts of ways to do this, but each time to no avail. I can search for one word with no problems, but any more than that and it doesn't work. Also, I have an if statement, but I've had to use else twice underneath it because it wouldn't let me use elif. Any help with where I'm going wrong would be really appreciated. Thanks in advance.

middle = open("middle_test.txt", "r")
positive = []
negative = []                                        #the empty lists
neutral = []

pos_words = ["GOOD", "GREAT", "LOVE", "AWESOME"]    #the lists I'd like to search
neg_words = ["BAD", "HATE", "SUCKS", "CRAP"]

for tweet in middle:
    words = tweet.split()
    if pos_words in words:                           #doesn't work
        positive.append(words)        
    else:                                            #can't use elif for some reason
        if 'BAD' in words:                           #works but is only 1 word not list
            negative.append(words)
        else:
            neutral.append(words)

Use a Counter, see http://docs.python.org/2/library/collections.html#collections.Counter:

import urllib2
from collections import Counter
from string import punctuation

# data from http://inclass.kaggle.com/c/si650winter11/data
target_url = "http://goo.gl/oMufKm" 
data = urllib2.urlopen(target_url).read()

word_freq = Counter([i.lower().strip(punctuation) for i in data.split()])

pos_words = ["good", "great", "love", "awesome"]
neg_words = ["bad", "hate", "sucks", "crap"]

for i in pos_words:
    # Counter returns 0 for a word that is not in the data, so no try/except is needed
    print i, word_freq[i]

[out]:

good 638
great 1082
love 7716
awesome 2032
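
A minimal sketch of the same idea in Python 3, run on an in-memory string instead of the downloaded data set. The sample text is made up for illustration. It shows that a Counter gives 0 for a word it has never seen instead of raising KeyError:

```python
from collections import Counter
from string import punctuation

# Made-up sample text standing in for the downloaded data (an assumption).
sample = "I love this phone. Great screen, great battery; love it!"

# Lower-case every token and strip leading/trailing punctuation before counting.
word_freq = Counter(w.lower().strip(punctuation) for w in sample.split())

pos_words = ["good", "great", "love", "awesome"]

# A missing key yields 0, so the lookup is safe for every word in the list.
pos_counts = {w: word_freq[w] for w in pos_words}
print(pos_counts)
```

Words that never occur ("good" and "awesome" here) simply show up with a count of 0.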

You could use the code below to count the number of positive and negative words in a paragraph:

from collections import Counter

def readwords( filename ):
    # read one word per line; the with block closes the file afterwards
    with open(filename) as f:
        words = [ line.rstrip() for line in f ]
    return words

# >cat positive.txt 
# good
# awesome
# >cat negative.txt 
# bad
# ugly

positive = readwords('positive.txt')
negative = readwords('negative.txt')

print positive
print negative

paragraph = 'this is really bad and in fact awesome. really awesome.'

count = Counter(paragraph.split())

pos = 0
neg = 0
for key, val in count.iteritems():
    key = key.rstrip('.,?!\n') # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val

print pos, neg

You are not reading the lines from the file. And this line

if pos_words in words:

I think it is checking for the list ["GOOD", "GREAT", "LOVE", "AWESOME"] in words. That is, you are looking in the list of words for an element that is itself the list pos_words = ["GOOD", "GREAT", "LOVE", "AWESOME"].
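
A short sketch of one way around this, written for Python 3: test each word of the list separately with any(). The classify function and the punctuation stripping and upper-casing are my own additions for illustration, not part of the question. It also shows that elif works once the branches are at the same indentation level:

```python
pos_words = ["GOOD", "GREAT", "LOVE", "AWESOME"]
neg_words = ["BAD", "HATE", "SUCKS", "CRAP"]

# `list in list` asks whether one ELEMENT equals the whole list, so this is False.
list_in_list = pos_words in ["GOOD", "DAY"]

def classify(tweet):
    """Return 'positive', 'negative' or 'neutral' for one line of text."""
    # Assumption: strip common punctuation and upper-case so "great!" matches "GREAT".
    words = {w.strip(".,!?;:\"'").upper() for w in tweet.split()}
    if any(w in words for w in pos_words):
        return "positive"
    elif any(w in words for w in neg_words):
        return "negative"
    else:
        return "neutral"

print(list_in_list, classify("I love it"), classify("this sucks!"))
```

Positive words are checked first, as in the question, so a line containing both kinds of word is counted as positive.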

You have some problems. First, create functions that read the comments from the file and divide each comment into words. Write them and check that they work the way you want. Then the main procedure can look like:

for comment in get_comments(file_name):
    words = get_words(comment)
    classified = False
    # at first look for negative comment
    for neg_word in NEGATIVE_WORDS:
        if neg_word in words:
            classified = True
            negatives.append(comment)
            break
    # now look for positive
    if not classified:
        for pos_word in POSITIVE_WORDS:
            if pos_word in words:
                classified = True
                positives.append(comment)
                break
    if not classified:
        neutral.append(comment)
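
The two helpers used above are not shown, so here is one possible sketch of them in Python 3. It assumes one comment per line in the file, and that NEGATIVE_WORDS and POSITIVE_WORDS hold upper-case words like the lists in the question. Both function bodies are illustrative guesses, not the only way to write them:

```python
import string

def get_comments(file_name):
    """Yield each non-empty line of the file as one comment (assumed format)."""
    with open(file_name) as f:
        for line in f:
            line = line.strip()
            if line:
                yield line

def get_words(comment):
    """Return the set of upper-case words in a comment, punctuation stripped."""
    stripped = (w.strip(string.punctuation).upper() for w in comment.split())
    # Drop tokens that consisted only of punctuation.
    return {w for w in stripped if w}

print(sorted(get_words("Great phone, love it!")))
```

Returning a set makes the `neg_word in words` and `pos_word in words` checks in the main loop fast, and testing each helper on its own first makes it easier to see where things go wrong.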

Be careful: open() returns a file object.

>>> f = open('workfile', 'w')
>>> print f
<open file 'workfile', mode 'w' at 80a0960>

Use this:

>>> f.readline()
'This is the first line of the file.\n'

Then use set intersection:

positive += list(set(pos_words) & set(tweet.split())) 
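
Note that the line above collects the matching words themselves, not the tweets. A small Python 3 sketch that uses the same intersection to sort whole tweets into the three lists instead; the sample tweets and the matched_words helper are made up for illustration:

```python
pos_words = {"GOOD", "GREAT", "LOVE", "AWESOME"}
neg_words = {"BAD", "HATE", "SUCKS", "CRAP"}

def matched_words(tweet, vocabulary):
    """Return the words of `tweet` that also appear in `vocabulary`, sorted."""
    return sorted(vocabulary & set(tweet.split()))

# Made-up sample lines standing in for the contents of the file (an assumption).
tweets = ["THIS IS GREAT AND AWESOME", "I HATE MONDAYS", "JUST A TWEET"]

positive, negative, neutral = [], [], []
for tweet in tweets:
    # An empty list is falsy, so a non-empty intersection selects the branch.
    if matched_words(tweet, pos_words):
        positive.append(tweet)
    elif matched_words(tweet, neg_words):
        negative.append(tweet)
    else:
        neutral.append(tweet)

print(positive, negative, neutral)
```

The match is exact and case-sensitive, so the text would need to be upper-cased and stripped of punctuation first if the file is not already in that form.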


 