
Distinct words sentiment analysis

I am trying to do a sentiment analysis based on a dictionary of 7000 words. The code works in Python, but it counts partial matches instead of distinct words.

For example, the dictionary contains enter and the text contains enterprise . How can I change the code so that this is not counted as a match?

import re
import string
import sys

dictfile = sys.argv[1]
textfile = sys.argv[2]

a = open(textfile)
text = string.split( a.read() )
a.close()

a = open(dictfile)
lines = a.readlines()
a.close()

dic = {}
scores = {}

current_category = "Default"
scores[current_category] = 0

for line in lines:
   if line[0:2] == '>>':
       current_category = string.strip( line[2:] )
       scores[current_category] = 0
   else:
       line = line.strip()
       if len(line) > 0:
           pattern = re.compile(line, re.IGNORECASE)
           dic[pattern] = current_category

for token in text:
   for pattern in dic.keys():
       if pattern.match( token ):
           categ = dic[pattern]
           scores[categ] = scores[categ] + 1

for key in scores.keys():
   print key, ":", scores[key]

.match() matches from the beginning of the string only, so a prefix match like enter against enterprise still succeeds. You can use an end anchor in your regex:

re.compile(line + '$')

Or you could use word boundaries:

re.compile(r'\b' + line + r'\b')
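To see the difference concretely, here is a small sketch comparing a plain prefix match against the anchored and word-boundary variants (the word enter is taken from the question; note the raw strings, since '\b' in a plain string literal is a backspace character, not a word boundary):

```python
import re

word = "enter"  # example dictionary entry from the question

prefix = re.compile(word, re.IGNORECASE)
anchored = re.compile(word + '$', re.IGNORECASE)
# raw strings, so \b is a word boundary rather than a backspace
bounded = re.compile(r'\b' + word + r'\b', re.IGNORECASE)

print(bool(prefix.match("enterprise")))    # True: prefix match counts
print(bool(anchored.match("enterprise")))  # False
print(bool(bounded.match("enterprise")))   # False
print(bool(bounded.match("Enter")))        # True: the whole word still matches
```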

  1. Your indentation is inconsistent. Some levels use 3 spaces, some use 4.

  2. You try to match every word in your text against all 7000 words in your dictionary. Instead, just look the word up in your dictionary. If it's not there, ignore the error (EAFP principle).

  3. Also, I'm not sure there is any advantage to using the module functions ( string.split() ) over the string methods ( "".split() ).

  4. Python also has a defaultdict , which initializes missing dictionary entries to 0 by itself.
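Points 2 and 4 combined look roughly like this (a minimal sketch with a made-up two-word dictionary, not the full program):

```python
from collections import defaultdict

scores = defaultdict(int)  # missing categories start at 0 automatically
categories = {"good": "Positive", "bad": "Negative"}  # toy dictionary

for word in ["good", "enterprise", "bad", "good"]:
    try:
        scores[categories[word]] += 1  # EAFP: just attempt the lookup
    except KeyError:
        pass  # word not in the dictionary: ignore it

print(dict(scores))  # {'Positive': 2, 'Negative': 1}
```

Because the lookup is by exact dictionary key, enterprise no longer counts as a hit for enter.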

EDIT:

  1. Instead of .readlines() I use .read() and .split('\n') . This gets rid of the newline characters.

  2. Splitting the text not on the default space character but on the regexp '\W+' (everything that's not a "word character") is my attempt to get rid of punctuation.
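As a quick check of that splitting behaviour (note that it also splits on apostrophes and can leave an empty string at the edges):

```python
import re

text = "Hello, world! It's a test."
print(re.split(r'\W+', text))
# ['Hello', 'world', 'It', 's', 'a', 'test', '']
```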

Below is my proposed code:

import re
import sys
from collections import defaultdict

dictfile = sys.argv[1]
textfile = sys.argv[2]

with open(textfile) as f:
    text = f.read()

with open(dictfile) as f:
    lines = f.read()

categories = {}
scores = defaultdict(int)

current_category = "Default"
scores[current_category] = 0

for line in lines.split('\n'):
    if line.startswith('>>'):
        current_category = line.lstrip('>').strip()
    else:
        keyword = line.strip()
        if keyword:
            categories[keyword] = current_category

for word in re.split('\W+', text):
    try:
        scores[categories[word]] += 1
    except KeyError:
        # word not in the dictionary
        pass

for keyword in scores.keys():
    print("{}: {}".format(keyword, scores[keyword]))
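Wired up on an inline sample instead of the sys.argv files, the whole thing behaves like this (the category-file format, ">>Category" headers followed by keywords, is assumed from the question):

```python
import re
from collections import defaultdict

# inline stand-ins for the dictionary file and the text file
dict_text = ">>Positive\ngood\ngreat\n>>Negative\nbad\n"
sample = "A good day, a great enterprise, a bad one."

categories = {}
current_category = "Default"
for line in dict_text.split('\n'):
    if line.startswith('>>'):
        current_category = line.lstrip('>').strip()
    elif line.strip():
        categories[line.strip()] = current_category

scores = defaultdict(int)
for word in re.split(r'\W+', sample):
    try:
        scores[categories[word]] += 1
    except KeyError:
        pass  # word not in the dictionary

print(dict(scores))  # {'Positive': 2, 'Negative': 1}
```

Note that enterprise contributes nothing here, since only exact keys are looked up.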
