Distinct words sentiment analysis

I am trying to do a sentiment analysis based on a dictionary of 7000 words. The code works in Python, but it matches all combinations containing a dictionary word instead of only distinct words.

For example, the dictionary says enter and the text says enterprise. How can I change the code so that it doesn't see this as a match?

import re
import string
import sys

dictfile = sys.argv[1]
textfile = sys.argv[2]

a = open(textfile)
text = string.split( a.read() )
a.close()

a = open(dictfile)
lines = a.readlines()
a.close()

dic = {}
scores = {}

current_category = "Default"
scores[current_category] = 0

for line in lines:
   if line[0:2] == '>>':
       current_category = string.strip( line[2:] )
       scores[current_category] = 0
   else:
       line = line.strip()
       if len(line) > 0:
           pattern = re.compile(line, re.IGNORECASE)
           dic[pattern] = current_category

for token in text:
   for pattern in dic.keys():
       if pattern.match( token ):
           categ = dic[pattern]
           scores[categ] = scores[categ] + 1

for key in scores.keys():
   print key, ":", scores[key]

.match() only matches at the beginning of the string, so enter still matches enterprise. You can add an end-of-string anchor to your regex:

re.compile(line + '$')

Or you could use word boundaries:

re.compile(r'\b' + line + r'\b')
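
To make the difference between the two suggestions concrete, here is a minimal sketch that compares the unanchored pattern with the anchored and word-boundary variants. The keyword, the sample tokens and the helper name match_results are made up for illustration; they are not from the original post.

```python
import re

keyword = "enter"  # hypothetical dictionary entry, for illustration only

unanchored = re.compile(keyword, re.IGNORECASE)               # behaviour of the question's code
anchored = re.compile(keyword + "$", re.IGNORECASE)           # end-of-string anchor
bounded = re.compile(r"\b" + keyword + r"\b", re.IGNORECASE)  # word boundaries (raw strings!)

def match_results(token):
    """Return (unanchored, anchored, bounded) match flags for one token."""
    return tuple(bool(p.match(token)) for p in (unanchored, anchored, bounded))

for token in ["enter", "Enter", "enterprise", "center"]:
    print(token, match_results(token))
```

Only the unanchored pattern accepts enterprise. Note that the word-boundary version needs raw strings, because in a normal string literal '\b' is a backspace character. If the dictionary entries are meant as literal words rather than regular expressions, wrapping them in re.escape() is probably a good idea as well.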

  1. Your indentation is inconsistent. Some levels use 3 spaces, some use 4 spaces.

  2. You try to match every word in your text against all 7000 words in your dictionary. Instead, just look up the word in your dictionary. If it's not there, ignore the error (EAFP principle).

  3. Also, I'm not sure there is any advantage in using the string module functions ( string.split() ) over the string object's methods ( "".split() ). The module-level functions no longer exist in Python 3.

  4. Python also has a defaultdict, which gives missing keys a default value of 0 by itself when created as defaultdict(int).

EDIT:

  1. Instead of .readlines() I use .read() and .split('\n') . This gets rid of the newline characters.

  2. Splitting the text not on whitespace (the default) but on the regexp r'\W+' (everything that's not a "word character") is my attempt to get rid of punctuation.

Below is my proposed code:

import re
import sys
from collections import defaultdict

dictfile = sys.argv[1]
textfile = sys.argv[2]

with open(textfile) as f:
    text = f.read()

with open(dictfile) as f:
    lines = f.read()

categories = {}
scores = defaultdict(int)

current_category = "Default"
scores[current_category] = 0

for line in lines.split('\n'):
    if line.startswith('>>'):
        current_category = line[2:].strip()
    else:
        keyword = line.strip()
        if keyword:
            categories[keyword] = current_category

for word in re.split(r'\W+', text):
    try:
        scores[categories[word]] += 1
    except KeyError:
        # not in dictionary
        pass

for keyword in scores.keys():
    print("{}: {}".format(keyword, scores[keyword]))
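
The dictionary-lookup approach can also be tried without any files. The following is only a sketch under a few assumptions: the function names parse_dictionary and score_text and the sample data are invented for illustration, and lower-casing both the keywords and the text is my guess at how to keep the case-insensitive behaviour that the question got from re.IGNORECASE.

```python
import re
from collections import defaultdict

def parse_dictionary(dict_text):
    """Map each lower-cased keyword to its category; '>>Name' lines start a new category."""
    categories = {}
    current_category = "Default"
    for line in dict_text.split("\n"):
        if line.startswith(">>"):
            current_category = line[2:].strip()
        else:
            keyword = line.strip().lower()
            if keyword:
                categories[keyword] = current_category
    return categories

def score_text(text, categories):
    """Count, per category, how many whole words of the text are dictionary keywords."""
    scores = defaultdict(int)
    for word in re.split(r"\W+", text.lower()):
        try:
            scores[categories[word]] += 1
        except KeyError:
            # not in dictionary (this also covers empty strings produced by the split)
            pass
    return dict(scores)

# Made-up sample data, not from the original post.
sample_dict = ">>Positive\nenter\ngood\n>>Negative\nbad\n"
sample_text = "Enter the enterprise: good, not bad. Good!"

print(score_text(sample_text, parse_dictionary(sample_dict)))
```

Because each word is looked up as a whole dictionary key, enterprise is never counted for enter, and the cost per word no longer depends on the size of the dictionary.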
