Distinct words sentiment analysis

Question

I am trying to do a sentiment analysis based on a dictionary of 7000 words. The code works in Python, but it selects all combinations instead of distinct words.

For example, the dictionary says enter and the text says enterprise. How can I change the code so that it doesn't see this as a match?

import re
import string
import sys

dictfile = sys.argv[1]
textfile = sys.argv[2]

a = open(textfile)
text = string.split( a.read() )
a.close()

a = open(dictfile)
lines = a.readlines()
a.close()

dic = {}
scores = {}

current_category = "Default"
scores[current_category] = 0

for line in lines:
   if line[0:2] == '>>':
       current_category = string.strip( line[2:] )
       scores[current_category] = 0
   else:
       line = line.strip()
       if len(line) > 0:
           pattern = re.compile(line, re.IGNORECASE)
           dic[pattern] = current_category

for token in text:
   for pattern in dic.keys():
       if pattern.match( token ):
           categ = dic[pattern]
           scores[categ] = scores[categ] + 1

for key in scores.keys():
   print key, ":", scores[key]

Answer 1

.match() only anchors at the beginning of the string, so the pattern enter also matches the start of enterprise. You can add an end-of-string anchor to your regex:

re.compile(line + '$')

Or you could use word boundaries:

re.compile(r'\b' + line + r'\b')
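
To make the difference between the patterns concrete, here is a small sketch in Python 3. The keyword and tokens are made up for illustration, and wrapping the keyword in re.escape() is an extra precaution I am assuming you want in case a dictionary entry contains regex metacharacters.

```python
import re

keyword = "enter"  # illustrative dictionary entry
tokens = ["enter", "Enterprise", "enter-key", "ENTER"]  # illustrative tokens

# Original behaviour: the pattern only has to match at the start of the token.
prefix = re.compile(keyword, re.IGNORECASE)
# End-of-string anchor: the token must end right after the keyword.
anchored = re.compile(re.escape(keyword) + "$", re.IGNORECASE)
# Word boundaries: the keyword must not be followed by another word character.
bounded = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)

prefix_hits = [t for t in tokens if prefix.match(t)]
anchored_hits = [t for t in tokens if anchored.match(t)]
bounded_hits = [t for t in tokens if bounded.match(t)]

print(prefix_hits)
print(anchored_hits)
print(bounded_hits)
```

Note that the two fixes are not equivalent: the word-boundary version still accepts a token such as enter-key, because the hyphen counts as a boundary, while the anchored version rejects it.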

Answer 2

Your indentation is inconsistent: some levels use 3 spaces, some use 4.
You try to match every word in your text against all 7000 words in your dictionary. Instead, just look the word up in your dictionary. If it's not there, ignore the error (EAFP principle).
Also, I'm not sure there is any advantage in using the string module functions ( string.split() ) over string methods ( "".split() ). The module functions were removed in Python 3.
Python also has a defaultdict; defaultdict(int) initializes missing keys to 0 by itself.
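
As a rough sketch of the lookup idea (Python 3), with a tiny hand-written mapping standing in for the 7000-word dictionary. The keywords and category names here are invented for the example:

```python
from collections import defaultdict

# Hypothetical keyword -> category mapping; the real one would be
# built from the dictionary file.
categories = {"enter": "Neutral", "good": "Positive", "bad": "Negative"}
words = ["good", "enterprise", "bad", "good", "enter"]

scores = defaultdict(int)  # a missing category starts at 0
for word in words:
    try:
        # One hash lookup per word instead of trying every pattern.
        scores[categories[word]] += 1
    except KeyError:
        # EAFP: the word is not in the dictionary, so skip it.
        pass

print(dict(scores))
```

Because the lookup compares whole strings, enterprise can never be confused with enter.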

EDIT:

Instead of .readlines() I use .read() and .split('\n'). This gets rid of the newline characters.
Splitting the text not on whitespace (the default) but on the regexp '\W+' (everything that's not a "word character") is my attempt to get rid of punctuation.
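
A quick illustration of what the two splits produce (Python 3, sample strings invented for the example). Both can leave empty strings behind; these are harmless as long as empty lines are skipped and the empty string is not a dictionary keyword.

```python
import re

# Made-up dictionary file content and input text.
raw_dictionary = ">>Positive\ngood\n\n>>Negative\nbad\n"
text = "Good news: the enterprise did not enter, sadly."

# No trailing newline characters, but blank lines become empty strings.
lines = raw_dictionary.split("\n")

# Punctuation and whitespace are dropped; a trailing "." leaves one
# empty string at the end of the list.
words = re.split(r"\W+", text)

print(lines)
print(words)
```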

Below is my proposed code:

import re
import sys
from collections import defaultdict

dictfile = sys.argv[1]
textfile = sys.argv[2]

with open(textfile) as f:
    text = f.read()

with open(dictfile) as f:
    lines = f.read()

categories = {}
scores = defaultdict(int)

current_category = "Default"
scores[current_category] = 0

for line in lines.split('\n'):
    if line.startswith('>>'):
        current_category = line[2:].strip()
    else:
        keyword = line.strip()
        if keyword:
            categories[keyword] = current_category

for word in re.split(r'\W+', text):
    try:
        scores[categories[word]] += 1
    except KeyError:
        # not in dictionary
        pass

for keyword in scores.keys():
    print("{}: {}".format(keyword, scores[keyword]))
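
One thing the plain lookup does not do is ignore case, which the question's code did via re.IGNORECASE. A possible variant, assuming it is acceptable to lower-case both the keywords and the text, is sketched below. The function names load_categories and score_text are my own, not from the question, and the sample dictionary is invented.

```python
import re


def load_categories(dictionary_text):
    """Return (keyword -> category, list of categories in file order)."""
    categories = {}
    seen = ["Default"]
    current = "Default"
    for line in dictionary_text.splitlines():
        line = line.strip()
        if line.startswith(">>"):
            current = line[2:].strip()
            if current not in seen:
                seen.append(current)
        elif line:
            # Lower-case keywords so the lookup is case-insensitive.
            categories[line.lower()] = current
    return categories, seen


def score_text(text, categories, seen):
    """Count whole-word hits per category; categories with no hits stay at 0."""
    scores = {name: 0 for name in seen}
    for word in re.split(r"\W+", text.lower()):
        category = categories.get(word)
        if category is not None:
            scores[category] += 1
    return scores


# Invented sample data for illustration.
dictionary_text = ">>Positive\nGood\nenter\n>>Negative\nbad\n"
categories, seen = load_categories(dictionary_text)
scores = score_text("ENTER the good Enterprise.", categories, seen)
for name in seen:
    print("{}: {}".format(name, scores[name]))
```

Unlike the defaultdict version, this one also reports categories that scored zero, which is closer to what the original script printed.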

2 answers

solution1
0 2016-12-06 13:09:10

solution2
0 2016-12-06 13:41:25