
Counting a word with an apostrophe as one word, but it is returned as two pieces (Python)

I am new to programming and I want to write a program that counts the frequency of words in a file. The expected output is as follows:

WORD FREQUENCY

in - 1
many - 1
other - 1
programming - 1
languages - 1
you - 1
would - 1
use - 1
a - 4
type - 1
called - 1
list’s - 1
TOTAL = x

I've almost got it working, but the word "list's" returns something like this:

listâ - 1
s - 1

This also throws off the total number of words counted for the file.

I've been using regex like this:

match_pattern = re.findall(r"\w+", infile)
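One plausible cause of the split, sketched here under assumptions that are not stated in the question: the file is UTF-8, the apostrophe is the curly character U+2019, and the file was opened with a single-byte default encoding such as cp1252. The sample line and the variable names are hypothetical.

```python
import re

# Hypothetical sample line; the apostrophe is U+2019 (right single quotation mark).
text = "a type called list\u2019s"

# Simulate reading UTF-8 bytes with the wrong codec (cp1252 is an assumption).
garbled = text.encode("utf-8").decode("cp1252")

garbled_tokens = re.findall(r"\w+", garbled)
clean_tokens = re.findall(r"\w+", text)

print(garbled_tokens)  # the last word comes back as two tokens, the first ending in 'â'
print(clean_tokens)    # still two tokens: \w does not match U+2019 either
```

If that is what is happening, opening the file with open(path, encoding="utf-8") should remove the stray "â", but the pattern still has to allow the apostrophe for "list’s" to count as one word.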

I'm guessing that a simple expression with a defaultdict might work:

import re
from collections import defaultdict

regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)

words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match]+=1

print(words_dictionary)

Normal Output

['some', 'words', 'before', 'alice', 'and', 'bob', 'Some', 'WOrdS', 'after', 'Then', 'repeat', 'some', 'words', 'before', 'Alice', 'and', 'BOB', 'some', 'words', 'after', 'then', 'repeat']

defaultdict(<class 'int'>, {'some': 3, 'words': 3, 'before': 2, 'alice': 1, 'and': 2, 'bob': 1, 'Some': 1, 'WOrdS': 1, 'after': 2, 'Then': 1, 'repeat': 2, 'Alice': 1, 'BOB': 1, 'then': 1})

Test with lower()

import re
from collections import defaultdict

regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)

words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match.lower()]+=1

print(words_dictionary)

Output with lower()

defaultdict(<class 'int'>, {'some': 4, 'words': 4, 'before': 2, 'alice': 2, 'and': 2, 'bob': 2, 'after': 2, 'then': 2, 'repeat': 2})

The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it. There you can also watch how it matches against some sample inputs, if you like.


for key,value in words_dictionary.items():
    print(f'{key} - {value}')

Output

some - 4
words - 4
before - 2
alice - 2
and - 2
bob - 2
after - 2
then - 2
repeat - 2

Instead of using:

match_pattern = re.findall(r"\w+", infile)

Try using:

match_pattern = re.findall(r"\S+", infile)

\w matches a-z, A-Z, 0-9 and _ (in Python 3 it also matches other Unicode letters and digits).

\S matches any non-whitespace character.
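A small comparison of the two patterns, as a sketch on a made-up sample line (the sentence and the variable names are assumptions for illustration):

```python
import re

# Hypothetical sample line; the apostrophe is U+2019, as in the expected output.
sample = "you would use a type called list\u2019s."

w_tokens = re.findall(r"\w+", sample)
s_tokens = re.findall(r"\S+", sample)

print(w_tokens)  # the apostrophe splits the last word into 'list' and 's'
print(s_tokens)  # the last word stays whole, but the trailing period stays attached
```

So \S+ keeps the word in one piece, at the cost of leaving punctuation glued to neighbouring words.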

This is a solution that does not use regex.

I am assuming there are multiple sentences in the file. Read the whole content as a single string and call str.split() on it to split on whitespace. You will get a list of the words in that string.

Next, you can pass that list to collections.Counter to get a dictionary-like object whose keys are the words and whose values are their frequencies.

from collections import Counter

with open('file.txt') as f:
    a = f.read()  # whole file as one string
# split() with no argument splits on any run of whitespace
b = dict(Counter(a.split()))

b is a dictionary with the word-frequency pairs.

Note: periods stay attached to the last word of each sentence. You can ignore them in the results, or you can remove all periods first and then follow the procedure above. That also removes the '.' used in abbreviations, so it may not work the way you want.
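A middle ground, sketched here as an assumption rather than as part of the answer above, is to strip punctuation only from the ends of each token, so that periods inside abbreviations survive. The sample text is made up.

```python
import string
from collections import Counter

# Hypothetical text; the apostrophe in the fifth word is U+2019.
text = "Use a list. A list\u2019s type is e.g. a list."

# Strip ASCII punctuation from both ends of each token; characters inside a word stay.
words = [w.strip(string.punctuation).lower() for w in text.split()]
words = [w for w in words if w]  # drop tokens that were only punctuation

counts = Counter(words)
print(counts)
```

Here "list." and "list" are counted together, while "e.g." keeps its inner period and is counted as "e.g".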

If you still want to use regex and match letters and apostrophe, try r"[a-zA-Z']+" and then use Counter. I will try to post some code for it when I get some time.
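A minimal sketch of that regex idea with Counter, printed in the format from the question. Note one assumption that goes beyond the pattern suggested above: the character class also includes the curly apostrophe U+2019, since that appears to be the character in "list’s". The sample text is hypothetical.

```python
import re
from collections import Counter

# Hypothetical input; in practice this would be the text read from the file.
text = "In many other programming languages you would use a type called list\u2019s"

# Letters plus the ASCII apostrophe, as suggested above.
# Adding U+2019 (the curly apostrophe) is an assumption about the input.
pattern = "[a-zA-Z'\u2019]+"

words = re.findall(pattern, text.lower())
counts = Counter(words)

print("WORD FREQUENCY")
for word, freq in counts.items():
    print(f"{word} - {freq}")
print(f"TOTAL = {sum(counts.values())}")
```

Lower-casing before matching means "In" and "in" are counted together; drop the .lower() call if case should be preserved.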


 