Count the word with apostrophe as one word BUT returns two pieces of words (python)
I am new to programming and I want to write a program that counts the frequency of words in a file. The expected output is as follows:
WORD FREQUENCY
in - 1
many - 1
other - 1
programming - 1
languages - 1
you - 1
would - 1
use - 1
a - 4
type - 1
called - 1
list’s - 1
TOTAL = x
I've almost got it working, but the word "list's" comes back like this:
listâ - 1
s - 1
which affects the total number of words counted in the file.
I've been using a regex like this:
match_pattern = re.findall(r"\w+", infile)
I'm guessing that a simple expression with a defaultdict might work:
import re
from collections import defaultdict
regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)
words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match] += 1
print(words_dictionary)
['some', 'words', 'before', 'alice', 'and', 'bob', 'Some', 'WOrdS', 'after', 'Then', 'repeat', 'some', 'words', 'before', 'Alice', 'and', 'BOB', 'some', 'words', 'after', 'then', 'repeat']
defaultdict(<class 'int'>, {'some': 3, 'words': 3, 'before': 2, 'alice': 1, 'and': 2, 'bob': 1, 'Some': 1, 'WOrdS': 1, 'after': 2, 'Then': 1, 'repeat': 2, 'Alice': 1, 'BOB': 1, 'then': 1})
Test with lower():
import re
from collections import defaultdict
regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)
words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match.lower()] += 1
print(words_dictionary)
Output with lower():
defaultdict(<class 'int'>, {'some': 4, 'words': 4, 'before': 2, 'alice': 2, 'and': 2, 'bob': 2, 'after': 2, 'then': 2, 'repeat': 2})
The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it; there you can also watch how it matches against some sample inputs. To print the counts in the format from the question:
for key,value in words_dictionary.items():
    print(f'{key} - {value}')
some - 4
words - 4
before - 2
alice - 2
and - 2
bob - 2
after - 2
then - 2
repeat - 2
Instead of using:
match_pattern = re.findall(r"\w+", infile)
Try using:
match_pattern = re.findall(r"\S+", infile)
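\w matches word characters: a-z, A-Z, 0-9, and _ (for str patterns in Python 3 it also matches other Unicode letters and digits). An apostrophe is not a word character, so \w+ stops there.
\S matches any non-whitespace character.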
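To make the difference concrete, here is a small sketch on the word from the question. The sample string is made up for illustration. The last part is only a guess about the asker's file: the "listâ" in the reported output is consistent with a UTF-8 curly apostrophe being decoded as cp1252.

```python
import re

# Hypothetical sample text using a straight apostrophe (').
sample = "a type called list's"

# \w+ stops at the apostrophe, so "list's" comes back as two pieces.
word_pieces = re.findall(r"\w+", sample)
print(word_pieces)   # ['a', 'type', 'called', 'list', 's']

# \S+ keeps every run of non-whitespace characters together.
whole_tokens = re.findall(r"\S+", sample)
print(whole_tokens)  # ['a', 'type', 'called', "list's"]

# Assumption: the file held a curly apostrophe (’) saved as UTF-8 but read
# as cp1252, which turns ’ into the three characters â, € and ™.
garbled = "list’s".encode("utf-8").decode("cp1252")
garbled_pieces = re.findall(r"\w+", garbled)
print(garbled_pieces)  # ['listâ', 's']
```

Keep in mind that \S+ also keeps trailing punctuation attached, so "list's." and "list's" would be counted as different words.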
This is a solution that does not use regex.
I am assuming there are multiple sentences in the file. Read the whole content as a single string and call str.split() to split it on whitespace. You will get a list of the words in that string.
Next you can use collections.Counter(list) to get a dictionary whose keys are the words and whose values are their frequencies.
from collections import Counter
with open('file.txt') as f:
    a = f.read()  # read the whole file as one string
b = dict(Counter(a.split()))  # split() with no arguments splits on any whitespace
b is a dictionary of word-frequency pairs.
Note - Periods will always stay attached to the last word of a sentence. You can ignore them in the results, or you can remove all periods first and then follow the procedure above. In that case the '.' used in abbreviations will also be removed, so it may not work the way you want.
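One possible middle ground, sketched here as a different technique from the two options in the note, is to strip punctuation only from the ends of each token. The sample text is hypothetical, and lowercasing is an added assumption so that "A" and "a" count together.

```python
import string
from collections import Counter

# Hypothetical sample text with sentence-ending periods and an abbreviation.
text = "I like lists. A list's length varies, e.g. from zero up."

counts = Counter()
for token in text.lower().split():
    # Remove punctuation from both ends only; inner characters are kept.
    word = token.strip(string.punctuation)
    if word:  # skip tokens that were nothing but punctuation
        counts[word] += 1

for word, frequency in counts.items():
    print(f"{word} - {frequency}")
```

With this approach "lists." is counted as "lists" and "list's" stays intact, but "e.g." becomes "e.g" because its trailing period is stripped. Since string.punctuation includes the straight apostrophe, a word ending in one (such as "lists'") would lose it as well.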
If you still want to use regex and match letters and apostrophes, try r"[a-zA-Z']+" and then use Counter. I will try to post some code for it when I get some time.
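A minimal sketch of that idea might look like the following. It is not the answerer's own code: the helper name count_words, the sample file, and the lowercasing are all assumptions made for illustration. The character class also includes the curly apostrophe (’) because the expected output in the question uses one, and the file is read as UTF-8 on the assumption that this is how it was saved.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

# Letters plus the straight (') and curly (’) apostrophes. Including the
# curly one is an assumption based on the expected output in the question.
WORD_PATTERN = re.compile(r"[a-zA-Z'’]+")

def count_words(path):
    """Return a Counter of lowercased words found in the file at `path`."""
    # Assumes the file is UTF-8; reading it with another encoding can
    # garble the curly apostrophe and split the word again.
    text = Path(path).read_text(encoding="utf-8")
    return Counter(WORD_PATTERN.findall(text.lower()))

# Hypothetical sample file, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text("A type called list’s. A list is a type.",
                           encoding="utf-8")
    frequencies = count_words(sample_path)

for word, frequency in frequencies.items():
    print(f"{word} - {frequency}")
print(f"TOTAL = {sum(frequencies.values())}")
```

Because the pattern treats apostrophes like letters, text wrapped in single quotes would keep its quote marks, so some extra trimming may be needed for such input.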