Counting a word with an apostrophe as one word, but it returns two words (Python)

Question

I am new to programming and I want to write a program that counts the frequency of words in a file. The expected output is as follows:

WORD FREQUENCY

in - 1
many - 1
other - 1
programming - 1
languages - 1
you - 1
would - 1
use - 1
a - 4
type - 1
called - 1
list’s - 1
TOTAL = x

I've almost got it working, but the word "list’s" comes back like this:

list**â**  -  1
s  -  1

which throws off the total word count for the file.

I've been using a regex like this:

match_pattern = re.findall(r"\w+", infile)

Answer 1

I'm guessing that a simple expression with a defaultdict might work:

import re
from collections import defaultdict

regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)

words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match]+=1

print(words_dictionary)

Normal output

['some', 'words', 'before', 'alice', 'and', 'bob', 'Some', 'WOrdS', 'after', 'Then', 'repeat', 'some', 'words', 'before', 'Alice', 'and', 'BOB', 'some', 'words', 'after', 'then', 'repeat']

defaultdict(<class 'int'>, {'some': 3, 'words': 3, 'before': 2, 'alice': 1, 'and': 2, 'bob': 1, 'Some': 1, 'WOrdS': 1, 'after': 2, 'Then': 1, 'repeat': 2, 'Alice': 1, 'BOB': 1, 'then': 1})

Test with `lower()`

import re
from collections import defaultdict

regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)

words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match.lower()]+=1

print(words_dictionary)

Output with `lower()`

defaultdict(<class 'int'>, {'some': 4, 'words': 4, 'before': 2, 'alice': 2, 'and': 2, 'bob': 2, 'after': 2, 'then': 2, 'repeat': 2})

The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it, and in this link you can watch how it matches against some sample inputs. To print each word with its count:

for key,value in words_dictionary.items():
    print(f'{key} - {value}')

Output

some - 4
words - 4
before - 2
alice - 2
and - 2
bob - 2
after - 2
then - 2
repeat - 2

Answer 2

Instead of using:

match_pattern = re.findall(r"\w+", infile)

Try using:

match_pattern = re.findall(r"\S+", infile)

\w means a-z, A-Z, _ and 0-9 (in Python 3 it also matches other Unicode letters and digits).

\S means any non-whitespace character.
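To make the difference concrete, here is a small comparison of the two patterns. The sample sentence is an assumption, pieced together from the words in the question's expected output. It suggests that \S+ keeps "list’s" in one piece, but also keeps any punctuation attached to a word, such as the final period.

```python
import re

# Sample sentence assumed for illustration; note the curly apostrophe (U+2019).
text = "you would use a type called list’s."

# \w+ stops at the apostrophe, so "list’s" is split into "list" and "s".
word_char_tokens = re.findall(r"\w+", text)
print(word_char_tokens)

# \S+ takes everything between whitespace, so the trailing "." stays attached.
non_space_tokens = re.findall(r"\S+", text)
print(non_space_tokens)
```

If trailing punctuation matters for your counts, the tokens from \S+ would still need some cleanup before counting.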

Answer 3

This is a solution that does not use regex.

I am assuming there are multiple sentences in the file. Read the whole content as a single string and use the str.split() function to split it on whitespace. You will get a list of the words in that string.

Next, you can pass that list to collections.Counter to get a dictionary whose keys are the words and whose values are their frequencies.

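from collections import Counter
# Read the whole file into one string.
with open('file.txt') as f:
    a = f.read()
# str.split() has no "by" keyword; with no arguments it splits on any whitespace.
b = dict(Counter(a.split()))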

b is a dictionary of word-frequency pairs.

Note: periods will always stay attached to the last word of a sentence. You can ignore them in the results, or you can remove all periods first and then follow the procedure above. In that case, the '.' used in abbreviations will also be removed, so it may not work the way you want.
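One possible middle ground, sketched below, is to trim punctuation only from the two ends of each token instead of deleting every period. The helper name count_words and the sample text are hypothetical, and lowercasing is an extra assumption borrowed from Answer 1.

```python
import string
from collections import Counter

def count_words(text):
    """Hypothetical helper: whitespace split, then trim punctuation at token edges."""
    # string.punctuation is ASCII only, so curly quotes are added by hand.
    edge_chars = string.punctuation + "‘’“”"
    # str.strip() only removes characters at the ends; inner ones are kept.
    tokens = (tok.strip(edge_chars).lower() for tok in text.split())
    # Skip tokens that were nothing but punctuation.
    return Counter(tok for tok in tokens if tok)

sample = "A list. Use a list’s type, e.g. a list."
counts = count_words(sample)
for word, freq in counts.items():
    print(f"{word} - {freq}")
```

With this approach an apostrophe inside a word survives, while an abbreviation such as "e.g." loses only its final period. A word that legitimately ends in an apostrophe would lose it, so this is a trade-off rather than a complete fix.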

If you still want to use regex and match letters and apostrophes, try r"[a-zA-Z']+" and then use Counter. I will try to post some code for it when I get some time.
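A minimal sketch of that regex-plus-Counter idea follows; it is not the answerer's own code. It swaps in a slightly different pattern, \w+(?:['’]\w+)*, so that both the straight apostrophe and the curly one (U+2019) from the question are accepted inside a word. The names WORD_RE and word_frequencies and the sample sentence are hypothetical. The sketch also reproduces the "listâ" symptom, which looks like the UTF-8 bytes of the curly apostrophe being decoded as cp1252; if that is the cause, opening the file with an explicit encoding, for example open('file.txt', encoding='utf-8'), may be needed as well.

```python
import re
from collections import Counter

# Word characters, optionally joined by a straight or curly apostrophe.
# The group is non-capturing so findall() returns whole matches.
WORD_RE = re.compile(r"\w+(?:['’]\w+)*")

def word_frequencies(text):
    """Hypothetical helper: count lowercased words, keeping inner apostrophes."""
    return Counter(word.lower() for word in WORD_RE.findall(text))

# Reproduce the symptom: the UTF-8 bytes of "’" misread as cp1252 text.
garbled = "list’s".encode("utf-8").decode("cp1252")
garbled_tokens = re.findall(r"\w+", garbled)
print(garbled_tokens)

# Sample text assumed for illustration, built from the question's word list.
sample = "In many other programming languages you would use a type called list’s"
counts = word_frequencies(sample)
for word, freq in counts.items():
    print(f"{word} - {freq}")
print(f"TOTAL = {sum(counts.values())}")
```

Because \w also matches digits and underscores, tokens such as "2019" or "my_var" would be counted as words; narrow the character class if that is not wanted.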

3 answers

Answer 1: 1, accepted, 2019-07-25 05:29:05
Answer 2: 0
Answer 3: 0, 2019-07-25 05:37:22
