Counting a word with an apostrophe as one word, but getting two words back (Python)

Question

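I'm new to programming and I want to write a program that counts how often each word occurs in a file. The expected output looks like this:

Word frequency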

in - 1
many - 1
other - 1
programming - 1
languages - 1
you - 1
would - 1
use - 1
a - 4
type - 1
called - 1
list’s - 1
TOTAL = x

I almost have it working, but the word "list’s" comes back like this:

list**â**  -  1
s  -  1

This throws off the total word count for the file.

I have been using a regular expression like this:

match_pattern = re.findall(r"\w+", infile)

Answer 1

I'm guessing a simple expression combined with a defaultdict might work:

import re
from collections import defaultdict

regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)

words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match]+=1

print(words_dictionary)

Normal output

['some', 'words', 'before', 'alice', 'and', 'bob', 'Some', 'WOrdS', 'after', 'Then', 'repeat', 'some', 'words', 'before', 'Alice', 'and', 'BOB', 'some', 'words', 'after', 'then', 'repeat']

defaultdict(<class 'int'>, {'some': 3, 'words': 3, 'before': 2, 'alice': 1, 'and': 2, 'bob': 1, 'Some': 1, 'WOrdS': 1, 'after': 2, 'Then': 1, 'repeat': 2, 'Alice': 1, 'BOB': 1, 'then': 1})

Test with `lower()`

import re
from collections import defaultdict

regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)

words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match.lower()]+=1

print(words_dictionary)

Output with `lower()`

defaultdict(<class 'int'>, {'some': 4, 'words': 4, 'before': 2, 'alice': 2, 'and': 2, 'bob': 2, 'after': 2, 'then': 2, 'repeat': 2})

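The expression is explained in the top right panel of regex101.com, if you wish to explore, simplify, or modify it, and in this link you can watch how it matches against some sample inputs.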

for key,value in words_dictionary.items():
    print(f'{key} - {value}')

Output

some - 4
words - 4
before - 2
alice - 2
and - 2
bob - 2
after - 2
then - 2
repeat - 2
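The pattern above still splits at apostrophes, so on its own it would not keep `list’s` from the question together. Below is a minimal sketch of one possible adjustment, assuming a word may contain a straight (') or curly (’) apostrophe between word characters. The pattern, the `count_words` helper, and the sample sentence are illustrative and not part of the original answer.

```python
import re
from collections import defaultdict

# Assumed pattern: runs of word characters, optionally joined by ' or ’
WORD_RE = re.compile(r"\w+(?:['’]\w+)*")

def count_words(text):
    """Return a dict mapping each lowercased word to its frequency."""
    counts = defaultdict(int)
    for word in WORD_RE.findall(text):
        counts[word.lower()] += 1
    return counts

# Hypothetical sample text, loosely based on the words in the question
sample = "In many other programming languages you would use a type called list’s"
counts = count_words(sample)

for word, n in counts.items():
    print(f"{word} - {n}")
print(f"TOTAL = {sum(counts.values())}")
```

Because the apostrophe is only accepted between two runs of word characters, a leading or trailing quote mark is not glued onto the word.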

Answer 2

Instead of using:

match_pattern = re.findall(r"\w+", infile)

try using:

match_pattern = re.findall(r"\S+", infile)

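`\w` matches a-z, A-Z, 0-9 and _ (for `str` patterns in Python 3 it also matches other Unicode letters and digits, unless `re.ASCII` is set).

`\S` matches any non-whitespace character.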
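A small comparison may make the trade-off clearer. This is only a sketch on a made-up sentence (the `text` value is an assumption, not taken from the question's file):

```python
import re

# Hypothetical sample sentence
text = "a type called list's, or similar."

# \w+ stops at the apostrophe, so "list's" becomes two tokens
word_tokens = re.findall(r"\w+", text)
print(word_tokens)

# \S+ keeps the apostrophe, but also keeps adjacent punctuation
chunk_tokens = re.findall(r"\S+", text)
print(chunk_tokens)
```

So `\S+` fixes the apostrophe split, but commas and periods stay attached to the neighbouring word and would need to be stripped before counting.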

Answer 3

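Here is a solution that does not use regular expressions.

I am assuming the file contains several sentences. Read the whole content into a single string and split it on whitespace with the str.split() function. That gives you a list of the words in the string.

Next, you can use collections.Counter(list) to get a dictionary whose keys are the words and whose values are their frequencies.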

from collections import Counter

# Read the whole file into one string
with open('file.txt') as f:
    a = f.read()

# split() with no arguments splits on any run of whitespace
b = dict(Counter(a.split()))

b is a dictionary of word-frequency pairs.

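Note - a period will always stay attached to the last word of a sentence. You can either ignore those entries in the result, or remove all periods first and then run the procedure above. In that case the '.' used in abbreviations is removed as well, so it may not work exactly the way you want.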
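One way to soften the period problem, sketched here as an assumption rather than as part of the original answer, is to strip punctuation only from the ends of each token. `string.punctuation` covers ASCII punctuation only, and the sample sentence is made up.

```python
import string
from collections import Counter

# Hypothetical sample text
text = "Use a list. A list's items can change, e.g. by appending."

# Strip punctuation from both ends of each token; apostrophes and periods
# in the middle of a token (list's, e.g) are left alone
words = [w.strip(string.punctuation).lower() for w in text.split()]

# Drop tokens that consisted of punctuation only
words = [w for w in words if w]

freq = Counter(words)
for word, n in freq.items():
    print(f"{word} - {n}")
```

Abbreviations still lose their final period ("e.g." becomes "e.g"), but the inner one survives, which is usually closer to what is wanted than deleting every '.'.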

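If you still want to use a regular expression and match both letters and apostrophes, try r"[a-zA-Z']+" and then use Counter. I will try to post some code for it when I have time.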
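A rough sketch of that regex-plus-Counter idea follows. It makes two assumptions that are not in the answer. First, the character class is widened to include the curly apostrophe (’), since that is the character in the question's `list’s`. Second, the `listâ` in the question looks like the result of UTF-8 text being decoded as cp1252, so the sketch simulates that to show the effect. The `word_frequencies` helper is a hypothetical name.

```python
import re
from collections import Counter

# Assumption: the class from the answer, widened with the curly apostrophe
WORD_RE = re.compile(r"[a-zA-Z'’]+")

def word_frequencies(text):
    """Count lowercased words, keeping apostrophes inside them."""
    return Counter(WORD_RE.findall(text.lower()))

# Simulate the suspected cause: UTF-8 bytes decoded with the wrong codec
raw = "a type called list’s".encode("utf-8")
garbled = raw.decode("cp1252")
garbled_tokens = re.findall(r"\w+", garbled)
print(garbled_tokens)  # tokens produced from the mis-decoded text

# Decoding as UTF-8 keeps the apostrophe intact
decoded = raw.decode("utf-8")
freq = word_frequencies(decoded)
for word, n in freq.items():
    print(f"{word} - {n}")
print(f"TOTAL = {sum(freq.values())}")
```

When reading from a real file, the analogous step would be open('file.txt', encoding='utf-8'), assuming the file really is UTF-8. Note that this character class would also count a stray quote mark on its own as a word, so quoted text may need extra cleanup.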

3 answers

Answer 1: 1 vote, accepted, 2019-07-25 05:29:05
Answer 2: 0 votes
Answer 3: 0 votes, 2019-07-25 05:37:22
