Counting a word with an apostrophe as one word, but it is returned as two pieces (Python)

I am new to programming and I want to write a program that counts the frequency of words in a file. The expected output is as follows:

WORD FREQUENCY

in - 1
many - 1
other - 1
programming - 1
languages - 1
you - 1
would - 1
use - 1
a - 4
type - 1
called - 1
list’s - 1
TOTAL = x

I've almost got it working, but the word "list's" comes back like this:

list**â**  -  1
s  -  1

which affects the total word count for the file.

I've been using a regex like this:

match_pattern = re.findall(r"\w+", infile)
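
One possible explanation for the `listâ` / `s` split, offered as an assumption rather than something the question confirms: the file stores a typographic apostrophe (U+2019) as UTF-8, and it was decoded with a legacy codec such as cp1252. The sketch below reproduces that situation in memory; the sample word and the choice of cp1252 are both hypothetical.

```python
import re

# Assumption: the file stores "list's" with a typographic apostrophe (U+2019) in UTF-8.
raw_bytes = "list\u2019s".encode("utf-8")

# Decoding with the wrong codec (cp1252 here) turns the apostrophe into three characters.
misread = raw_bytes.decode("cp1252")
# Decoding with the matching codec keeps the apostrophe as a single character.
correct = raw_bytes.decode("utf-8")

misread_tokens = re.findall(r"\w+", misread)
correct_tokens = re.findall(r"\w+", correct)

print(misread_tokens)  # the stray "â" is a letter, so \w+ keeps it attached to "list"
print(correct_tokens)  # \w+ still splits at the apostrophe when decoded correctly
```

If this is what is happening, passing encoding="utf-8" to open() should remove the stray character, but `\w+` would still split the word at the apostrophe, so the pattern needs changing as well.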

I'm guessing that a simple expression with a defaultdict might work:

import re
from collections import defaultdict

regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)

words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match]+=1

print(words_dictionary)

Normal Output

['some', 'words', 'before', 'alice', 'and', 'bob', 'Some', 'WOrdS', 'after', 'Then', 'repeat', 'some', 'words', 'before', 'Alice', 'and', 'BOB', 'some', 'words', 'after', 'then', 'repeat']

defaultdict(<class 'int'>, {'some': 3, 'words': 3, 'before': 2, 'alice': 1, 'and': 2, 'bob': 1, 'Some': 1, 'WOrdS': 1, 'after': 2, 'Then': 1, 'repeat': 2, 'Alice': 1, 'BOB': 1, 'then': 1})

Test with lower()

import re
from collections import defaultdict

regex = r"(\b\w+\b)"
test_str = "some words before alice and bob Some WOrdS after Then repeat some words before Alice and BOB some words after then repeat"
matches = re.findall(regex, test_str)
print(matches)

words_dictionary = defaultdict(int)
for match in matches:
    words_dictionary[match.lower()]+=1

print(words_dictionary)

Output with lower()

defaultdict(<class 'int'>, {'some': 4, 'words': 4, 'before': 2, 'alice': 2, 'and': 2, 'bob': 2, 'after': 2, 'then': 2, 'repeat': 2})

The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it, and in this link you can watch how it matches against some sample inputs.


for key,value in words_dictionary.items():
    print(f'{key} - {value}')

Output

some - 4
words - 4
before - 2
alice - 2
and - 2
bob - 2
after - 2
then - 2
repeat - 2
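
The counting and printing steps above can be combined into a small sketch that also prints the TOTAL line from the question. The function names `word_frequencies` and `format_report` are hypothetical, and treating TOTAL as the number of tokens (not the number of distinct words) is an assumption.

```python
import re
from collections import defaultdict

def word_frequencies(text):
    """Count lowercased word tokens, keeping first-seen order."""
    counts = defaultdict(int)
    for token in re.findall(r"\w+", text):
        counts[token.lower()] += 1
    return counts

def format_report(counts):
    """Build 'word - n' lines followed by a TOTAL line."""
    lines = [f"{word} - {n}" for word, n in counts.items()]
    # Assumption: TOTAL is the number of tokens, not the number of distinct words.
    lines.append(f"TOTAL = {sum(counts.values())}")
    return "\n".join(lines)

sample = "some words before Alice and BOB some words after"
report = format_report(word_frequencies(sample))
print(report)
```

This still uses `\w+`, so it does not yet keep "list's" together; it only shows the report format.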

Instead of using:

match_pattern = re.findall(r"\w+", infile)

Try using:

match_pattern = re.findall(r"\S+", infile)

\w matches word characters: a-z, A-Z, 0-9 and _ (in Python 3 it also matches other Unicode letters and digits, but never an apostrophe).

\S matches any non-whitespace character.
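
To see the difference between the two patterns, here is a minimal comparison on a made-up sentence (the sample text is an assumption, not taken from the asker's file).

```python
import re

sample = "a type called list's, in many languages."

# \w+ stops at the apostrophe and drops punctuation.
word_tokens = re.findall(r"\w+", sample)
# \S+ keeps everything between whitespace, including trailing punctuation.
nonspace_tokens = re.findall(r"\S+", sample)

print(word_tokens)
print(nonspace_tokens)
```

Note the trade-off: `\S+` keeps "list's" in one piece, but it also leaves commas and periods attached, so "languages." and "languages" would be counted as different words.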

This is a solution that does not use regex.

I am assuming there are multiple sentences in the file. Read the whole content as a single string and split it on whitespace with str.split(). You will get a list of the words in that string.

Next, you can use collections.Counter(list) to get a dictionary whose keys are the words and whose values are their frequencies.

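from collections import Counter
# Read the whole file into one string.
with open('file.txt') as f:
    a = f.read()
# split() with no arguments splits on any run of whitespace (str.split() has no `by` parameter).
b = dict(Counter(a.split()))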

b is a dictionary of word-frequency pairs.

Note: periods will stay attached to the last word of each sentence. You can ignore them in the results, or remove all periods first and then follow the procedure above. In that case the '.' used in abbreviations will also be removed, so it may not work the way you want.
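
One way to handle the punctuation issue without removing every period is to trim punctuation only from the ends of each token. This is a sketch under that assumption; `count_split_words` is a hypothetical helper, and adding the typographic quotes to the trim set is a choice made here, not part of the answer.

```python
import string
from collections import Counter

def count_split_words(text):
    """Split on whitespace, then trim punctuation from both ends of each token."""
    # Assumption: typographic quotes are trimmed along with ASCII punctuation.
    trim_chars = string.punctuation + "\u2018\u2019\u201c\u201d"
    tokens = (token.strip(trim_chars).lower() for token in text.split())
    # Drop tokens that were punctuation only, such as a lone dash.
    return Counter(token for token in tokens if token)

counts = count_split_words("Use a list's type, e.g. a list - not a tuple.")
print(counts)
```

Interior characters are left alone, so "list's" stays whole and "e.g." becomes "e.g". A trailing apostrophe, as in a plural possessive, would be trimmed too.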

If you still want to use regex and match letters and apostrophes, try r"[a-zA-Z']+" and then use Counter. I will try to post some code for it when I get some time.
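
A minimal sketch of the regex-plus-Counter idea follows. It is not the answerer's code: the pattern is a slightly stricter variant of r"[a-zA-Z']+" that only accepts an apostrophe between letters, and the names `WORD_PATTERN`, `count_words`, and `count_words_in_file` are hypothetical. Reading the file as UTF-8 and normalizing the typographic apostrophe to a straight one are assumptions about the input.

```python
import re
from collections import Counter

# Letters, optionally followed by an apostrophe and more letters (e.g. list's).
WORD_PATTERN = re.compile(r"[a-zA-Z]+(?:'[a-zA-Z]+)*")

def count_words(text):
    """Return a Counter of lowercased words, keeping apostrophe words whole."""
    # Assumption: a typographic apostrophe (U+2019) should count as a plain one.
    normalized = text.replace("\u2019", "'").lower()
    return Counter(WORD_PATTERN.findall(normalized))

def count_words_in_file(path, encoding="utf-8"):
    """Read a file with an explicit encoding (UTF-8 is an assumption) and count its words."""
    with open(path, encoding=encoding) as infile:
        return count_words(infile.read())

counts = count_words("You would use a type called list\u2019s, a list's, or 'a' list.")
for word, n in counts.items():
    print(f"{word} - {n}")
print(f"TOTAL = {sum(counts.values())}")
```

Requiring letters on both sides of the apostrophe means a quoted word such as 'a' is counted as a, which the plain r"[a-zA-Z']+" pattern would not do. Words with non-ASCII letters are not matched by this pattern.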
