
Python common word finder

I have a small program that reads a text file and displays how many times each word is used. Instead of counting words, it prints the most commonly used letters, and I don't understand what the problem is.

import re
from collections import Counter

words = re.findall(r'\w', open('words.txt').read().lower())
count = Counter(words).most_common(8)
print(count)

I hope this helps. This answer uses a regular expression and goes through the file word by word.

import re

with open("words.txt") as f:
    for line in f:
        # \w+ matches a run of word characters, i.e. one whole word
        for word in re.findall(r'\w+', line):
            print(word)  # handle one word at a time
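
To make the difference between the two patterns concrete, here is a minimal sketch that runs both on an in-memory string instead of a file. The sample sentence and the variable names are made up for illustration.

```python
import re

text = "The cat saw the other cat."

# \w matches one word character at a time, so every letter is its own match.
letters = re.findall(r"\w", text.lower())

# \w+ matches a run of one or more word characters, i.e. a whole word.
words = re.findall(r"\w+", text.lower())

print(letters[:6])
print(words)
```

Feeding `letters` to `Counter` gives letter frequencies, which is what the program in the question does; feeding it `words` gives word frequencies.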

If your data has no quotes around it and you just want one word at a time (ignoring the difference between spaces and line breaks in the file), try this:

with open('words.txt', 'r') as f:
    for line in f:
        for word in line.split():
            print(word)

import string
from collections import Counter

words = open('words.txt').read().lower()
# strip punctuation, then split on whitespace
words = words.translate(str.maketrans('', '', string.punctuation)).split()
count = Counter(words).most_common(8)
print(count)

In a regex, \w matches a single word character (a letter, digit, or underscore), not a whole word. You can get a list of words with:

words = open('words.txt').read().lower().split()

And then do what you were already doing:

count = Counter(words).most_common(8)
print(count)

I think that should suffice; tell me if it isn't working.
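
As a rough illustration of the whitespace-splitting approach, the sketch below works on a made-up string rather than on words.txt. It assumes that calling split() on the text with no argument is what is wanted, since that splits on any run of whitespace.

```python
from collections import Counter

text = "the cat\nsat on  the mat"

# split() with no argument splits on any run of whitespace,
# including newlines and repeated spaces.
words = text.lower().split()

print(words)
print(Counter(words).most_common(1))
```

Note that plain splitting leaves punctuation attached, so "cat." and "cat" would be counted as different words.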

Assuming you have the following text file:

Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

And you want to calculate word frequency:

import operator

with open('text.txt') as f:
    words = f.read().split()

result = {}
for word in words:
    result[word] = words.count(word)

result = sorted(result.items(), key=operator.itemgetter(1), reverse=True)
print(result)

You'll get a list of words with the number of occurrences of each, sorted in descending order:

[('in', 3), ('dolor', 2), ('ut', 2), ('dolore', 2), ('Lorem', 1), ('ipsum', 1), ...
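
The loop above calls words.count() once per word, so it rescans the whole list each time. A possible alternative, sketched here on a short made-up string rather than text.txt, is to let collections.Counter tally everything in a single pass. This swaps in a different technique from the one in the answer.

```python
from collections import Counter

text = "ut labore et dolore magna ut enim dolore ut"

# Counter tallies every word in one pass over the list.
counts = Counter(text.split())

# most_common() with no argument returns all (word, count) pairs,
# sorted from most to least frequent.
result = counts.most_common()

print(result[:2])
```

The result has the same shape as the sorted list above: a list of (word, count) tuples in descending order of count.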
