Counting the most common title-case words in a text

Question

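I have to do a task that opens a text file and counts how many times each capitalized word occurs. Then I need to print the top 3 occurrences. The code works until it gets a text file in which a word is repeated within a line.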

Text file 1:

Jellicle Cats are black and white,
Jellicle Cats are rather small;
Jellicle Cats are merry and bright,
And pleasant to hear when they caterwaul.
Jellicle Cats have cheerful faces,
Jellicle Cats have bright black eyes;
They like to practise their airs and graces
And wait for the Jellicle Moon to rise.

Result:

6 Jellicle
5 Cats
2 And

Text file 2:

Baa Baa black sheep have you any wool?
Yes sir Yes sir, wool for everyone.
One for the master, 
One for the dame.
One for the little boy who lives down the lane.

Result:

1 Baa
1 One
1 Yes
1 Baa
1 One
1 Yes
1 Baa
1 One
1 Yes

Here is my code:

wc = {}
t3 = {}
p = 0
xx=0
a = open('novel.txt').readlines()
for i in a:
  b = i.split()
  for l in b:
    if l[0].isupper():
      if l not in wc:
         wc[l] = 1
      else:
        wc[l] += 1
while p < 3:
  p += 1
  max_val=max(wc.values())
  for words in wc:
    if wc[words] == max_val:
      t3[words] = wc[words]
      wc[words] = 1

    else:
      null = 1
while xx < 3:
  xx+=1
  maxval = max(t3.values())
  for word in sorted(t3):
    if t3[word] == maxval:
      print(t3[word],word)
      t3[word] = 1
    else:
      null+=1

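Please help me solve this. Thanks!

Thank you for all the suggestions. After manually debugging the code and using your replies, I found that the loop while xx < 3: was unnecessary, and that the assignment wc[words] = 1 made the program double-count words whenever the third most frequent word occurred only once. By replacing it with wc[words] = 0, I was able to avoid the repeated counting.

Thanks!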
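
For reference, a minimal sketch of the same "pick the current maximum" loop that removes each picked word from the dictionary instead of resetting its count, so no word can be picked twice. The function name top_n_by_removal and the sample counts are illustrative assumptions, not part of the original code.

```python
def top_n_by_removal(counts, n=3):
    """Return up to n (count, word) pairs, highest count first.

    Ties are broken alphabetically, mirroring the sorted() call in the
    question's printing loop.
    """
    remaining = dict(counts)  # work on a copy so the caller's dict is untouched
    picked = []
    while remaining and len(picked) < n:
        max_val = max(remaining.values())
        # sorted() builds a new list, so deleting from `remaining` is safe here.
        for word in sorted(w for w, c in remaining.items() if c == max_val):
            picked.append((max_val, word))
            del remaining[word]
    return picked[:n]


# Counts for the capitalized words in text file 2.
sample_counts = {'Baa': 2, 'Yes': 2, 'One': 3}
for count, word in top_n_by_removal(sample_counts):
    print(count, word)
# prints:
# 3 One
# 2 Baa
# 2 Yes
```

Removing entries avoids having to choose a sentinel value such as 0 or 1, which can collide with real counts once every remaining word has been reset.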

Answer 1

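This is super simple, but you need a few tools:

re.sub, to strip the punctuation
filter, to keep only the title-case words, using str.istitle
collections.Counter, to count the words (first run from collections import Counter).

Assuming text contains your paragraph (the first one), this works: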

In [296]: Counter(filter(str.istitle, re.sub(r'[^\w\s]', '', text).split())).most_common(3)
Out[296]: [('Jellicle', 6), ('Cats', 5), ('And', 2)]

Counter.most_common(x) returns the x most common words.

Incidentally, this is the output for your second paragraph:

[('One', 3), ('Baa', 2), ('Yes', 2)]
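
The one-liner above can be sketched as a small self-contained script. The function name top_title_words and the inline sample string are assumptions made for illustration; in practice the text would come from the file, for example via open('novel.txt').read(). Note that str.istitle only accepts words whose first letter alone is uppercase, so an all-caps token such as "USA" is skipped, unlike the l[0].isupper() check in the question.

```python
import re
from collections import Counter


def top_title_words(text, n=3):
    """Return the n most common title-case words as (word, count) pairs."""
    # Drop every character that is neither a word character nor whitespace.
    cleaned = re.sub(r'[^\w\s]', '', text)
    # Keep only title-case tokens, then count them.
    return Counter(filter(str.istitle, cleaned.split())).most_common(n)


# Text file 2 from the question, inlined so the sketch runs without a file.
sample = (
    "Baa Baa black sheep have you any wool?\n"
    "Yes sir Yes sir, wool for everyone.\n"
    "One for the master, \n"
    "One for the dame.\n"
    "One for the little boy who lives down the lane.\n"
)

for word, count in top_title_words(sample):
    print(count, word)
# prints:
# 3 One
# 2 Baa
# 2 Yes
```

Words with equal counts come back in the order they were first encountered, which is why Baa precedes Yes here.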

Answer 2

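import operator

fname = 'novel.txt'

# Read the whole file and split it into whitespace-separated words.
with open(fname) as fptr:
    words = fptr.read().split()

# Count every word whose first character is uppercase.
data = {}
for word in words:
    if word[0].isupper():
        if word in data:
            data[word] = data[word] + 1
        else:
            data[word] = 1

# Keep the three highest counts; dicts preserve insertion order in Python 3.7+.
valores_ord = dict(sorted(data.items(), key=operator.itemgetter(1), reverse=True)[:3])

for word in valores_ord:
    print(valores_ord[word], word)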
