Counting the most common title-case words in a text

Question

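I have a task where I open a text file and count how many times each capitalized word occurs. Then I need to print the top 3 occurrences. This code works until it is given a text file in which a word appears more than once on the same line.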

Text file 1:

Jellicle Cats are black and white,
Jellicle Cats are rather small;
Jellicle Cats are merry and bright,
And pleasant to hear when they caterwaul.
Jellicle Cats have cheerful faces,
Jellicle Cats have bright black eyes;
They like to practise their airs and graces
And wait for the Jellicle Moon to rise.

Result:

6 Jellicle
5 Cats
2 And

Text file 2:

Baa Baa black sheep have you any wool?
Yes sir Yes sir, wool for everyone.
One for the master, 
One for the dame.
One for the little boy who lives down the lane.

Result:

1 Baa
1 One
1 Yes
1 Baa
1 One
1 Yes
1 Baa
1 One
1 Yes

Here is my code:

wc = {}
t3 = {}
p = 0
xx=0
a = open('novel.txt').readlines()
for i in a:
  b = i.split()
  for l in b:
    if l[0].isupper():
      if l not in wc:
         wc[l] = 1
      else:
        wc[l] += 1
while p < 3:
  p += 1
  max_val=max(wc.values())
  for words in wc:
    if wc[words] == max_val:
      t3[words] = wc[words]
      wc[words] = 1

    else:
      null = 1
while xx < 3:
  xx+=1
  maxval = max(t3.values())
  for word in sorted(t3):
    if t3[word] == maxval:
      print(t3[word],word)
      t3[word] = 1
    else:
      null+=1

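Please help me solve this. Thanks!

Thank you for all your suggestions. After manually debugging the code and using your replies, I found that while xx < 3: was unnecessary, and that wc[words] = 1 ended up making the program double count words whenever the third most common word occurred only once. By replacing it with wc[words] = 0, I was able to avoid the repeated counting.

Thanks!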
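For readers tracing the failure, here is a minimal sketch of what happens on text file 2. It assumes the counts the question's first loop produces for that file (Baa 2, Yes 2, One 3). The helper names select_by_reset and top_three are made up for illustration. On the third pass every reset word ties at 1, so the stored counts are overwritten. Ranking the items without mutating the dictionary avoids the sentinel value entirely.

```python
def select_by_reset(counts, passes=3):
    # Hypothetical helper mirroring the first while loop in the question.
    wc = dict(counts)  # work on a copy so the caller's dict is untouched
    t3 = {}
    for _ in range(passes):
        max_val = max(wc.values())
        for word in wc:
            if wc[word] == max_val:
                t3[word] = wc[word]
                wc[word] = 1  # the reset used in the question
    return t3


def top_three(counts):
    # Non-mutating alternative: highest count first, ties in alphabetical order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:3]


# Capitalized-word counts for text file 2 (assumed from the question's input).
counts = {"Baa": 2, "Yes": 2, "One": 3}

clobbered = select_by_reset(counts)
ranked = top_three(counts)

print(clobbered)  # every stored count has been overwritten with 1
print(ranked)     # [('One', 3), ('Baa', 2), ('Yes', 2)]
```

Because the sort key is (-count, word), the order among tied words is deterministic and does not depend on the order in which they were first seen.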

Answer 1

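This is super easy. But you'll need a few tools.

re.sub, to get rid of punctuation (import re first)
filter, to keep only the title-case words, using str.istitle
collections.Counter, to count the words (do from collections import Counter first).

Assuming text holds your paragraph (the first one), this works: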

In [296]: Counter(filter(str.istitle, re.sub(r'[^\w\s]', '', text).split())).most_common(3)
Out[296]: [('Jellicle', 6), ('Cats', 5), ('And', 2)]

Counter.most_common(x) returns the x most common words, each paired with its count.

Incidentally, this is the output for your second paragraph:

[('One', 3), ('Baa', 2), ('Yes', 2)]
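The one-liner above can be wrapped into a self-contained script. This is a minimal sketch, assuming Python 3.7 or later. The function name top_title_words and the variable rhyme are made up for illustration, and the text is the second file from the question.

```python
import re
from collections import Counter


def top_title_words(text, n=3):
    # Remove everything that is neither a word character nor whitespace,
    # split on whitespace, keep title-case tokens, and count them.
    words = re.sub(r"[^\w\s]", "", text).split()
    return Counter(filter(str.istitle, words)).most_common(n)


rhyme = """Baa Baa black sheep have you any wool?
Yes sir Yes sir, wool for everyone.
One for the master,
One for the dame.
One for the little boy who lives down the lane.
"""

result = top_title_words(rhyme)
print(result)  # [('One', 3), ('Baa', 2), ('Yes', 2)]
```

Two details are worth knowing. str.istitle is stricter than testing only the first character: tokens such as "USA" or "McDonald" are not title-case and would be skipped. Also, most_common lists words with equal counts in the order they were first encountered, which is why Baa comes before Yes here.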

Answer 2

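import operator

fname = 'novel.txt'

# Read the whole file and split it on whitespace.
with open(fname) as fptr:
    words = fptr.read().split()

# Count every word whose first character is uppercase.
data = {}
for word in words:
    if word[0].isupper():
        if word in data:
            data[word] = data[word] + 1
        else:
            data[word] = 1

# Keep the three highest counts (dicts preserve insertion order in Python 3.7+).
valores_ord = dict(sorted(data.items(), key=operator.itemgetter(1), reverse=True)[:3])

for word in valores_ord:
    print(valores_ord[word], word)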
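One thing the loop above does not handle is punctuation attached to a word: str.split leaves "Cats," and "Cats" as different keys. Here is a minimal sketch of one way around that, assuming it is acceptable to strip leading and trailing punctuation with string.punctuation. The function name count_capitalized and the sample string are made up for illustration.

```python
import string


def count_capitalized(text):
    # Count words whose first character is uppercase, ignoring any
    # punctuation stuck to the start or end of each token.
    counts = {}
    for token in text.split():
        word = token.strip(string.punctuation)
        if word and word[0].isupper():
            counts[word] = counts.get(word, 0) + 1
    return counts


sample = "Cats, cats and Cats. Moon? Moon!"
counts = count_capitalized(sample)

# Highest counts first; sorted() is stable, so ties keep first-seen order.
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)  # [('Cats', 2), ('Moon', 2)]
```

The `if word` check guards against tokens that consist only of punctuation, which would otherwise raise an IndexError on word[0].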

Answer 1: score 4, accepted, 2017-08-05 05:05:50

Answer 2: score 0, 2020-06-18 15:01:20
