Count most common titular words in a paragraph of text

I have to do a task where I open a text file, then count how many times each capitalised word occurs. Then I need to print the top 3 occurrences. This piece of code works until it gets a text file with words that are repeated within a line.

txt file 1:

Jellicle Cats are black and white,
Jellicle Cats are rather small;
Jellicle Cats are merry and bright,
And pleasant to hear when they caterwaul.
Jellicle Cats have cheerful faces,
Jellicle Cats have bright black eyes;
They like to practise their airs and graces
And wait for the Jellicle Moon to rise.

Results:

6 Jellicle
5 Cats
2 And

txt file 2:

Baa Baa black sheep have you any wool?
Yes sir Yes sir, wool for everyone.
One for the master, 
One for the dame.
One for the little boy who lives down the lane.

Results:

1 Baa
1 One
1 Yes
1 Baa
1 One
1 Yes
1 Baa
1 One
1 Yes

Here is my code:

wc = {}
t3 = {}
p = 0
xx=0
a = open('novel.txt').readlines()
for i in a:
  b = i.split()
  for l in b:
    if l[0].isupper():
      if l not in wc:
         wc[l] = 1
      else:
        wc[l] += 1
while p < 3:
  p += 1
  max_val=max(wc.values())
  for words in wc:
    if wc[words] == max_val:
      t3[words] = wc[words]
      wc[words] = 1

    else:
      null = 1
while xx < 3:
  xx+=1
  maxval = max(t3.values())
  for word in sorted(t3):
    if t3[word] == maxval:
      print(t3[word],word)
      t3[word] = 1
    else:
      null+=1

Please help me solve this. Thank you!

Thank you for all the suggestions. After manually debugging the code, as well as using your responses, I was able to figure out that while xx < 3: was unnecessary, and that wc[words] = 1 ended up making the program double count the words if the third most common word occurred only once. By replacing it with wc[words] = 0 I was able to avoid having a counting loop.

Thank you!
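For reference, the double counting in the original code happens because a word that has already been picked stays in wc with a count of 1, so once the largest remaining count drops to 1 it is picked again and its stored count in t3 is overwritten. One way to sidestep the reset trick, sketched here rather than taken from the thread, is to delete each word from the dictionary as soon as it is picked. The counts below are those of the second text file; breaking ties alphabetically is an assumption, since the question does not say how ties should be ordered.

```python
# Counts of capitalised words for the second text file in the question.
wc = {'Baa': 2, 'Yes': 2, 'One': 3}

top3 = []
for _ in range(min(3, len(wc))):
    # Highest count first; among equal counts, the alphabetically first word
    # (the alphabetical tie-break is an assumption, not a stated requirement).
    best = min(wc, key=lambda w: (-wc[w], w))
    top3.append((wc[best], best))
    del wc[best]  # a deleted word cannot be picked a second time

for count, word in top3:
    print(count, word)
```

Because picked words are removed instead of being reset, no second loop over a separate dictionary is needed.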

This is super simple. But you'll need a few tools.

  1. re.sub, to get rid of punctuation

  2. filter, to keep only the title-case words, using str.istitle

  3. collections.Counter, to count words (do from collections import Counter first).
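To see what each of the three tools contributes, here is a small sketch that applies them one at a time. The sample line is borrowed from the second text file purely for illustration, and the variable names are made up for this example.

```python
import re
from collections import Counter

# Sample line, used only to show each step in isolation.
line = "Yes sir Yes sir, wool for everyone."

# 1. Drop every character that is neither a word character nor whitespace.
no_punct = re.sub(r'[^\w\s]', '', line)

# 2. Keep only the title-case words.
title_words = list(filter(str.istitle, no_punct.split()))

# 3. Count the remaining words.
counts = Counter(title_words)

print(no_punct)     # Yes sir Yes sir wool for everyone
print(title_words)  # ['Yes', 'Yes']
print(counts)       # Counter({'Yes': 2})
```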


Assuming text holds your para (first one), this works:

In [296]: Counter(filter(str.istitle, re.sub(r'[^\w\s]', '', text).split())).most_common(3)
Out[296]: [('Jellicle', 6), ('Cats', 5), ('And', 2)]

Counter.most_common(x) returns the x most common words.
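If you want the result printed in the "count word" format from the question, the expression can be wrapped in a small function. This is only a sketch: the function name top_title_words and the two-line sample are made up for illustration.

```python
import re
from collections import Counter


def top_title_words(text, n=3):
    """Return the n most common title-case words as (word, count) pairs."""
    words = re.sub(r'[^\w\s]', '', text).split()
    return Counter(filter(str.istitle, words)).most_common(n)


# Two lines from the first text file, used here as a small sample.
sample = "Jellicle Cats are black and white,\nJellicle Cats are rather small;"

# For the real task, read the text from the file instead of using the sample.
result = top_title_words(sample)
for word, count in result:
    print(count, word)
```

Words with equal counts come back in the order they were first seen, so sort the result yourself if ties need a different order.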

Incidentally, this is the output for your second para:

[('One', 3), ('Baa', 2), ('Yes', 2)]

import operator

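fname = 'novel.txt'
# Read the whole file; the with-block closes it afterwards.
with open(fname) as fptr:
    x = fptr.read()
words = x.split()
data = {}  # capitalised word -> number of occurrences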

for word in words:
    if word[0].isupper():
        if word in data:
            data[word] = data[word] + 1
        else:
            data[word] = 1

valores_ord = dict(sorted(data.items(), key=operator.itemgetter(1), reverse=True)[:3])

for word in valores_ord:
    print(valores_ord[word],word)
