Showing the Word Count for Each Word

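I'm having trouble producing the top 15 word counts (the count for each word) for the document Wuthering Heights ( https://www.gutenberg.org/files/768/768.txt ) on Google Colab. It should only include the words that start after "ccx074@pglaf.org" and end before "END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS". Here is the code I tried.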

file = open('768.txt','r+')
wordcount = {}
for word in file.read().split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] +=1
for k,v in wordcount.items():
    print(k,v)

You can use a regular expression to find the substring you need:

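import re

start = 'ccx074@pglaf.org'
end = 'END OF THE PROJECT GUTENBERG EBOOK WUTHERING HEIGHTS'

# read the whole file and close it again
with open('768.txt', 'r') as file:
    text = file.read()

# re.escape keeps the '.' in the markers literal;
# re.S lets '.' in the pattern match newlines as well
m = re.findall(re.escape(start) + '(.*?)' + re.escape(end), text, flags=re.S)[0]

wordcount = {}
for word in m.split():
    if word not in wordcount:
        wordcount[word] = 1
    else:
        wordcount[word] += 1
for k, v in wordcount.items():
    print(k, v)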

Sample output:

WUTHERING 1
HEIGHTS 1
CHAPTER 34
I 3215
1801.--I 1
have 594
just 72
returned 39
from 476
...
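
The marker-based extraction can be tried on a small in-memory string before running it on the full book. The sketch below is only an illustration, not the answer's exact code: the helper name `extract_between` and the sample text are made up for the example. It uses `re.escape` so that the `.` in the e-mail marker is matched literally, and `re.S` so that `.` in the pattern also matches newlines.

```python
import re

def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`, or None."""
    # Hypothetical helper for illustration only.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    match = re.search(pattern, text, flags=re.S)
    return match.group(1) if match else None

# Made-up sample standing in for the Gutenberg file
sample = "header\nccx074@pglaf.org\nI have just\nreturned\nEND OF BOOK\nfooter"
body = extract_between(sample, 'ccx074@pglaf.org', 'END OF BOOK')
print(body.split())
```

Returning `None` when a marker is missing avoids the `IndexError` that indexing an empty `findall` result with `[0]` would raise.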

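However, you can also count the words with the standard library's collections.Counter instead of a hand-written loop. For example: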

from collections import Counter
print(Counter(m.split()))

#Counter({'the': 4273, 'and': 4189, 'to': 3436, ...})

Edit: to get the counts sorted (lowest first):

sorted(Counter(m.split()).items(), key=lambda x:x[1])

Or reversed, from highest to lowest:

sorted(Counter(m.split()).items(), key=lambda x:x[1], reverse=True)
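
Since the question asks for the top 15 words, `Counter.most_common(n)` returns the `n` highest counts directly, already ordered from high to low. A minimal sketch, assuming a short made-up string in place of the extracted text `m`:

```python
from collections import Counter

# Made-up sample text; in the answer above this would be the extracted string m
text = "the cat and the dog and the bird"

counts = Counter(text.split())

# most_common(n) returns a list of (word, count) pairs, highest count first
top = counts.most_common(2)
for word, count in top:
    print(word, count)
```

For the book itself, `Counter(m.split()).most_common(15)` should give the 15 most frequent whitespace-separated tokens.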

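With the help of string.punctuation and operator.itemgetter, this could be one way to do it. It gets close. Note that removing punctuation strips the sentence endings (.!?) so that you get clean words. (It also removes apostrophes, which you may not want to remove.)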

from collections import Counter
from string import punctuation
from operator import itemgetter

d = Counter()

with open('wuthering_heights.txt', 'r') as f:
    opening = False

    for line in f:
        if line.startswith('ccx074@pglaf.org'):
            opening = True
        if not opening:
            continue
        if line.startswith('CHAPTER'): # don't count chapter headings
            continue
        if line.startswith('***END OF THE PROJECT GUTENBERG EBOOK'):
            break
        
        line = line.strip()
        if len(line) == 0:
            continue
        
        # clean out punctuation
        line = line.translate(str.maketrans('','',punctuation))
        
        d.update(line.lower().split())

        

print('different words count', len(d))
#print(d.most_common(15))

for word, count in reversed(sorted(d.items(), key=itemgetter(1))):
    print(word, count)
    if count < 290:
        break

This prints:

different words count 10098
and 4693
the 4552
i 3530
to 3476
a 2301
of 2221
he 1922
you 1712
her 1544
in 1459
his 1419
it 1284
she 1269
that 1188
was 1124
my 1098
me 1047
not 932
as 931
him 917
for 836
on 809
with 804
at 783
be 724
had 687
but 673
is 649
have 629
from 485
by 451
would 442
if 440
heathcliff 413
your 404
no 384
said 368
so 357
were 354
linton 340
catherine 333
an 317
we 311
mr 309
or 307
when 307
out 305
what 301
are 295
this 290
they 283
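
If apostrophes should be kept, one option is to pull words out with a regular expression instead of deleting every punctuation character. This is a sketch of a different tokenization from the code above, so its counts would not match the output shown. The function name `tokenize`, the pattern, and the sample sentence are assumptions made for illustration, and only the plain ASCII apostrophe is handled.

```python
import re
from collections import Counter

# Letters, optionally followed by apostrophe + letters (e.g. "don't", "i'll")
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)*")

def tokenize(line):
    """Lowercase the line and return its words, keeping internal apostrophes."""
    return WORD_RE.findall(line.lower())

d = Counter()
d.update(tokenize("1801.--I have just returned; I don't know!"))
print(d.most_common(3))
```

Digits and punctuation such as `1801.--` are dropped because the pattern only matches letters, while `don't` survives as a single word.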
