Python program for word count, average word length, word frequency, and number of words starting with each letter

Question

I need to write a Python program that analyzes a file and counts:

The number of words
The average length of a word
How many times each word occurs
How many words start with each letter of the alphabet

I've got the code to do the first 2 things:

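with open(input('Please enter the full name of the file: '), 'r') as f:
    # split() with no argument skips blank lines and repeated spaces,
    # so empty strings are not counted as words
    w = [len(word) for line in f for word in line.split()]
    total_w = len(w)
    # Guard against an empty file to avoid dividing by zero
    avg_w = sum(w) / total_w if total_w else 0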

print('The total number of words in this file is:', total_w)
print('The average length of the words in this file is:', avg_w)

But I'm not sure how to do the others. Any help is appreciated.

Btw, when I say "How many words start with each letter of the alphabet" I mean how many words start with "A", how many start with "B", how many start with "C", etc., all the way through to "Z".

Answer 1

Interesting challenge you were given. I made a proposal for question 3, how many times a word occurs inside the string. This code is not optimal at all (it compares every pair of words and prints each match rather than a total), but it does work.

Also, I used the file text.txt.

Edit: noticed I forgot to create wordlist, as it was still saved in memory.

wordlist = []
with open('text.txt', 'r') as doc:
    print('opened txt')
    for line in doc:
        # extend (not assign) so the words from every line are kept
        wordlist.extend(line.split())

# Compare every word against every other word and report the matches
for numbers in range(len(wordlist)):
    for inner_numbers in range(len(wordlist)):
        if inner_numbers != numbers:
            if wordlist[numbers] == wordlist[inner_numbers]:
                print('word: %s == %s' % (wordlist[numbers], wordlist[inner_numbers]))
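If you want an actual tally per word instead of printed pairs, here is a minimal sketch using collections.Counter from the standard library. The count_words helper and the sample sentence are made up for illustration; in practice you would pass in the wordlist built from the file.

```python
from collections import Counter

def count_words(words):
    """Return a Counter mapping each word to the number of times it occurs."""
    return Counter(words)

# Hypothetical sample standing in for the wordlist read from text.txt
sample = "the cat and the hat and the bat".split()
counts = count_words(sample)

# most_common() yields (word, count) pairs, highest count first
for word, n in counts.most_common():
    print(word, n)
```

This makes a single pass over the list, so it avoids the nested loop.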

Answer to question four: this one isn't really hard once you have created a list with all the words, since strings can be indexed like a list. You can get the first letter of a string by simply doing string[0], and if it's a list of strings, stringList[position of word][0].

for numbers in range(len(wordlist)):
    # lower() so that 'Apple' and 'apple' both match
    if wordlist[numbers][0].lower() == 'a':
        print(wordlist[numbers])
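The loop above only handles 'a'. To cover "A" through "Z" in one go, something like the following could work. It is only a sketch: count_initials and the sample list are hypothetical names, and it assumes you want upper and lower case counted together and anything not starting with an ASCII letter ignored.

```python
import string

def count_initials(words):
    """Count how many words start with each letter a-z (case-insensitive)."""
    # Start every letter at 0 so letters with no words still show up
    counts = {letter: 0 for letter in string.ascii_lowercase}
    for word in words:
        if not word:
            continue  # skip empty strings
        first = word[0].lower()
        if first in counts:  # ignore digits, punctuation, etc.
            counts[first] += 1
    return counts

# Hypothetical sample standing in for the wordlist read from the file
sample = ["Apple", "avocado", "banana", "Cherry", "42", "cider"]
initials = count_initials(sample)
for letter in string.ascii_lowercase:
    print(letter.upper(), initials[letter])
```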

Answer 2

There are many ways to achieve this. A more advanced approach would involve an initial simple gathering of the text and its words, then working on the data with ML/DS tools, with which you could extract more statistics (things like "a new paragraph mostly starts with X words" / "X words are mostly preceded/succeeded by Y words", etc.).

If you just need very basic statistics, you can gather them while iterating over each word and do the calculations at the end, like:

stats = {
  'amount': 0,
  'length': 0,
  'word_count': {},
  'initial_count': {}
}

with open('lorem.txt', 'r') as f:
  for line in f:
    line = line.strip()
    if not line:
      continue
    for word in line.split():
      word = word.lower()
      initial = word[0]

      # Add word and length count
      stats['amount'] += 1
      stats['length'] += len(word)

      # Add initial count
      if initial not in stats['initial_count']:
        stats['initial_count'][initial] = 0
      stats['initial_count'][initial] += 1

      # Add word count
      if word not in stats['word_count']:
        stats['word_count'][word] = 0
      stats['word_count'][word] += 1

# Calculate average word length (0 if the file had no words)
stats['average_length'] = stats['length'] / stats['amount'] if stats['amount'] else 0
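Once the dictionary is filled, you still need to display it somehow. Below is one possible way to turn it into report lines. The summarize function, its top parameter, and the hand-built example dictionary are all assumptions for illustration; the only thing taken from the code above is the shape of stats.

```python
def summarize(stats, top=3):
    """Return report lines for a stats dict shaped like the one above."""
    lines = [
        "Total words: %d" % stats['amount'],
        "Average length: %.2f" % stats['average_length'],
    ]
    # Most frequent words first; ties are broken alphabetically
    ranked = sorted(stats['word_count'].items(), key=lambda kv: (-kv[1], kv[0]))
    for word, n in ranked[:top]:
        lines.append("%s: %d" % (word, n))
    return lines

# Hypothetical, hand-built stats for the text "lorem ab lorem ab"
example = {
    'amount': 4,
    'length': 14,
    'average_length': 3.5,
    'word_count': {'lorem': 2, 'ab': 2},
    'initial_count': {'l': 2, 'a': 2},
}

for line in summarize(example):
    print(line)
```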

Online Demo here


Answer 1
0 2018-08-26 20:07:51

Answer 2
0 Accepted 2018-08-26 21:06:29
