
Python program for word count, average word length, word frequency and frequency of words starting with letters of the alphabet

I need to write a Python program that analyzes a file and counts:

  • The number of words
  • The average length of a word
  • How many times each word occurs
  • How many words start with each letter of the alphabet

I've got the code to do the first 2 things:

with open(input('Please enter the full name of the file: '), 'r') as f:
    # split() with no argument skips blank lines and runs of whitespace,
    # so empty strings are never counted as words
    w = [len(word) for line in f for word in line.split()]

total_w = len(w)
avg_w = sum(w) / total_w if total_w else 0

print('The total number of words in this file is:', total_w)
print('The average length of the words in this file is:', avg_w)

But I'm not sure how to do the others. Any help is appreciated.

Btw, when I say "How many words start with each letter of the alphabet" I mean how many words start with "A", how many start with "B", how many start with "C", etc., all the way through to "Z".

Interesting challenge you were given. I made a proposal for question 3: how many times a word occurs in the text. This code is not optimal at all, but it does work.

I also used the file text.txt.

Edit: I noticed I forgot to create wordlist, as it was still held in memory from my session.

wordlist = []
with open('text.txt', 'r') as doc:
    print('opened txt')
    for line in doc:
        # extend rather than assign, so words from every line are kept
        wordlist.extend(line.split())

# compare every word with every other word and print each matching pair
for numbers in range(len(wordlist)):
    for inner_numbers in range(len(wordlist)):
        if inner_numbers != numbers:
            if wordlist[numbers] == wordlist[inner_numbers]:
                print('word: %s == %s' % (wordlist[numbers], wordlist[inner_numbers]))
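The pairwise comparison above prints matching pairs rather than a total per word. One shorter way to get how many times each word occurs is `collections.Counter` from the standard library. This is only a minimal sketch, not the answerer's method: the `count_words` helper and the sample lines are made up for illustration, and lowercasing words and stripping surrounding punctuation are assumptions you may not want.

```python
from collections import Counter
import string

def count_words(lines):
    """Return a Counter mapping each word to its number of occurrences.

    `lines` is any iterable of strings, for example an open file.
    """
    counts = Counter()
    for line in lines:
        for word in line.split():
            # Normalize: lowercase and strip surrounding punctuation (an assumption).
            word = word.lower().strip(string.punctuation)
            if word:
                counts[word] += 1
    return counts

# Made-up sample text; pass an open file object instead to analyze a real file.
sample = ["The cat sat on the mat.", "The dog sat too."]
word_counts = count_words(sample)
for word, n in word_counts.most_common(3):
    print(word, n)
```

To run it on a file, something like `with open('text.txt') as doc: word_counts = count_words(doc)` should work, since a file object yields its lines when iterated.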

Answer to question four: this one isn't hard once you have a list of all the words. Strings can be indexed like lists, so string[0] gives the first letter of a string, and for a list of strings it is stringList[position of word][0].

# print every word that starts with 'a'
for numbers in range(len(wordlist)):
    if wordlist[numbers][0] == 'a':
        print(wordlist[numbers])
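The loop above only checks for 'a' and prints the words instead of counting them. To cover every letter from "A" to "Z" in one pass, the first letters can be tallied with a `Counter`. This is a sketch under assumptions: the `count_initials` helper and the sample words are hypothetical, matching is case-insensitive, and words that do not start with an ASCII letter are ignored.

```python
from collections import Counter
import string

def count_initials(words):
    """Count how many words start with each letter a-z (case-insensitive)."""
    initials = Counter(
        word[0].lower() for word in words
        if word and word[0].lower() in string.ascii_lowercase
    )
    # Report every letter, including those that no word starts with.
    return {letter: initials[letter] for letter in string.ascii_lowercase}

# Made-up sample; in the answer's code this would be `wordlist`.
sample_words = ["Apple", "avocado", "banana", "Cherry", "42", "cat"]
initial_counts = count_initials(sample_words)
for letter, n in initial_counts.items():
    if n:
        print(letter.upper(), n)
```

Returning a dictionary with all 26 keys keeps letters with a count of zero visible, which seems closer to what the question asks for than listing only the letters that occur.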

There are many ways to achieve this. A more advanced approach would start with a simple gathering of the text and its words, then work on the data with ML/DS tools, from which you could extract more statistics (things like "a new paragraph mostly starts with X words" or "X words are mostly preceded/followed by Y words").

If you just need very basic statistics, you can gather them while iterating over each word and do the calculations at the end, like:

stats = {
  'amount': 0,
  'length': 0,
  'word_count': {},
  'initial_count': {}
}

with open('lorem.txt', 'r') as f:
  for line in f:
    line = line.strip()
    if not line:
      continue
    for word in line.split():
      word = word.lower()
      initial = word[0]

      # Add word and length count
      stats['amount'] += 1
      stats['length'] += len(word)

      # Add initial count
      if initial not in stats['initial_count']:
        stats['initial_count'][initial] = 0
      stats['initial_count'][initial] += 1

      # Add word count
      if word not in stats['word_count']:
        stats['word_count'][word] = 0
      stats['word_count'][word] += 1

# Calculate average word length (0 if the file contained no words)
stats['average_length'] = stats['length'] / stats['amount'] if stats['amount'] else 0
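Once the loop has finished, the `stats` dictionary holds everything the question asks for, but nothing is printed yet. Below is a minimal sketch of how the results might be reported. It assumes a dictionary with the same keys as above; the `report` helper is hypothetical, and the sample values are made up to show the output format, not results from a real file.

```python
def report(stats, top=3):
    """Build printable lines from a stats dict shaped like the one above."""
    lines = [
        f"Total words: {stats['amount']}",
        f"Average word length: {stats['average_length']:.2f}",
    ]
    # Most frequent words first; ties are broken alphabetically.
    by_count = sorted(stats['word_count'].items(), key=lambda kv: (-kv[1], kv[0]))
    for word, n in by_count[:top]:
        lines.append(f"{word!r} occurs {n} time(s)")
    # Initial letters in alphabetical order.
    for initial in sorted(stats['initial_count']):
        lines.append(f"{initial.upper()}: {stats['initial_count'][initial]} word(s)")
    return lines

# Made-up sample data, only to show the output format.
sample_stats = {
    'amount': 4,
    'length': 16,
    'average_length': 4.0,
    'word_count': {'lorem': 2, 'ipsum': 1, 'a': 1},
    'initial_count': {'l': 2, 'i': 1, 'a': 1},
}
output = report(sample_stats)
print("\n".join(output))
```

Calling `report(stats)` on the dictionary built by the loop should work the same way, as long as `average_length` has already been computed.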

Online Demo here


 