Python program for word count, average word length, word frequency and frequency of words starting with letters of the alphabet

Question

I need to write a Python program that analyzes a file and counts:

The number of words
The average length of a word
How many times each word occurs
How many words start with each letter of the alphabet

I've got the code to do the first 2 things:

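with open(input('Please enter the full name of the file: '), 'r') as f:
    # split() with no argument skips repeated spaces and blank lines,
    # so no zero-length "words" end up in the list
    w = [len(word) for line in f for word in line.split()]
    total_w = len(w)
    avg_w = sum(w) / total_w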

print('The total number of words in this file is:', total_w)
print('The average length of the words in this file is:', avg_w)

But I'm not sure how to do the others. Any help is appreciated.

Btw, when I say "How many words start with each letter of the alphabet" I mean how many words start with "A", how many start with "B", how many start with "C", etc., all the way through to "Z".

Answer 1

Interesting challenge you were given. Here is a proposal for question 3, how many times a word occurs in the text. This code is not optimal at all, but it does work.

I also used the file text.txt.

Edit: I noticed I forgot to create wordlist, as it was still held in memory.

wordlist = []
with open('text.txt', 'r') as doc:
    print('opened txt')
    for line in doc:
        # extend instead of assign, so words from every line are kept
        wordlist.extend(line.split())

# Brute force: compare every position with every other position
for i in range(len(wordlist)):
    for j in range(len(wordlist)):
        if i != j and wordlist[i] == wordlist[j]:
            print('word: %s == %s' % (wordlist[i], wordlist[j]))
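
The nested loop reports matching pairs rather than a total per word. If you want an actual count for each word, a single-pass tally is one option. A minimal sketch, assuming wordlist holds the words read from the file (the sample list here is made up for illustration) and using collections.Counter from the standard library:

```python
from collections import Counter

# Hypothetical sample; in practice this would be the list read from text.txt
wordlist = ['the', 'cat', 'saw', 'the', 'dog', 'the', 'end']

# Counter maps each distinct word to the number of times it occurs
word_counts = Counter(wordlist)

# most_common() yields (word, count) pairs, highest count first
for word, count in word_counts.most_common():
    print('%s occurs %d time(s)' % (word, count))
```

Note that this treats 'The' and 'the' as different words; lowercase the words first if that is not what you want.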

Answer to question four: this one isn't really hard once you have a list with all the words. Strings can be indexed like lists, so string[0] gives the first letter of a string, and for a list of strings it is stringList[position of word][0].

for word in wordlist:
    # word[0] is the first character of the word
    if word[0] == 'a':
        print(word)
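
The loop above only prints the words starting with 'a'. To get a count for every letter from A to Z, the same first-letter lookup can feed a dictionary. A rough sketch, again with a made-up sample list standing in for wordlist, and assuming uppercase and lowercase initials should be counted together:

```python
import string

# Hypothetical sample; in practice this would be the list read from text.txt
wordlist = ['Apple', 'avocado', 'banana', 'Cherry', 'cat', '42nd']

# Start every letter at zero so letters with no words still show up
letter_counts = {letter: 0 for letter in string.ascii_lowercase}

for word in wordlist:
    first = word[0].lower()
    if first in letter_counts:  # skips words starting with digits or punctuation
        letter_counts[first] += 1

for letter in string.ascii_lowercase:
    print(letter.upper(), letter_counts[letter])
```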

Answer 2

There are many ways to achieve this. A more advanced approach would start with a simple gathering of the text and its words, then work on the data with ML/DS tools, which would let you extract more statistics (things like "a new paragraph mostly starts with X words" or "X words are mostly preceded/succeeded by Y words").

If you just need very basic statistics you can gather them while iterating over each word and do the calculations at the end of it, like:

stats = {
  'amount': 0,
  'length': 0,
  'word_count': {},
  'initial_count': {}
}

with open('lorem.txt', 'r') as f:
  for line in f:
    line = line.strip()
    if not line:
      continue
    for word in line.split():
      word = word.lower()
      initial = word[0]

      # Add word and length count
      stats['amount'] += 1
      stats['length'] += len(word)

      # Add initial count
      if initial not in stats['initial_count']:
        stats['initial_count'][initial] = 0
      stats['initial_count'][initial] += 1

      # Add word count
      if word not in stats['word_count']:
        stats['word_count'][word] = 0
      stats['word_count'][word] += 1

# Calculate average word length
stats['average_length'] = stats['length'] / stats['amount']

Online Demo here
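
For comparison, the same four statistics can be gathered more compactly with collections.Counter. This is only a sketch of an alternative, not the code from the answer above: the function name analyze and the sample lines are hypothetical, and the dictionary keys are simply reused from the stats dict for familiarity.

```python
from collections import Counter

def analyze(lines):
    """Return word total, average length, word counts and initial-letter counts."""
    # split() drops blank lines and repeated whitespace, so every word is non-empty
    words = [word.lower() for line in lines for word in line.split()]
    total = len(words)
    return {
        'amount': total,
        # Guard against empty input to avoid dividing by zero
        'average_length': sum(len(w) for w in words) / total if total else 0,
        'word_count': Counter(words),
        'initial_count': Counter(w[0] for w in words),
    }

# Hypothetical sample input; an open file object can be passed the same way
sample = ['Lorem ipsum dolor', '', 'lorem sit amet']
result = analyze(sample)

print('Words:', result['amount'])
print('Average length:', result['average_length'])
print('Most common words:', result['word_count'].most_common(3))
print('Initials:', sorted(result['initial_count'].items()))
```

With a real file you would call it as analyze(f) inside the with open(...) block, since iterating over a file object yields its lines.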

solution1 0 2018-08-26 20:07:51

solution2 0 ACCEPTED 2018-08-26 21:06:29
