
Python count of words by word length

I was given a .txt file containing some text. I have already cleaned the text (removed punctuation, uppercase letters, and symbols), and now I have a string of words. I am trying to get the character count, len(), of each word in the string, and then make a plot where the number of characters N is on the X-axis and the Y-axis shows how many words have that length.

So far I have:

text = "sample.txt"

def count_chars(txt):
    result = 0
    for char in txt:
        result += 1     # same as result = result + 1
    return result

print(count_chars(text))

So far this gives the total len() of the text instead of the length of each word.

I would like to get something like what Counter() does: it returns each word along with the number of times it is repeated throughout the text.

from collections import Counter
word_count=Counter(text)

I want to get the number of characters per word. Once we have such a count, the plotting should be easier.

Thanks and anything helps!

Okay, first of all you need to open the sample.txt file.

with open('sample.txt', 'r') as text_file:
    text = text_file.read()

or

text = open('sample.txt', 'r').read()

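Now we can go through the words in the text and store the length of each one, for example, in a dict.

counter_dict = {}
# split() with no argument splits on any whitespace, including newlines
for word in text.split():
    counter_dict[word] = len(word)  # map each word to its length
print(counter_dict)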
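
The dict above maps each word to its length, while the plot described in the question needs the number of words for each length. A minimal sketch of that second step, assuming the words are separated by whitespace and using collections.Counter (the function name word_length_counts is only illustrative, it does not come from the question or the answer):

```python
from collections import Counter


def word_length_counts(text):
    """Map each word length to the number of words that have that length."""
    # Counter tallies how often each length appears among the words.
    return Counter(len(word) for word in text.split())


sample = "the quick brown fox jumps over the lazy dog"
length_counts = word_length_counts(sample)
print(sorted(length_counts.items()))  # [(3, 4), (4, 2), (5, 3)]
```

The sorted (length, count) pairs can then be passed to a bar plot, with the lengths on the X-axis and the counts on the Y-axis.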

It looks like the accepted answer doesn't solve the problem as it was posed by the querent:

Then make a plot where N of characters is on the X-axis and the Y-axis is the number of words that have such N len() of characters

import matplotlib.pyplot as plt

# ch10 = ... the text of "Moby Dick"'s chapter 10, as found
# in https://www.gutenberg.org/files/2701/2701-h/2701-h.htm

# split ch10 into a list of words
words = [w for w in ch10.split() if w]
# some words are joined by an em-dash
words = sum((w.split('—') for w in words), [])
# remove suffixes and one prefix
# (str.removesuffix and str.removeprefix need Python 3.9+)
for suffix in (',','.',':',';','!','?','"'):
    words = [w.removesuffix(suffix) for w in words]
words = [w.removeprefix('"') for w in words]

# count the different lengths using a dict
d = {}
for w in words:
    n = len(w)
    d[n] = d.get(n, 0) + 1

# retrieve the relevant info from the dict
lengths, counts = zip(*d.items())

# plot the relevant info
plt.bar(lengths, counts)
plt.xticks(range(1, max(lengths)+1))
plt.xlabel('Word lengths')
plt.ylabel('Word counts')
# what is the longest word?
plt.title(' '.join(w for w in words if len(w)==max(lengths)))

# T H E   E N D

plt.show()
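
The cleanup step above uses str.removesuffix and str.removeprefix, which are only available from Python 3.9 onward. As a rough alternative sketch (a swapped-in technique, not the method used in this answer), the same idea can be approximated with str.strip and string.punctuation. The helper names clean_words and length_histogram are hypothetical, and note that strip trims every leading and trailing punctuation character, so the results may differ slightly from the code above.

```python
import string
from collections import Counter


def clean_words(text):
    """Split on whitespace and em-dashes, then trim punctuation at both ends."""
    # Assumption: curly quotes are also treated as punctuation to trim.
    trim = string.punctuation + '“”‘’'
    pieces = text.replace('—', ' ').split()
    stripped = (p.strip(trim) for p in pieces)
    # Drop anything that was pure punctuation and is now empty.
    return [w for w in stripped if w]


def length_histogram(words):
    """Return sorted (length, count) pairs, ready to unpack for plt.bar."""
    return sorted(Counter(len(w) for w in words).items())


sentence = 'Call me Ishmael. Some years ago—never mind how long precisely.'
print(length_histogram(clean_words(sentence)))
# [(2, 1), (3, 2), (4, 4), (5, 2), (7, 1), (9, 1)]
```

Unpacking the pairs with zip(*...) gives the same two sequences, lengths and counts, that the plotting code expects.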
