
counting the word length in a file

My function should open a file, count how many words there are of each length, and print the result. For example,

many('sample.txt')

Words of length 1: 2

Words of length 2: 6

Words of length 3: 7

Words of length 4: 6

My sample.txt file contains: This is a test file. How many words are of length one? How many words are of length three? We should figure it out! Can a function do this?

My code so far:

def many(fname):
    infile = open(fname,'r')
    text = infile.read()
    infile.close()
    L = text.split()
    L.sort
    for item in L:
        if item == 1:
            print('Words of length 1:', L.count(item))

Can anyone tell me what I'm doing wrong? When I call the function, nothing happens. It's clearly because of my code, but I don't know where to go from here. Any help would be nice, thanks.

Since this is homework, I'll post a short solution here and leave it as an exercise to figure out what it does and why it works :)

>>> from collections import Counter
>>> text = open("sample.txt").read()
>>> counts = Counter([len(word.strip('?!,.')) for word in text.split()])
>>> counts[3]
7
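For reference, here is a minimal sketch of how the same idea might be wrapped in the many function from the question. The helper name word_length_counts is hypothetical, and stripping only the characters ?!,. simply follows the snippet above; neither is part of the assignment.

```python
from collections import Counter

def word_length_counts(text, punctuation="?!,."):
    """Return a Counter mapping word length -> number of words of that length."""
    # Strip leading/trailing punctuation so "file." counts as 4 characters, not 5.
    stripped = (word.strip(punctuation) for word in text.split())
    # Skip tokens that were nothing but punctuation (they would have length 0).
    return Counter(len(word) for word in stripped if word)

def many(fname):
    # The with-block closes the file automatically.
    with open(fname) as infile:
        counts = word_length_counts(infile.read())
    for length in sorted(counts):
        print("Words of length %d: %d" % (length, counts[length]))
```

Calling many('sample.txt') then prints one line per word length, in increasing order of length.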

You want to obtain the word lengths (1, 2, 3, 4, ... characters) and, for each length, the number of words of that length in the file.

So up to L = text.split() your approach was good. Now have a look at dictionaries in Python: they let you store the data structure described above while you iterate over the list of words from the file. Just a hint...

What do you expect here

if item == 1:

and here

L.count(item)

And what actually happens? Use a debugger to look at the variable values, or just print them to the screen.
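Printing the values, as suggested, might look like the sketch below. It uses a shortened sample string (an assumption for illustration) instead of reading the file.

```python
text = "This is a test file."
L = text.split()

for item in L:
    # item is a string such as 'This', so comparing it with the integer 1
    # is always False; the length has to be taken with len(item).
    print(repr(item), item == 1, len(item) == 1)

# L.count(item) counts how often that exact string occurs in the list,
# not how many words share its length.
print(L.count("a"))
```

Since item == 1 is never true, the print inside the if in the question is never reached, which is why nothing happens.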

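Maybe also this (note that it counts attached punctuation as part of each word, so some totals differ from the expected output):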

>>> s
'This is a test file. How many words are of length one? How many words are of length three? We should figure it out! Can a function do this?'
>>> {x: [len(w) for w in s.split()].count(x) for x in [len(w) for w in s.split()]}
{1: 2, 2: 6, 3: 5, 4: 6, 5: 4, 6: 5, 8: 1}

Let's analyze your problem step-by-step.

You need to:

  1. Retrieve all the words from a file
  2. Iterate over all the words
  3. Increment the counter N every time you find a word of length N
  4. Output the result

You already did the step 1:

def many(fname): 
    infile = open(fname,'r')
    text = infile.read()
    infile.close()
    L = text.split()

Then you try to sort the words, but L.sort without parentheses never actually calls the method. Sorting would not help anyway: it orders the words alphabetically, not by length, and the order does not matter for counting.

Instead, let's define a Python dictionary to hold the word counts:

    lengths = dict()

@sukhbir correctly suggested in a comment using the Counter class, and I encourage you to look it up, but I'll stick to plain dictionaries in this example, as I find it important to become familiar with the basics of the language before exploring the library.

Let's go on with step 2:

    for word in L:
        length = len(word)

For each word in the list, we assign to the variable length the length of the current word. Let's check if the counter already has a slot for our length:

        if length not in lengths:
            lengths[length] = 0

If no word of this length has been encountered yet, we create that slot and set it to zero. We can finally execute step 3:

        lengths[length] += 1

This increments the counter for the current word length by one.

At the end of the function, lengths will contain a map of word length -> number of words of that length. Let's verify that by printing its contents (step 4):

    for length, counter in lengths.items():
        print("Words of length %d: %d" % (length, counter))

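If you copy and paste the code I wrote (respecting the indentation!!) you will get a count for each word length. Note that split() leaves punctuation attached to the words, so 'file.' counts as five characters.

I strongly suggest you go through the Python tutorial.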
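Put together, the steps above might look like the following sketch. The helper name count_lengths is hypothetical, introduced only so the counting can be tried without a file, and sorting the output by length is an added convenience rather than part of the walkthrough.

```python
def count_lengths(words):
    """Map each word length to the number of words of that length."""
    lengths = dict()
    for word in words:
        length = len(word)
        # Create the slot the first time this length is seen.
        if length not in lengths:
            lengths[length] = 0
        lengths[length] += 1
    return lengths

def many(fname):
    # Step 1: read the file and split it into words.
    with open(fname, 'r') as infile:
        text = infile.read()
    # Steps 2 and 3: count the words by length.
    lengths = count_lengths(text.split())
    # Step 4: print the result, shortest length first.
    for length, counter in sorted(lengths.items()):
        print("Words of length %d: %d" % (length, counter))
```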

The regular expression library might also be helpful, if somewhat overkill. A simple word-matching regular expression might be something like:

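import re
# The with-block closes the file once it has been read.
with open("sample.txt") as f:
    text = f.read()
# Raw string so the backslash in \w reaches the regex engine unchanged.
words = re.findall(r"\w+", text)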

words is then a list of... words :)

This, however, will not properly match words like 'isn't' and 'I'm', as \w only matches letters, digits and underscores, not apostrophes. In the spirit of this being homework I guess I'll leave that for the interested reader, but the Python regular expression documentation is a pretty good place to start.
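One possible pattern for the apostrophe case, offered as a sketch rather than the definitive answer: allow an apostrophe only when more word characters follow it. The pattern, the name WORD and the sample sentence are all assumptions for illustration.

```python
import re

# \w+ followed by zero or more groups of "'" plus more word characters,
# so "isn't" and "I'm" stay whole while trailing punctuation is left out.
WORD = re.compile(r"\w+(?:'\w+)*")

text = "I'm sure this isn't hard."
words = WORD.findall(text)
print(words)
```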

Then my approach for counting these words by length would be something like:

occurrence = dict()
for word in words:
    try:
        occurrence[len(word)] = occurrence[len(word)] + 1
    except KeyError:
        occurrence[len(word)] = 1
print(occurrence.items())

Here a dictionary (occurrence) stores the word lengths and how often each occurs in your text. The try: and except: keywords handle the first time a particular word length is seen: the dictionary raises a KeyError when asked for a key it does not have, and the except: branch catches that exception and stores the first occurrence of that length. The last line prints everything in your dictionary.
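As an alternative to try/except, the same counting can be sketched with dict.get, which returns a default value when the key is missing. This is a swapped-in variant, not the approach described above, and the words list here is a made-up example.

```python
words = ["This", "is", "a", "test", "file"]

occurrence = dict()
for word in words:
    # get() returns 0 for a length not seen yet, so no exception is raised.
    occurrence[len(word)] = occurrence.get(len(word), 0) + 1

print(sorted(occurrence.items()))
```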

Hope this helps :)


 