Trying to count words in a file using Python

I am attempting to count the number of 'difficult words' in a file, which requires me to count the number of letters in each word. For now, I am only trying to get single words, one at a time, from a file. I've written the following:

file = open('infile.txt', 'r+')
fileinput = file.read()

for line in fileinput:
    for word in line.split():
        print(word)

Output:

t
h
e

o
r
i
g
i
n

.
.
.

It seems to be printing one character at a time instead of one word at a time. I'd really like to know more about what is actually happening here. Any suggestions?

Use splitlines():

fopen = open('infile.txt', 'r+')
fileinput = fopen.read()

for line in fileinput.splitlines():
    for word in line.split():
        print(word)

fopen.close()

Without splitlines():

You can also use a with statement to open the file, which closes it automatically when the block exits:

with open('infile.txt', 'r+') as fopen:
    for line in fopen:
        for word in line.split():
            print(word)

A file object supports the iteration protocol, which for large files is much better than reading the entire contents into memory at once:

with open('infile.txt', 'r+') as f:
    for line in f:
        for word in line.split():
            print(word)

Assuming you are going to define a filter function, you could do something along these lines:

def is_difficult(word):
    return len(word)>5

with open('infile.txt', 'r+') as f:
    words = (w for line in f for w in line.split() if is_difficult(w))
    for w in words:
        print(w)

which, with an input file of

ciao come va
oggi meglio di domani
ieri peggio di oggi

produces

meglio
domani
peggio
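
Since the original goal was to count the difficult words rather than print them, the same kind of generator can feed sum(). This is only a minimal sketch: it assumes the same is_difficult threshold as above, the helper name count_difficult is made up for illustration, and io.StringIO stands in for a real file so the example runs on its own:

```python
import io

def is_difficult(word):
    # Same assumed threshold as above: more than 5 characters
    return len(word) > 5

def count_difficult(lines):
    # lines can be any iterable of strings, e.g. an open file object
    return sum(1 for line in lines for w in line.split() if is_difficult(w))

# In-memory stand-in for infile.txt, using the sample input above
sample = io.StringIO("ciao come va\noggi meglio di domani\nieri peggio di oggi\n")
print(count_difficult(sample))  # 3
```

With a real file, passing the open file object (for example the f from a with open(...) block) to count_difficult should work the same way.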

Your code gives you single characters because you called .read(), which stores the entire contents of the file as a single string, so when you write for line in fileinput you are iterating over that string character by character. There is no good reason to use read and splitlines, as you can simply iterate over the file object; if you did want a list of lines, you would call readlines.
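
To see the difference concretely, here is a small sketch that uses a made-up in-memory string in place of the result of file.read():

```python
# Hypothetical sample standing in for the result of file.read()
text = "the origin\nof words\n"

# Iterating over a str yields one character at a time
chars = [c for c in text]

# splitlines() yields whole lines, which can then be split into words
words_found = [w for line in text.splitlines() for w in line.split()]

print(chars[:3])    # ['t', 'h', 'e']
print(words_found)  # ['the', 'origin', 'of', 'words']
```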

If you want to group words by length, use a dict with the word length as the key. You will also want to remove punctuation from the words, which you can do with str.strip:

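from collections import defaultdict
from string import punctuation

def words(n, fle):
    # Map each word length to the list of words of that length
    d = defaultdict(list)
    with open(fle) as f:
        for line in f:
            for word in line.split():
                # Strip leading and trailing punctuation
                word = word.strip(punctuation)
                _len = len(word)
                # Keep only words of at least n characters
                if _len >= n:
                    d[_len].append(word)
    return d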

The returned dict contains every word in the file that is at least n characters long, grouped by length.
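
Note that str.strip only removes characters from the two ends of a word, so punctuation inside a word survives. A quick check with a few made-up tokens:

```python
from string import punctuation

# Hypothetical tokens, as line.split() might produce them
tokens = ['"Hello,', "don't", "(end).", "--"]
cleaned = [t.strip(punctuation) for t in tokens]
print(cleaned)  # ['Hello', "don't", 'end', '']
```

A token made up only of punctuation becomes an empty string, which the _len >= n check filters out for any n of at least 1.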


 