Preferred way of counting lines, characters and words from a file as a whole in Python

Question

I have found two ways of counting the lines of a file, shown below. (Note: I need to read the file as a whole, not line by line.)

I am trying to get a feel for which approach is better in terms of efficiency and/or good coding style.

import glob

names = {}
for each_file in glob.glob('*.cpp'):
    with open(each_file) as f:
        # count only the non-blank lines of each file
        names[each_file] = sum(1 for line in f if line.strip())

(as seen here )

with open('test.cpp', 'r') as f:  # the with block closes the file promptly
    data = f.read()
print(len(data.splitlines()), len(data.split()), len(data))

(as seen here )

On the same topic: is there a better way to count the number of characters and the number of words in a file than the one suggested above?

Answer 1

Use a generator expression for memory efficiency (this approach will avoid reading the whole file into memory). Here's a demonstration.

def count(filename, what):
    strategy = {'lines': lambda x: bool(x.strip()),
                'words': lambda x: len(x.split()),
                'chars': len
    }

    strat = strategy[what]
    with open(filename) as f:
        return sum(strat(line) for line in f)

input.txt:

this is
a test file
i just typed

output:

>>> count('input.txt', 'lines')
3
>>> count('input.txt', 'words')
8
>>> count('input.txt', 'chars')
33

Note that when counting characters, the newline characters are counted as well. Also note that this uses a pretty crude definition of "word" (you did not provide one): it just splits each line on whitespace and counts the elements of the returned list.
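
If all three counts are needed at once, the same line-by-line idea can collect them in a single pass over the file. This is only a minimal sketch and not part of the code above: the name count_all is made up for illustration, and it assumes the same conventions as before (blank lines are skipped, words are whitespace-separated, newline characters count as characters).

```python
def count_all(filename):
    """Return (non-blank lines, words, characters) in one pass over the file."""
    lines = words = chars = 0
    with open(filename) as f:
        for line in f:
            if line.strip():              # skip blank lines, as in 'lines' above
                lines += 1
            words += len(line.split())    # crude definition: split on whitespace
            chars += len(line)            # includes the trailing newline
    return lines, words, chars
```

For the input.txt shown above, count_all('input.txt') should return (3, 8, 33), matching the three separate calls.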

Answer 2

Create a few test files and test them in a big loop to see the average times. Make sure the test files fit your scenarios.

I used this code:

import time

times1 = []
for i in range(1000):
    names = {}
    t0 = time.perf_counter()  # time.clock() was removed in Python 3.8
    with open("lines.txt") as f:
        # count non-blank lines lazily, one line at a time
        names["lines.txt"] = sum(1 for line in f if line.strip())
        print(names)
    times1.append(time.perf_counter() - t0)

times2 = []
for i in range(1000):
    t0 = time.perf_counter()
    with open("lines.txt") as f:
        data = f.read()  # read the whole file into memory
    print("lines.txt", len(data.splitlines()), len(data.split()), len(data))
    times2.append(time.perf_counter() - t0)

# average time per iteration for each approach
print(sum(times1) / len(times1))
print(sum(times2) / len(times2))

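and got average timings of 0.0104755582104 seconds for the first (line-by-line) approach and 0.0180650466201 seconds for the second (read-everything) approach. Keep in mind that the second loop also counts words and characters, so the two timings do not measure exactly the same work.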

This was on a text file with about 23,000 lines. For example:

print("lines.txt", len(data.splitlines()), len(data.split()), len(data))

reports 23056 lines, 161392 words, and 1095160 characters.

Test this on your actual file set to get more accurate timing data.
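
One way to run such a comparison with less boilerplate is the standard-library timeit module. The following is a sketch under assumptions, not the code that produced the timings above: the helper names count_lazy, count_whole and average_time are made up for illustration, both functions count only non-blank lines so that they do comparable work, and nothing is printed inside the timed region.

```python
import timeit


def count_lazy(path):
    """Count non-blank lines while iterating over the file line by line."""
    with open(path) as f:
        return sum(1 for line in f if line.strip())


def count_whole(path):
    """Count non-blank lines after reading the whole file into memory."""
    with open(path) as f:
        data = f.read()
    return sum(1 for line in data.splitlines() if line.strip())


def average_time(func, path, repeats=100):
    """Return the average number of seconds one call of func(path) takes."""
    total = timeit.timeit(lambda: func(path), number=repeats)
    return total / repeats


# Example usage (assumes a file named lines.txt exists in the current directory):
# print(average_time(count_lazy, "lines.txt"))
# print(average_time(count_whole, "lines.txt"))
```

For ordinary text files both functions should return the same count, so any difference in the averages comes from how the file is read rather than from what is counted.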

2 answers

solution1
6 ACCEPTED 2016-04-10 19:30:58

solution2
4 2016-04-10 19:01:42
