Preferred way of counting lines, characters, and words in a file in Python

Question

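I have found two ways of counting the lines in a file, shown below. (Note: I need to read the whole file at once rather than line by line.)

I am trying to understand which approach is better in terms of efficiency and/or good coding style.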

import glob

names = {}
for each_file in glob.glob('*.cpp'):
    with open(each_file) as f:
        # Count only non-blank lines.
        names[each_file] = sum(1 for line in f if line.strip())

(as seen here)

with open('test.cpp', 'r') as f:
    data = f.read()
# Lines, whitespace-separated words, characters.
print(len(data.splitlines()), len(data.split()), len(data))

(as seen here)

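On the same topic, regarding counting the number of characters and the number of words in a file: is there a better way than the ones suggested above?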
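One thing worth checking before comparing the two snippets is that they do not count the same thing: the first skips blank lines, while `splitlines()` counts them. A minimal sketch, using a made-up sample string (not from the original post) in place of a real file:

```python
import io

# Hypothetical file content containing one blank line.
sample = "int main() {\n\n    return 0;\n}\n"

# First approach: iterate line by line, skipping whitespace-only lines.
non_blank = sum(1 for line in io.StringIO(sample) if line.strip())

# Second approach: split the whole string; blank lines are counted too.
all_lines = len(sample.splitlines())

print(non_blank, all_lines)  # 3 4
```

If blank lines should count, drop the `if line.strip()` filter from the first approach.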

Answer 1

Use a generator expression for memory efficiency (this approach avoids reading the whole file into memory). Here is a demonstration.

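def count(filename, what):
    # Each strategy maps one line to its contribution to the total.
    strategy = {'lines': lambda x: bool(x.strip()),  # True adds 1, blank lines add 0
                'words': lambda x: len(x.split()),   # whitespace-separated tokens
                'chars': len                         # includes the newline
    }

    strat = strategy[what]
    with open(filename) as f:
        return sum(strat(line) for line in f)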

Contents of input.txt:

this is
a test file
i just typed

Output:

>>> count('input.txt', 'lines')
3
>>> count('input.txt', 'words')
8
>>> count('input.txt', 'chars')
33

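Note that newline characters are included when counting characters. Also note that this uses a rather crude definition of a "word" (you did not provide one): it simply splits each line on whitespace and counts the elements of the resulting list.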
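If all three counts are needed, calling `count` three times reads the file three times. A possible variation, not part of the original answer, is to collect everything in one pass. The function name `count_all` is made up for this sketch, and the demo writes the sample input from above to a temporary file:

```python
import os
import tempfile


def count_all(filename):
    """Return (non-blank lines, words, chars) for a file in a single pass."""
    lines = words = chars = 0
    with open(filename) as f:
        for line in f:
            if line.strip():          # same "non-blank line" rule as above
                lines += 1
            words += len(line.split())  # same whitespace-based word rule
            chars += len(line)          # newline included
    return lines, words, chars


# Demo on a temporary copy of the sample input shown above.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("this is\na test file\ni just typed\n")

result = count_all(tmp.name)
os.remove(tmp.name)
print(result)  # (3, 8, 33)
```

Only one line is held in memory at a time, so this keeps the memory behaviour of the generator version.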

Answer 2

Create some test files and run them in a large loop to see the average time. Make sure the test files match your use case.

I used this code:

import time

times1 = []
for i in range(1000):
    names = {}
    t0 = time.perf_counter()  # time.clock() was removed in Python 3.8
    with open("lines.txt") as f:
        names["lines.txt"] = sum(1 for line in f if line.strip())
        print(names)
    times1.append(time.perf_counter() - t0)

times2 = []
for i in range(1000):
    t0 = time.perf_counter()
    with open("lines.txt", 'r') as f:
        data = f.read()
    print("lines.txt", len(data.splitlines()), len(data.split()), len(data))
    times2.append(time.perf_counter() - t0)

# Average time per iteration in seconds (the print calls are inside the timed region).
print(sum(times1) / len(times1))
print(sum(times2) / len(times2))

and got average times of 0.0104755582104 and 0.0180650466201 seconds.

This was with a text file of about 23,000 lines. For example:

print("lines.txt",len(data.splitlines()), len(data.split()), len(data))

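Output (from the original Python 2 run, where this print form shows a tuple): ('lines.txt', 23056, 161392, 1095160)

Test on your actual set of files to get more accurate timing data.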
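As an alternative to a hand-written timing loop, the standard library's `timeit` module can run the comparison. This is only a sketch: the function names and the generated test file are made up for illustration, and the numbers will depend on your machine and files.

```python
import os
import tempfile
import timeit


def count_streaming(path):
    # Line-by-line count of non-blank lines.
    with open(path) as f:
        return sum(1 for line in f if line.strip())


def count_whole_file(path):
    # Read everything at once, then count lines, words, and characters.
    with open(path) as f:
        data = f.read()
    return len(data.splitlines()), len(data.split()), len(data)


# Hypothetical test file: 1000 identical lines of 5 words each.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("some words on a line\n" * 1000)
path = tmp.name

stream_result = count_streaming(path)
whole_result = count_whole_file(path)

runs = 100
t_stream = timeit.timeit(lambda: count_streaming(path), number=runs) / runs
t_whole = timeit.timeit(lambda: count_whole_file(path), number=runs) / runs
print(f"streaming: {t_stream:.6f} s, whole file: {t_whole:.6f} s")

os.remove(path)
```

Unlike the loop above, nothing is printed inside the timed calls, so the measurement reflects only the file reading and counting.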

2 solutions

Solution 1: 6, accepted, 2016-04-10 19:30:58

Solution 2: 4, 2016-04-10 19:01:42