
Python: how to speed up this file loading

I am looking for a way to speed up a file loading like this:

The data contains about 1 million lines, tab-separated with "\t" (the tab character) and UTF-8 encoded; parsing the full file with the code below takes about 9 seconds. However, I would like it to take on the order of a second!

import codecs
import sys

def load(filename):
    features = []
    with codecs.open(filename, 'rb', 'utf-8') as f:
        previous = ""
        for n, s in enumerate(f):
            splitted = tuple(s.rstrip().split("\t"))
            if len(splitted) != 2:
                sys.exit("wrong format!")
            if previous >= splitted:
                sys.exit("unordered feature")
            previous = splitted
            features.append(splitted)
    return features

I am wondering whether some binary data format could speed things up? Or whether I could benefit from NumPy or some other library to get faster loading.

Maybe you could also advise me on another speed bottleneck?

EDIT: so I tried some of your ideas, thanks! BTW I really need the tuple (string, string) inside the huge list... Here are the results; I'm gaining 50% of the time :) Now I am going to look into the NumPy binary data, as I have noticed that another huge file was really, really quick to load...

import codecs
import datetime

def load0(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return f.readlines()

def load1(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return [tuple(x.rstrip().split("\t")) for x in f.readlines()]

def load3(filename):
    features = []
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for n, s in enumerate(f):
            splitted = tuple(s.rstrip().split("\t"))
            features.append(splitted)
    return features

def load4(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for s in f:
            yield tuple(s.rstrip().split("\t"))

a = datetime.datetime.now()
r0 = load0(myfile)
b = datetime.datetime.now()
print "f.readlines(): %s" % (b-a)

a = datetime.datetime.now()
r1 = load1(myfile)
b = datetime.datetime.now()
print """[tuple(x.rstrip().split("\\t")) for x in f.readlines()]: %s""" % (b-a)

a = datetime.datetime.now()
r3 = load3(myfile)
b = datetime.datetime.now()
print """load3: %s""" % (b-a)
if r1 == r3: print "OK: speeded and similars!"

a = datetime.datetime.now()
r4 = [x for x in load4(myfile)] 
b = datetime.datetime.now()
print """load4: %s""" % (b-a)
if r4 == r3: print "OK: speeded and similars!"

Results:

f.readlines(): 0:00:00.208000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.310000
load3: 0:00:07.883000
OK: speeded and similars!
load4: 0:00:07.943000
OK: speeded and similars!

Something very strange: I notice that I can get almost double the time on two consecutive runs (but not every time):

>>> ================================ RESTART ================================
>>> 
f.readlines(): 0:00:00.220000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.479000
load3: 0:00:08.288000
OK: speeded and similars!
>>> ================================ RESTART ================================
>>> 
f.readlines(): 0:00:00.279000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:04.983000
load3: 0:00:10.404000
OK: speeded and similars!

LATEST EDIT: well, I tried to modify it to use numpy.load... it is very strange to me... from a "normal" file with my 1022860 strings, and 10 KB. After doing numpy.save(numpy.array(load1(myfile))) I went to 895 MB! Then, reloading this with numpy.load(), I get this kind of timing on consecutive runs:

  >>> ================================ RESTART ================================
  loading: 0:00:11.422000 done.
  >>> ================================ RESTART ================================
  loading: 0:00:00.759000 done.

Maybe numpy does some memory stuff to avoid a future reload?
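For reference, here is a minimal sketch of the numpy.save / numpy.load round trip described above (the file name 'features.npy' and the conversion back to tuples are my own additions). NumPy stores the strings in a fixed-width unicode array sized to the longest string, which is one plausible reason the binary dump grows so much:

import numpy as np

# Hypothetical round trip: dump the (string, string) pairs to a binary
# .npy file, then load them back. np.array() picks a fixed-width unicode
# dtype sized to the longest string, which can inflate the file a lot.
pairs = load1(myfile)                       # list of (unicode, unicode) tuples
arr = np.array(pairs)                       # shape (n_rows, 2), dtype like '<U...'
np.save('features.npy', arr)

loaded = np.load('features.npy')            # fast binary read of the fixed-width array
features = [tuple(row) for row in loaded]   # back to the list-of-tuples shape if needed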

Try this version; since you mentioned the checking isn't important, I have eliminated it:

import codecs

def load(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for s in f:
            yield tuple(s.rstrip().split("\t"))

results = [x for x in load('somebigfile.txt')]

Check how many seconds it takes to just read the lines of the file, like:

import codecs

def load(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return f.readlines()

If it is significantly less than 9 seconds, then:

  1. try using multiprocessing to split the line-checking work between CPU cores (see the sketch after this list), and/or
  2. use a faster interpreter like PyPy

and see whether either of these speeds things up.
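As a rough illustration of point 1, here is a sketch of the multiprocessing idea, assuming the per-line splitting is the dominant cost. The worker count, chunk size, and function names are arbitrary choices of mine, and the cost of shipping the chunks to the worker processes may well eat up part of the gain:

import codecs
from multiprocessing import Pool

def parse_chunk(lines):
    # Split one chunk of already-decoded lines into (field1, field2) tuples.
    return [tuple(s.rstrip().split("\t")) for s in lines]

def load_parallel(filename, workers=4, chunk_size=100000):
    # Reading the lines is cheap (see the readlines() timing above);
    # fan the splitting work out across worker processes.
    with codecs.open(filename, 'rb', 'utf-8') as f:
        lines = f.readlines()
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    pool = Pool(workers)
    try:
        features = []
        for parsed in pool.map(parse_chunk, chunks):
            features.extend(parsed)
        return features
    finally:
        pool.close()
        pool.join()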

Having checked how long it takes to just iterate over the file, as bpgergo suggests, you can check the following (a combined sketch follows the list):

  • If you know that your file contains 10^6 rows, you could preallocate the list. It should be faster than appending to it in each iteration. Just use features = [None] * (10 ** 6) to initialize your list.
  • Don't cast the result of split() to a tuple; it doesn't seem necessary.
  • You don't seem to benefit from enumerate at all. Just use for line in f: instead of for n, s in enumerate(f):.
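Here is a minimal sketch combining those three points, assuming 10^6 is an upper bound on the real row count (the function name and the final trim are my own additions):

import codecs

def load_preallocated(filename, expected_rows=10 ** 6):
    # Preallocate the result list, iterate directly over the file
    # (no enumerate), and keep the plain lists returned by split()
    # instead of converting each one to a tuple.
    features = [None] * expected_rows
    i = 0
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for line in f:
            features[i] = line.rstrip().split("\t")
            i += 1
    return features[:i]  # trim the unused tail if the file has fewer rows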
