
Python: how to speed up this file loading

I am looking for a way to speed up a file loading like this:

The data contains about 1 million lines, tab-separated with "\t" (the tab character) and UTF-8 encoded; parsing the full file with the code below takes about 9 seconds. However, I would like it to take on the order of a second!

import codecs
import sys

def load(filename):
    features = []
    with codecs.open(filename, 'rb', 'utf-8') as f:
        previous = ""
        for n, s in enumerate(f):
            splitted = tuple(s.rstrip().split("\t"))
            if len(splitted) != 2:
                sys.exit("wrong format!")
            if previous >= splitted:
                sys.exit("unordered feature")
            previous = splitted
            features.append(splitted)
    return features

I am wondering whether some binary data format could speed things up? Or whether I could benefit from NumPy or some other library to get faster loading.

Maybe you could also advise me on another speed bottleneck?

EDIT: so I tried some of your ideas, thanks! BTW I really need the tuple (string, string) inside the huge list... Here are the results; I'm gaining 50% of the time :) Now I am going to look into the NumPy binary data, as I have noticed that another huge file was really, really quick to load...

import codecs
import datetime

def load0(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return f.readlines()

def load1(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return [tuple(x.rstrip().split("\t")) for x in f.readlines()]

def load3(filename):
    features = []
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for n, s in enumerate(f):
            splitted = tuple(s.rstrip().split("\t"))
            features.append(splitted)
    return features

def load4(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for s in f:
            yield tuple(s.rstrip().split("\t"))

a = datetime.datetime.now()
r0 = load0(myfile)
b = datetime.datetime.now()
print "f.readlines(): %s" % (b-a)

a = datetime.datetime.now()
r1 = load1(myfile)
b = datetime.datetime.now()
print """[tuple(x.rstrip().split("\\t")) for x in f.readlines()]: %s""" % (b-a)

a = datetime.datetime.now()
r3 = load3(myfile)
b = datetime.datetime.now()
print """load3: %s""" % (b-a)
if r1 == r3: print "OK: speeded and similars!"

a = datetime.datetime.now()
r4 = [x for x in load4(myfile)] 
b = datetime.datetime.now()
print """load4: %s""" % (b-a)
if r4 == r3: print "OK: speeded and similars!"

Results:

f.readlines(): 0:00:00.208000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.310000
load3: 0:00:07.883000
OK: speeded and similars!
load4: 0:00:07.943000
OK: speeded and similars!

Something very strange: I notice that I can get almost double the time on two consecutive runs (but not every time):

>>> ================================ RESTART ================================
>>> 
f.readlines(): 0:00:00.220000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.479000
load3: 0:00:08.288000
OK: speeded and similars!
>>> ================================ RESTART ================================
>>> 
f.readlines(): 0:00:00.279000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:04.983000
load3: 0:00:10.404000
OK: speeded and similars!

LATEST EDIT: well, I tried to modify it to use numpy.load... it is very strange to me... from a "normal" file with my 1022860 strings, and 10 KB. After doing numpy.save(numpy.array(load1(myfile))) I went to 895 MB! Then, reloading this with numpy.load(), I get this kind of timing on consecutive runs:

  >>> ================================ RESTART ================================
  loading: 0:00:11.422000 done.
  >>> ================================ RESTART ================================
  loading: 0:00:00.759000 done.

Maybe numpy does some memory stuff to avoid a future reload?
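For reference, here is a minimal sketch of the numpy.save / numpy.load round trip described above (the file name 'features.npy' and the conversion back to tuples are my own additions). NumPy stores the strings in a fixed-width unicode array sized to the longest string, which is one plausible reason the binary dump grows so much:

import numpy as np

# Hypothetical round trip: dump the (string, string) pairs to a binary
# .npy file, then load them back. np.array() picks a fixed-width unicode
# dtype sized to the longest string, which can inflate the file a lot.
pairs = load1(myfile)                       # list of (unicode, unicode) tuples
arr = np.array(pairs)                       # shape (n_rows, 2), dtype like '<U...'
np.save('features.npy', arr)

loaded = np.load('features.npy')            # fast binary read of the fixed-width array
features = [tuple(row) for row in loaded]   # back to the list-of-tuples shape if needed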

Try this version; since you mentioned the checking isn't important, I have eliminated it:

import codecs

def load(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for s in f:
            yield tuple(s.rstrip().split("\t"))

results = [x for x in load('somebigfile.txt')]

Check how many seconds it takes to just read the lines of the file, like:

import codecs

def load(filename):
    with codecs.open(filename, 'rb', 'utf-8') as f:
        return f.readlines()

If it is significantly less than 9 seconds, then:

  1. try using multiprocessing to split the line-checking work between CPU cores (see the sketch after this list), and/or
  2. use a faster interpreter like PyPy

and see whether either of these speeds things up.
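As a rough illustration of point 1, here is a sketch of the multiprocessing idea, assuming the per-line splitting is the dominant cost. The worker count, chunk size, and function names are arbitrary choices of mine, and the cost of shipping the chunks to the worker processes may well eat up part of the gain:

import codecs
from multiprocessing import Pool

def parse_chunk(lines):
    # Split one chunk of already-decoded lines into (field1, field2) tuples.
    return [tuple(s.rstrip().split("\t")) for s in lines]

def load_parallel(filename, workers=4, chunk_size=100000):
    # Reading the lines is cheap (see the readlines() timing above);
    # fan the splitting work out across worker processes.
    with codecs.open(filename, 'rb', 'utf-8') as f:
        lines = f.readlines()
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    pool = Pool(workers)
    try:
        features = []
        for parsed in pool.map(parse_chunk, chunks):
            features.extend(parsed)
        return features
    finally:
        pool.close()
        pool.join()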

Having checked how long it takes to just iterate over the file, as bpgergo suggests, you can check the following (a combined sketch follows the list):

  • If you know that your file contains 10^6 rows, you could preallocate the list. It should be faster than appending to it in each iteration. Just use features = [None] * (10 ** 6) to initialize your list.
  • Don't cast the result of split() to a tuple; it doesn't seem necessary.
  • You don't seem to benefit from enumerate at all. Just use for line in f: instead of for n, s in enumerate(f):.
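Here is a minimal sketch combining those three points, assuming 10^6 is an upper bound on the real row count (the function name and the final trim are my own additions):

import codecs

def load_preallocated(filename, expected_rows=10 ** 6):
    # Preallocate the result list, iterate directly over the file
    # (no enumerate), and keep the plain lists returned by split()
    # instead of converting each one to a tuple.
    features = [None] * expected_rows
    i = 0
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for line in f:
            features[i] = line.rstrip().split("\t")
            i += 1
    return features[:i]  # trim the unused tail if the file has fewer rows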
