Python: how to speed up this file loading
I am looking for a way to speed up a file load like this:

The data contains about 1 million lines, tab-separated with "\t" (tabulation char) and UTF-8 encoded; it takes about 9 seconds to parse the full file with the code below. However, I would like it to take on the order of one second!
    import codecs
    import sys

    def load(filename):
        features = []
        with codecs.open(filename, 'rb', 'utf-8') as f:
            previous = ""
            for n, s in enumerate(f):
                splitted = tuple(s.rstrip().split("\t"))
                if len(splitted) != 2:
                    sys.exit("wrong format!")
                if previous >= splitted:
                    sys.exit("unordered feature")
                previous = splitted
                features.append(splitted)
        return features
I am wondering whether a binary data format could speed things up, or whether I could benefit from NumPy or some other library for faster loading. Maybe you could also give me advice on another speed bottleneck?
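For reference, one cheap variant to benchmark (a sketch I am adding; load_fast is not from the original post) is to read and decode the whole file at once and then split in bulk, which avoids the per-line decoding overhead of iterating over a codecs file object:

```python
def load_fast(filename):
    # Sketch (not the original code): read the raw bytes once, decode the
    # whole buffer in a single call, then split in bulk. This avoids the
    # per-line decoding overhead of a codecs StreamReader.
    with open(filename, 'rb') as f:
        data = f.read().decode('utf-8')
    return [tuple(line.split('\t')) for line in data.splitlines()]
```

Bulk decoding plus a list comprehension is often noticeably faster than line-by-line codecs iteration, but the only way to know is to time it on the actual file.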
EDIT: so I tried some of your ideas, thanks! BTW, I really need the tuple (string, string) inside the huge list... Here are the results; I'm gaining 50% of the time :) Now I am going to look at the NumPy binary data, as I have noticed that another huge file was really, really quick to load...
    import codecs
    import datetime

    def load0(filename):
        with codecs.open(filename, 'rb', 'utf-8') as f:
            return f.readlines()

    def load1(filename):
        with codecs.open(filename, 'rb', 'utf-8') as f:
            return [tuple(x.rstrip().split("\t")) for x in f.readlines()]

    def load3(filename):
        features = []
        with codecs.open(filename, 'rb', 'utf-8') as f:
            for n, s in enumerate(f):
                splitted = tuple(s.rstrip().split("\t"))
                features.append(splitted)
        return features

    def load4(filename):
        with codecs.open(filename, 'rb', 'utf-8') as f:
            for s in f:
                yield tuple(s.rstrip().split("\t"))
a = datetime.datetime.now()
r0 = load0(myfile)
b = datetime.datetime.now()
print "f.readlines(): %s" % (b-a)
a = datetime.datetime.now()
r1 = load1(myfile)
b = datetime.datetime.now()
print """[tuple(x.rstrip().split("\\t")) for x in f.readlines()]: %s""" % (b-a)
a = datetime.datetime.now()
r3 = load3(myfile)
b = datetime.datetime.now()
print """load3: %s""" % (b-a)
if r1 == r3: print "OK: speeded and similars!"
a = datetime.datetime.now()
r4 = [x for x in load4(myfile)]
b = datetime.datetime.now()
print """load4: %s""" % (b-a)
if r4 == r3: print "OK: speeded and similars!"
Results:
f.readlines(): 0:00:00.208000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.310000
load3: 0:00:07.883000
OK: speeded and similars!
load4: 0:00:07.943000
OK: speeded and similars!
Something very strange: I notice that I can get almost double the time on two consecutive runs (but not every time):
>>> ================================ RESTART ================================
>>>
f.readlines(): 0:00:00.220000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:02.479000
load3: 0:00:08.288000
OK: speeded and similars!
>>> ================================ RESTART ================================
>>>
f.readlines(): 0:00:00.279000
[tuple(x.rstrip().split("\t")) for x in f.readlines()]: 0:00:04.983000
load3: 0:00:10.404000
OK: speeded and similars!
LATEST EDIT: well, I tried to modify it to use numpy.load... It is very strange to me: starting from a "normal" file with my 1022860 strings, and 10 KB, after doing numpy.save(numpy.array(load1(myfile))) I went to 895 MB! Then, reloading this with numpy.load(), I get this kind of timing on consecutive runs:
>>> ================================ RESTART ================================
loading: 0:00:11.422000 done.
>>> ================================ RESTART ================================
loading: 0:00:00.759000 done.
Maybe numpy does some memory stuff to avoid a future reload?
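Regarding the 895 MB numpy file: numpy has to store a list of Python string tuples as an object array, which is a poor fit for it. If the goal is simply to avoid re-parsing the text, a hedged alternative sketch is to cache the parsed list with the standard pickle module (load_cached and parse are hypothetical names introduced here, not from the original post):

```python
import os
import pickle

def load_cached(filename, parse):
    # Sketch with made-up names: cache the parsed list next to the source
    # file using pickle, which stores a list of string tuples far more
    # compactly than numpy.save does for object arrays.
    cache = filename + '.pkl'
    if (os.path.exists(cache)
            and os.path.getmtime(cache) >= os.path.getmtime(filename)):
        with open(cache, 'rb') as f:
            return pickle.load(f)  # fast path: binary reload
    data = parse(filename)         # slow path: parse the text once
    with open(cache, 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    return data
```

As for the 11 s vs. 0.7 s difference between consecutive runs, that is most likely the operating system's page cache keeping the file in memory after the first read, rather than anything numpy itself does.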
Try this version; since you mentioned the checking wasn't important, I have eliminated it.
    import codecs

    def load(filename):
        with codecs.open(filename, 'rb', 'utf-8') as f:
            for s in f:
                yield tuple(s.rstrip().split("\t"))

    results = [x for x in load('somebigfile.txt')]
Check how many seconds it takes to just read the lines of the file, like:

    def load(filename):
        with codecs.open(filename, 'rb', 'utf-8') as f:
            return f.readlines()

If it is significantly less than 9 seconds, then the parsing itself is the bottleneck; see if any of the other suggestions speed things up.
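That baseline can be wrapped in a small timing helper (a sketch; time_raw_read is a made-up name, not part of the original answer):

```python
import codecs
import time

def time_raw_read(filename):
    # Sketch: measure only the cost of decoding the file and splitting it
    # into lines, with no per-line parsing at all.
    start = time.time()
    with codecs.open(filename, 'rb', 'utf-8') as f:
        lines = f.readlines()
    return len(lines), time.time() - start
```

Whatever this returns is the floor; no parsing strategy layered on top of codecs line reading can beat it.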
Having checked how long it takes to just iterate over the file, as bpgergo suggests, you can check the following:

- If you know the number of lines in advance, initialize your list with features = [None] * (10 ** 6) instead of appending.
- Don't cast the result of split() to a tuple; it doesn't seem necessary.
- You don't seem to benefit from enumerate at all. Just use for line in f: instead of for n, s in enumerate(f):
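Putting those suggestions together might look like this (a sketch; load_optimized and the `expected` parameter are my own illustration, not from the answer):

```python
import codecs

def load_optimized(filename, expected=10 ** 6):
    # Sketch combining the suggestions above: preallocate the list, keep
    # the split() result as a plain list, and iterate without enumerate.
    # Assumes the file has at most `expected` lines.
    features = [None] * expected
    i = 0
    with codecs.open(filename, 'rb', 'utf-8') as f:
        for line in f:
            features[i] = line.rstrip().split('\t')
            i += 1
    del features[i:]  # drop the unused preallocated slots
    return features
```

Note this returns lists rather than tuples, per the second suggestion; if the tuples are genuinely required downstream, that conversion cost has to be paid somewhere.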