
Python list serialization - fastest method

I need to load (de-serialize) a pre-computed list of integers from a file in a Python script (into a Python list). The list is large (up to millions of items), and I can choose the format I store it in, as long as loading is fastest.

Which is the fastest method, and why?

  1. Using import on a .py file that just contains the list assigned to a variable
  2. Using cPickle's load
  3. Some other method (perhaps numpy?)

Also, how can one benchmark such things reliably?

Addendum: measuring this reliably is difficult, because import is cached, so it can't be executed multiple times in a test. Loading with pickle also gets faster after the first time, probably because of page caching by the OS. Loading 1 million numbers with cPickle takes 1.1 sec on the first run of the script, and 0.2 sec on subsequent runs.
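For reference, this is roughly the loading being timed (a minimal sketch; the file name and pickle protocol are illustrative, Python 2):

from cPickle import dump, load
from time import time

# Protocol 2 stores the list in a compact binary form.
with open("ints.pkl", "wb") as f:
    dump(range(1000000), f, 2)

start = time()
with open("ints.pkl", "rb") as f:
    data = load(f)
print "Loaded %d ints in %.3f sec" % (len(data), time() - start)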

Intuitively I feel cPickle should be faster, but I'd appreciate numbers (this is quite a challenge to measure, I think).

And yes, it's important for me that this performs quickly.

Thanks

I would guess cPickle will be fastest if you really need the thing in a list.

If you can use an array (the compact sequence type from the standard library's array module), I timed this at a quarter of a second for 1 million integers:

from array import array
from datetime import datetime

def WriteInts(theArray, filename):
    # Dump the array's raw machine-format bytes straight to disk.
    f = open(filename, "wb")
    theArray.tofile(f)
    f.close()

def ReadInts(filename):
    d = datetime.utcnow()
    theArray = array('i')
    f = open(filename, "rb")
    try:
        # Ask for far more items than the file holds; fromfile raises
        # EOFError after reading whatever is actually there.
        theArray.fromfile(f, 1000000000)
    except EOFError:
        pass
    f.close()
    print "Read %d ints in %s" % (len(theArray), datetime.utcnow() - d)
    return theArray

if __name__ == "__main__":
    a = array('i')
    a.extend(range(0, 1000000))
    filename = "a_million_ints.dat"
    WriteInts(a, filename)
    r = ReadInts(filename)
    print "The 5th element is %d" % (r[4])

For benchmarking, see the timeit module in the Python standard library. To see which way is fastest, implement all the ways you can think of and measure them with timeit.
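For instance, a pickle-based loader could be timed like this (a minimal sketch; the file name is hypothetical):

import timeit

setup = "import cPickle"
stmt = "f = open('a_million_ints.pkl', 'rb'); cPickle.load(f); f.close()"
# number=10 loads per trial; the min of 3 trials damps scheduling noise.
print min(timeit.repeat(stmt, setup=setup, repeat=3, number=10))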

Random thought: depending on what you're doing exactly, you may find it fastest to store "sets of integers" in the style used in .newsrc files:

1, 3-1024, 11000-1200000

If you need to check whether something is in that set, then loading and matching with such a representation should be among the fastest ways. This assumes your sets of integers are reasonably dense, with long consecutive sequences of adjacent values.
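A minimal sketch of that idea (parse_ranges and contains are names I made up):

def parse_ranges(text):
    # "1, 3-1024" -> list of inclusive (low, high) pairs
    ranges = []
    for part in text.split(','):
        part = part.strip()
        if '-' in part:
            low, high = part.split('-')
            ranges.append((int(low), int(high)))
        else:
            ranges.append((int(part), int(part)))
    return ranges

def contains(ranges, n):
    # Linear scan; for many ranges, bisecting on the sorted lows is faster.
    return any(low <= n <= high for low, high in ranges)

ranges = parse_ranges("1, 3-1024, 11000-1200000")
print contains(ranges, 500)     # True
print contains(ranges, 2000)    # False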

"how can one benchmark such things reliably?" “如何可靠地对这些事情进行基准测试?”

I don't get the question.

You write a bunch of little functions to create and save your list in various forms.

You write a bunch of little functions to load your lists in their various forms.

You write a little timer function: record the start time, execute the load procedure a few dozen times, and take the elapsed time (a run long enough that OS scheduling noise doesn't dominate the measurement).

You summarize your data in a little report.

What's unreliable about this?
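For example, a bare-bones version of that timer (the loader functions are placeholders you'd supply):

from time import time

def benchmark(load_func, runs=50):
    # Average wall-clock seconds per call over a few dozen runs.
    start = time()
    for _ in xrange(runs):
        load_func()
    return (time() - start) / runs

# print "cPickle: %.4f sec/load" % benchmark(load_with_cpickle)
# print "array:   %.4f sec/load" % benchmark(load_with_array)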

Here are some unrelated questions that show how to measure and compare performance.

Convert list of ints to one number? 将整数列表转换为一个数字?

String concatenation vs. string substitution in Python

To help you with timing, the Python library provides the timeit module:

This module provides a simple way to time small bits of Python code. It has both a command-line interface and a callable one. It avoids a number of common traps for measuring execution times.

An example (from the manual) that compares the cost of using hasattr() vs. try/except to test for missing and present object attributes:

% timeit.py 'try:' '  str.__nonzero__' 'except AttributeError:' '  pass'
100000 loops, best of 3: 15.7 usec per loop
% timeit.py 'if hasattr(str, "__nonzero__"): pass'
100000 loops, best of 3: 4.26 usec per loop
% timeit.py 'try:' '  int.__nonzero__' 'except AttributeError:' '  pass'
1000000 loops, best of 3: 1.43 usec per loop
% timeit.py 'if hasattr(int, "__nonzero__"): pass'
100000 loops, best of 3: 2.23 usec per loop

Do you always need to load the whole file? If not, unpack_from() might be the best solution. Suppose that you have 1000000 integers, but you'd like to load just the ones from 50000 to 50099; you'd do:

import mmap
import struct
intSize = struct.calcsize('i')  # this value would be constant for a given arch
intFile = open('/your/file.of.integers', 'rb')
# unpack_from needs a buffer, not a file; mmap avoids reading it all in
buf = mmap.mmap(intFile.fileno(), 0, access=mmap.ACCESS_READ)
intTuple5K100 = struct.unpack_from('i' * 100, buf, 50000 * intSize)

cPickle will be the fastest since it is saved in binary and no real Python code has to be parsed.

Other advantages are that it is more secure (since it does not execute commands) and that you have no problems with setting $PYTHONPATH correctly.
