
Fast JSON serialization (and comparison with Pickle) for cluster computing in Python?

I have a set of data points, each described by a dictionary. The processing of each data point is independent and I submit each one as a separate job to a cluster. Each data point has a unique name, and my cluster submission wrapper simply calls a script that takes a data point's name and a file describing all the data points. That script then accesses the data point from the file and performs the computation.

Since each job has to load the set of all points only to retrieve the point to be run, I wanted to optimize this step by serializing the file describing the set of points into an easily retrievable format.
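For concreteness, here is a minimal sketch of the kind of per-job script I mean (the file format, names, and computation are hypothetical; the load in the first step is what I want to optimize):

import sys
import json

def main(points_file, point_name):
    # The step being optimized: every job loads the full set of points...
    f = open(points_file)
    all_points = json.load(f)
    f.close()
    # ...only to retrieve the single point it was asked to process.
    point = all_points[point_name]
    # The actual (independent) computation on `point` would go here.
    print point

if __name__ == '__main__':
    main(sys.argv[1], sys.argv[2])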

I tried using JSONpickle, with the following method, to serialize a dictionary describing all the data points to a file:

import simplejson

def json_serialize(obj, filename, use_jsonpickle=True):
    f = open(filename, 'w')
    if use_jsonpickle:
        import jsonpickle
        json_obj = jsonpickle.encode(obj)
        f.write(json_obj)
    else:
        simplejson.dump(obj, f, indent=1)
    f.close()
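For reference, a load-side counterpart in the same style (a sketch mirroring json_serialize, not part of my original code):

def json_deserialize(filename, use_jsonpickle=True):
    f = open(filename)
    if use_jsonpickle:
        import jsonpickle
        obj = jsonpickle.decode(f.read())
    else:
        obj = simplejson.load(f)
    f.close()
    return obj

The load time discussed below refers to reading a file written this way back into memory.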

The dictionary contains very simple objects (lists, strings, floats, etc.) and has a total of 54,000 keys. The JSON file is ~20 megabytes in size.

It takes ~20 seconds to load this file into memory, which seems very slow to me. I switched to using pickle with the exact same object, and found that it generates a file that's about 7.8 megabytes in size and can be loaded in ~1-2 seconds. This is a significant improvement, but it still seems like loading a small object (less than 100,000 entries) should be faster. Aside from that, pickle is not human readable, which was the big advantage of JSON for me.

Is there a way to use JSON to get similar or better speedups? If not, do you have other ideas on structuring this?

(Is the right solution to simply "slice" the file describing each event into a separate file and pass that on to the script that runs a data point in a cluster job? It seems like that could lead to a proliferation of files.)

Thanks.

marshal is fastest, but pickle per se is not -- maybe you mean cPickle (which is pretty fast, esp. with a -1 protocol). So, apart from readability issues, here's some code to show various possibilities:

# Saved as pik.py so the timeit commands below can "import pik".
import pickle
import cPickle
import marshal
import json

def maked(N=5400):
  d = {}
  for x in range(N):
    k = 'key%d' % x
    v = [x] * 5
    d[k] = v
  return d
d = maked()

def marsh():
  return marshal.dumps(d)

def pick():
  return pickle.dumps(d)

def pick1():
  return pickle.dumps(d, -1)

def cpick():
  return cPickle.dumps(d)

def cpick1():
  return cPickle.dumps(d, -1)

def jso():
  return json.dumps(d)

def rep():
  return repr(d)

and here are their speeds on my laptop:

$ py26 -mtimeit -s'import pik' 'pik.marsh()'
1000 loops, best of 3: 1.56 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.pick()'
10 loops, best of 3: 173 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.pick1()'
10 loops, best of 3: 241 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.cpick()'
10 loops, best of 3: 21.8 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.cpick1()'
100 loops, best of 3: 10 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.jso()'
10 loops, best of 3: 138 msec per loop
$ py26 -mtimeit -s'import pik' 'pik.rep()'
100 loops, best of 3: 13.1 msec per loop

so, you can have readability and ten times the speed of json.dumps with repr (you sacrifice the ease of parsing from Javascript and other languages); you can have the absolute maximum speed with marshal, almost 90 times faster than json; cPickle offers way more generality (in terms of what you can serialize) than either json or marshal, but if you're never going to use that generality then you might as well go for marshal (or repr if human readability trumps speed).
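Note that these timings are for dumping; since the question is about load time, the load side can be checked the same way by adding functions like the following to pik.py (a sketch, not part of the timings above):

m_s = marshal.dumps(d)
cp_s = cPickle.dumps(d, -1)
j_s = json.dumps(d)

def unmarsh():
  return marshal.loads(m_s)

def uncpick1():
  return cPickle.loads(cp_s)

def unjso():
  return json.loads(j_s)

and timing them in the same manner, e.g. py26 -mtimeit -s'import pik' 'pik.unjso()'.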

As for your "slicing" idea, in lieu of a multitude of files, you might want to consider a database (a multitude of records) -- you might even get away without actual serialization if you're running with data that has some recognizable "schema" to it.
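A minimal sketch of that database idea, using the standard-library sqlite3 with one row per data point (the table layout and names here are just for illustration):

import sqlite3
import cPickle as pickle

def store_points(db_path, points):
  # One row per data point, keyed by its unique name.
  conn = sqlite3.connect(db_path)
  conn.execute('CREATE TABLE IF NOT EXISTS points (name TEXT PRIMARY KEY, data BLOB)')
  rows = [(name, buffer(pickle.dumps(p, -1))) for name, p in points.iteritems()]
  conn.executemany('INSERT OR REPLACE INTO points VALUES (?, ?)', rows)
  conn.commit()
  conn.close()

def load_point(db_path, name):
  # Each cluster job fetches and deserializes only its own record.
  conn = sqlite3.connect(db_path)
  row = conn.execute('SELECT data FROM points WHERE name = ?', (name,)).fetchone()
  conn.close()
  return pickle.loads(str(row[0]))

This way each job reads a single record instead of the whole ~20 MB file.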

I think you are facing a trade-off here: human readability comes at the cost of performance and large file size. Thus, of all the serialization methods available in Python, JSON is not only the most readable, it is also the slowest.

If I had to pursue performance (and file compactness), I'd go for marshal. You can either marshal the whole set with dump() and load() or, building on your idea of slicing things up, marshal separate parts of the data set into separate files. This way you open the door for parallelization of the data processing -- if you feel so inclined.
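For example, a sketch of the whole-set variant with marshal (the file name is arbitrary):

import marshal

def save_points(points, filename='points.marshal'):
  f = open(filename, 'wb')
  marshal.dump(points, f)
  f.close()

def load_points(filename='points.marshal'):
  f = open(filename, 'rb')
  points = marshal.load(f)
  f.close()
  return points

The sliced variant would simply write one such file per part of the data set.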

Of course, there are all kinds of restrictions and warnings in the documentation, so if you decide to play it safe, go for pickle.
