
Fastest way to keep data in memory with Redis in Python

I need to save some big arrays once and load them multiple times in a Flask application with Python 3. I originally stored these arrays on disk with the json library. To speed this up, I used Redis on the same machine to store the arrays, serializing each one into a JSON string. I wonder why I get no improvement (it actually takes more time on the server I use) even though Redis keeps data in RAM. I guess the JSON serialization isn't optimized, but I have no clue how I could speed this up:

import json
import redis
import os
import time

current_folder = os.path.dirname(os.path.abspath(__file__))
file_path = os.path.join(current_folder, "my_file")

my_array = [1]*10000000

with open(file_path, 'w') as outfile:
    json.dump(my_array, outfile)

start_time = time.time()
with open(file_path, 'r') as infile:
    my_array = json.load(infile)
print("JSON from disk  : ", time.time() - start_time)

r = redis.Redis()
my_array_as_string = json.dumps(my_array)
r.set("my_array_as_string", my_array_as_string)

start_time = time.time()
my_array_as_string = r.get("my_array_as_string")
print("Fetch from Redis:", time.time() - start_time)

start_time = time.time()
my_array = json.loads(my_array_as_string)
print("Parse JSON      :", time.time() - start_time)

Result:

JSON from disk  : 1.075700044631958
Fetch from Redis: 0.078125
Parse JSON      : 1.0247752666473389

EDIT: it seems that fetching from Redis is actually fast, but the JSON parsing is quite slow. Is there a way to fetch an array directly from Redis without the JSON serialization step? This is what we do with pyMySQL, and it is fast.
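(A minimal sketch of one way to skip the JSON step, assuming the stored data is trusted, since unpickling untrusted bytes is unsafe; the key name is illustrative. On Python 3, pickle.loads on a flat list like this is typically much faster than json.loads.)

import pickle
import redis

r = redis.Redis()
my_array = [1]*10000000

# Store the raw pickled bytes instead of a JSON string.
r.set("my_array_pickled", pickle.dumps(my_array, protocol=pickle.HIGHEST_PROTOCOL))

# Fetch the bytes and unpickle them; no JSON parsing involved.
my_array = pickle.loads(r.get("my_array_pickled"))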

Update: Nov 08, 2019 - ran the same test on Python 3.6

Results:

Dump Time : JSON > msgpack > pickle > marshal
Load Time : JSON > pickle > msgpack > marshal
Space     : marshal > JSON > pickle > msgpack

(reading ">" as "takes more time / more space than")

+---------+-----------+-----------+-------+
| package | dump time | load time | size  |
+---------+-----------+-----------+-------+
| json    | 0.00134   | 0.00079   | 30049 |
| pickle  | 0.00023   | 0.00019   | 20059 |
| msgpack | 0.00031   | 0.00012   | 10036 |
| marshal | 0.00022   | 0.00010   | 50038 |
+---------+-----------+-----------+-------+

I tried pickle vs json vs msgpack vs marshal.

Pickle is much, much slower than JSON in this run, and msgpack is at least 4x faster than JSON; MsgPack looks like the best option you have. (Compare the Python 3.6 update above, where pickle comes out faster than JSON.)

Edit: tried marshal also. Marshal is faster than JSON, but slower than msgpack.

Time taken : Pickle > JSON > Marshal > MsgPack
Space taken : Marshal > Pickle > JSON > MsgPack

import time
import json
import pickle
import msgpack
import marshal
import sys

array = [1]*10000

start_time = time.time()
json_array = json.dumps(array)
print("JSON dumps: ", time.time() - start_time)
print("JSON size: ", sys.getsizeof(json_array))
start_time = time.time()
_ = json.loads(json_array)
print("JSON loads: ", time.time() - start_time)

# --------------

start_time = time.time()
pickled_object = pickle.dumps(array)
print("Pickle dumps: ", time.time() - start_time)
print("Pickle size: ", sys.getsizeof(pickled_object))
start_time = time.time()
_ = pickle.loads(pickled_object)
print("Pickle loads: ", time.time() - start_time)

# --------------

start_time = time.time()
package = msgpack.dumps(array)
print("Msg Pack dumps: ", time.time() - start_time)
print("MsgPack size: ", sys.getsizeof(package))
start_time = time.time()
_ = msgpack.loads(package)
print("Msg Pack loads: ", time.time() - start_time)

# --------------

start_time = time.time()
m_package = marshal.dumps(array)
print("Marshal dumps: ", time.time() - start_time)
print("Marshal size: ", sys.getsizeof(m_package))
start_time = time.time()
_ = marshal.loads(m_package)
print("Marshal loads: ", time.time() - start_time)

Result:

JSON dumps:  0.000760078430176
JSON size:  30037
JSON loads:  0.000488042831421
Pickle dumps:  0.0108790397644
Pickle size:  40043
Pickle loads:  0.0100247859955
Msg Pack dumps:  0.000202894210815
MsgPack size:  10040
Msg Pack loads:  7.58171081543e-05
Marshal dumps:  0.000118017196655
Marshal size:  50042
Marshal loads:  0.000118970870972

Some explanation:

  1. Loading data from disk doesn't always mean disk access; often the data is returned from the in-memory OS cache, and when that happens it is even faster than getting data from Redis (network communication is removed from the total time)

  2. The main performance killer is JSON parsing (Captain Obvious)

  3. JSON parsing from disk is most likely done in parallel with data loading (from the file stream)

  4. There is no option to parse from a stream with Redis (at least I do not know of such an API)


You may speed up the app with minimal changes just by storing your cache files on tmpfs. It is quite close to a Redis setup on the same server.
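(For example, a sketch assuming a Linux host, where /dev/shm is a tmpfs mount by default; the file name is illustrative.)

import json

# Files under /dev/shm live in RAM, so reads avoid physical disk I/O
# while the simple file-based code stays unchanged.
file_path = "/dev/shm/my_file.json"

with open(file_path, "w") as outfile:
    json.dump([1]*10000000, outfile)

with open(file_path, "r") as infile:
    my_array = json.load(infile)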

Agree with @RoopakANelliat that msgpack is about 4x faster than JSON. A format change will boost parsing performance (if it is possible).
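(Combining the two ideas, a sketch assuming the msgpack package is installed; the key name is illustrative.)

import msgpack
import redis

r = redis.Redis()
my_array = [1]*10000000

# Store msgpack-encoded bytes; they are smaller and faster to decode
# than the equivalent JSON string.
r.set("my_array_msgpack", msgpack.dumps(my_array))

my_array = msgpack.loads(r.get("my_array_msgpack"))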

I made brain-plasma specifically for this reason - fast loading and reloading of big objects in a Flask app. It's a shared-memory object namespace for Apache Arrow-serializable objects, including pickled bytestrings generated by pickle.dumps(...).

$ pip install brain-plasma
$ plasma_store -m 10000000 -s /tmp/plasma  # reserve 10MB of shared memory

from brain_plasma import Brain
brain = Brain()

brain['a'] = [1]*10000
brain['a']
# >>> [1, 1, 1, 1, ...]

The RedisJSON extension for Redis: https://oss.redislabs.com/redisjson/
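(For what it's worth, a minimal sketch of calling it from redis-py via raw module commands, assuming the RedisJSON module is loaded on the server; the key and path are illustrative. Note the client still receives JSON text, so this helps mostly when fetching sub-paths instead of the whole array.)

import json
import redis

r = redis.Redis()

# JSON.SET / JSON.GET are RedisJSON module commands.
r.execute_command("JSON.SET", "my_array", ".", json.dumps([1]*10000))

# Fetch a single element by path instead of the whole document.
first = json.loads(r.execute_command("JSON.GET", "my_array", "[0]"))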
