
Fastest way to keep data in memory with Redis in Python

I need to save once and load multiple times some big arrays in a Flask application with Python 3. I originally stored these arrays on disk with the json library. To speed this up, I used Redis on the same machine, storing each array as a serialized JSON string. I wonder why I get no improvement (it actually takes more time on the server I use) even though Redis keeps data in RAM. I guess the JSON serialization isn't optimized, but I have no clue how I could speed this up:

import json
import redis
import os 
import time

current_folder = os.path.dirname(os.path.abspath(__file__))
file_path = os.path.join(current_folder, "my_file")

my_array = [1]*10000000

with open(file_path, 'w') as outfile:
    json.dump(my_array, outfile)

start_time = time.time()
with open(file_path, 'r') as infile:
    my_array = json.load(infile)
print("JSON from disk  : ", time.time() - start_time)

r = redis.Redis()
my_array_as_string = json.dumps(my_array)
r.set("my_array_as_string", my_array_as_string)

start_time = time.time()
my_array_as_string = r.get("my_array_as_string")
print("Fetch from Redis:", time.time() - start_time)

start_time = time.time()
my_array = json.loads(my_array_as_string)
print("Parse JSON      :", time.time() - start_time)

Result:

JSON from disk  : 1.075700044631958
Fetch from Redis: 0.078125
Parse JSON      : 1.0247752666473389

EDIT: it seems that fetching from Redis is actually fast, but the JSON parsing is quite slow. Is there a way to fetch an array directly from Redis without the JSON serialization part? This is what we do with pyMySQL and it is fast.
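(A hedged sketch, not something the question tried: for a flat list of integers, one way to skip JSON entirely is to store the raw bytes of an array.array in Redis and rebuild the array on load.)

import array
import redis

r = redis.Redis()
my_array = [1] * 10000000

# Pack the integers as raw 8-byte signed values and store the bytes directly.
packed = array.array('q', my_array)
r.set("my_array_raw", packed.tobytes())

# Load: no JSON parsing, just a memory copy back into an array.
restored = array.array('q')
restored.frombytes(r.get("my_array_raw"))
my_array = restored.tolist()  # only needed if a plain list is really required

(This only works for homogeneous numeric data, so it sidesteps rather than answers the general serialization question.)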

Update: Nov 08, 2019 - Ran the same test on Python 3.6

Results:

Dump time (slowest to fastest): JSON > msgpack > pickle > marshal
Load time (slowest to fastest): JSON > pickle > msgpack > marshal
Space (largest to smallest)   : marshal > JSON > pickle > msgpack

+---------+---------------+---------------+--------------+
| package | dump time (s) | load time (s) | size (bytes) |
+---------+---------------+---------------+--------------+
| json    | 0.00134       | 0.00079       | 30049        |
| pickle  | 0.00023       | 0.00019       | 20059        |
| msgpack | 0.00031       | 0.00012       | 10036        |
| marshal | 0.00022       | 0.00010       | 50038        |
+---------+---------------+---------------+--------------+

I tried pickle vs. json vs. msgpack vs. marshal.

Pickle is much, much slower than JSON. And msgpack is at least 4x faster than JSON. MsgPack looks like the best option you have.

Edit: I tried marshal as well. Marshal is faster than JSON, but slower than msgpack.

Time taken (slowest to fastest) : Pickle > JSON > Marshal > MsgPack
Space taken (largest to smallest): Marshal > Pickle > JSON > MsgPack

import time
import json
import pickle
import msgpack
import marshal
import sys

array = [1]*10000

start_time = time.time()
json_array = json.dumps(array)
print("JSON dumps: ", time.time() - start_time)
print("JSON size: ", sys.getsizeof(json_array))
start_time = time.time()
_ = json.loads(json_array)
print("JSON loads: ", time.time() - start_time)

# --------------

start_time = time.time()
pickled_object = pickle.dumps(array)
print("Pickle dumps: ", time.time() - start_time)
print("Pickle size: ", sys.getsizeof(pickled_object))
start_time = time.time()
_ = pickle.loads(pickled_object)
print("Pickle loads: ", time.time() - start_time)

# --------------

start_time = time.time()
package = msgpack.dumps(array)
print("Msg Pack dumps: ", time.time() - start_time)
print("MsgPack size: ", sys.getsizeof(package))
start_time = time.time()
_ = msgpack.loads(package)
print("Msg Pack loads: ", time.time() - start_time)

# --------------

start_time = time.time()
m_package = marshal.dumps(array)
print("Marshal dumps: ", time.time() - start_time)
print("Marshal size: ", sys.getsizeof(m_package))
start_time = time.time()
_ = marshal.loads(m_package)
print("Marshal loads: ", time.time() - start_time)

Result:

JSON dumps:  0.000760078430176
JSON size:  30037
JSON loads:  0.000488042831421
Pickle dumps:  0.0108790397644
Pickle size:  40043
Pickle loads:  0.0100247859955
Msg Pack dumps:  0.000202894210815
MsgPack size:  10040
Msg Pack loads:  7.58171081543e-05
Marshal dumps:  0.000118017196655
Marshal size:  50042
Marshal loads:  0.000118970870972

Some explanations:

  1. Loading data from disk doesn't always mean disk access; often the data is returned from the in-memory OS cache, and when that happens it is even faster than getting the data from Redis (no network communication in the total time).

  2. The main performance killer is JSON parsing (Captain Obvious).

  3. When reading from disk, JSON parsing is most likely done in parallel with loading the data from the file stream.

  4. With Redis there is no option to parse from a stream (at least I don't know of such an API).


You may speed up the app with minimal changes just by storing your cache files on tmpfs. It is quite close to a Redis setup on the same server.
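(A sketch of that, assuming a Linux machine where /dev/shm is a tmpfs mount: only the path in the original example needs to change.)

import json
import os

# /dev/shm is typically a tmpfs mount on Linux, so this file lives in RAM.
file_path = os.path.join("/dev/shm", "my_file")

with open(file_path, 'w') as outfile:
    json.dump([1] * 10000000, outfile)

with open(file_path, 'r') as infile:
    my_array = json.load(infile)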

I agree with @RoopakANelliat that msgpack is about 4x faster than JSON. A format change will boost parsing performance (if it is possible in your case).
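(A sketch of what that format change could look like in the original example, assuming the msgpack package is installed; packb/unpackb are its bytes-level dump/load functions.)

import msgpack
import redis

r = redis.Redis()
my_array = [1] * 10000000

# Serialize with msgpack instead of JSON before storing in Redis.
r.set("my_array_msgpack", msgpack.packb(my_array))

# Fetch and deserialize; parsing msgpack is typically much faster than parsing JSON.
my_array = msgpack.unpackb(r.get("my_array_msgpack"))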

I made brain-plasma specifically for this reason - fast loading and reloading of big objects in a Flask app. It's a shared-memory object namespace for Apache Arrow-serializable objects, including pickled byte strings generated by pickle.dumps(...).

$ pip install brain-plasma
$ plasma_store -m 10000000 -s /tmp/plasma # 10MB memory
from brain_plasma import Brain
brain = Brain()

brain['a'] = [1]*10000
brain['a']
# >>> [1,1,1,1,...]

The RedisJSON extension for Redis: https://oss.redislabs.com/redisjson/
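(A minimal sketch of using it from Python, assuming the RedisJSON module is loaded on the Redis server and a redis-py version (4.x or later) that exposes it via r.json().)

import redis

r = redis.Redis()

# JSON.SET / JSON.GET are handled server-side by the RedisJSON module.
r.json().set("my_array", "$", [1] * 10000)
my_array = r.json().get("my_array")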
