Fastest way to store a numpy array in redis

Question

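I'm using redis on an AI project.

The idea is to have multiple environment simulators running policies on many cpu cores. The simulators write experience (a list of state/action/reward tuples) to a redis server (the replay buffer). A training process then reads the experience as a dataset to generate a new policy. The new policy is deployed to the simulators, data from the previous run is deleted, and the process continues.

Most of the experience is captured in the "state". This is normally represented as a large numpy array of dimension, say, 80 x 80. The simulators generate these as fast as the cpu will allow.

To this end, does anyone have good ideas or experience of the best/fastest/simplest way to write a lot of numpy arrays to redis? This is all on the same machine for now, but later it could be on a set of cloud servers. Code samples welcome!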

Answer 1

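I don't know if it is the fastest, but you could try something like this...

Storing a Numpy array to Redis goes like this - see function toRedis():

get the shape of the Numpy array and encode it
append the Numpy array as bytes to the shape
store the encoded array under the supplied key

Retrieving a Numpy array goes like this - see function fromRedis():

retrieve from Redis the encoded string corresponding to the supplied key
extract the shape of the Numpy array from the string
extract the data and repopulate the Numpy array, reshaping to the original shape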

#!/usr/bin/env python3

import struct
import redis
import numpy as np

def toRedis(r,a,n):
   """Store given Numpy array 'a' in Redis under key 'n'"""
   h, w = a.shape
   shape = struct.pack('>II',h,w)
   encoded = shape + a.tobytes()

   # Store encoded data in Redis
   r.set(n,encoded)
   return

def fromRedis(r,n):
   """Retrieve Numpy array from Redis key 'n'"""
   encoded = r.get(n)
   h, w = struct.unpack('>II',encoded[:8])
   # Skip the 8-byte shape header, and pass the dtype explicitly:
   # np.frombuffer defaults to float64, which would misread the uint16 data
   a = np.frombuffer(encoded[8:], dtype=np.uint16).reshape(h,w)
   return a

# Create 80x80 numpy array to store
a0 = np.arange(6400,dtype=np.uint16).reshape(80,80) 

# Redis connection
r = redis.Redis(host='localhost', port=6379, db=0)

# Store array a0 in Redis under name 'a0array'
toRedis(r,a0,'a0array')

# Retrieve from Redis
a1 = fromRedis(r,'a0array')

np.testing.assert_array_equal(a0,a1)

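You could add more flexibility by encoding the dtype of the Numpy array along with the shape. I didn't do that because you may already know that all your arrays are of one specific type, in which case the code would just get bigger and harder to read for no reason.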
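One way to carry the dtype and shape without hand-rolling a header is NumPy's own .npy format, written to an in-memory buffer. The following is a minimal sketch and not part of the original answer: the helper names array_to_bytes and bytes_to_array are made up for illustration, and the resulting bytes would be passed to r.set() and read back from r.get() in the same way as above.

```python
import io
import numpy as np

def array_to_bytes(a):
    """Serialize array 'a' to bytes in .npy format (the header holds dtype and shape)."""
    buf = io.BytesIO()
    np.save(buf, a, allow_pickle=False)
    return buf.getvalue()

def bytes_to_array(b):
    """Rebuild an array from bytes produced by array_to_bytes()."""
    return np.load(io.BytesIO(b), allow_pickle=False)

# Round trip for arrays of different dtype and dimensionality
a0 = np.arange(6400, dtype=np.uint16).reshape(80, 80)
a1 = np.random.rand(4, 3, 2).astype(np.float32)

for original in (a0, a1):
    restored = bytes_to_array(array_to_bytes(original))
    np.testing.assert_array_equal(original, restored)
    assert restored.dtype == original.dtype
```

The .npy header adds a small fixed overhead per array (on the order of a hundred bytes), so this trades a little space for not having to know the dtype or number of dimensions in advance.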

Rough benchmark on a modern iMac:

80x80 Numpy array of np.uint16   => 58 microseconds to write
200x200 Numpy array of np.uint16 => 88 microseconds to write

Keywords: Python, Numpy, Redis, array, serialise, serialize, key, incr, unique

Answer 2

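You could also consider using msgpack-numpy, which provides "encoding and decoding routines that enable the serialization and deserialization of numerical and array data types provided by numpy using the highly efficient msgpack format." - see https://msgpack.org/.

Quick proof of concept: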

import msgpack
import msgpack_numpy as m
import numpy as np
m.patch()               # Important line to monkey-patch for numpy support!

from redis import Redis

r = Redis('127.0.0.1')

# Create an array, then use msgpack to serialize it 
d_orig = np.array([1,2,3,4])
d_orig_packed = m.packb(d_orig)

# Set the data in redis
r.set('d', d_orig_packed)

# Retrieve and unpack the data
d_out = m.unpackb(r.get('d'))

# Check they match
assert np.array_equal(d_orig, d_out)   # np.alltrue was removed in NumPy 2.0
assert d_orig.dtype == d_out.dtype

On my machine, msgpack runs much quicker than using struct:

In: %timeit struct.pack('4096L', *np.arange(0, 4096))
1000 loops, best of 3: 443 µs per loop

In: %timeit m.packb(np.arange(0, 4096))
The slowest run took 7.74 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 32.6 µs per loop
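Note that the struct.pack call above packs the array as 4096 separate integers, whereas the code in Answer 1 uses struct only for the two-integer shape header and tobytes() for the data itself. A sketch of a timing that also covers tobytes(), using only the standard timeit module, might look like this. The variable names are illustrative, msgpack-numpy is left out so that the snippet needs only NumPy, and the absolute numbers will vary from machine to machine:

```python
import struct
import timeit
import numpy as np

arr = np.arange(0, 4096)
n = 1000  # repetitions per measurement

# Element-by-element packing, similar to the struct timing above
# ('q' is an 8-byte signed integer, so the values are cast to int64 first)
values = arr.astype(np.int64).tolist()
t_struct = timeit.timeit(lambda: struct.pack('4096q', *values), number=n)

# Single bulk copy of the array's buffer, as used for the payload in Answer 1
t_tobytes = timeit.timeit(lambda: arr.tobytes(), number=n)

print(f"struct.pack: {t_struct / n * 1e6:.1f} µs per call")
print(f"tobytes():   {t_tobytes / n * 1e6:.1f} µs per call")
```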

Answer 3

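You can check Mark Setchell's answer for how to actually write the bytes into Redis. Below I have rewritten the functions fromRedis and toRedis to handle arrays with a variable number of dimensions and to include the dtype along with the array shape.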

import numpy as np

def toRedis(arr: np.ndarray) -> bytes:
    # Layout: b"<dtype>|<comma-separated shape>|<raw array bytes>"
    arr_dtype = bytearray(str(arr.dtype), 'utf-8')
    arr_shape = bytearray(','.join([str(a) for a in arr.shape]), 'utf-8')
    sep = bytearray('|', 'utf-8')
    arr_bytes = arr.ravel().tobytes()
    to_return = arr_dtype + sep + arr_shape + sep + arr_bytes
    return bytes(to_return)

def fromRedis(serialized_arr: bytes) -> np.ndarray:
    sep = '|'.encode('utf-8')
    i_0 = serialized_arr.find(sep)
    i_1 = serialized_arr.find(sep, i_0 + 1)
    arr_dtype = serialized_arr[:i_0].decode('utf-8')
    arr_shape = tuple([int(a) for a in serialized_arr[i_0 + 1:i_1].decode('utf-8').split(',')])
    arr_str = serialized_arr[i_1 + 1:]
    arr = np.frombuffer(arr_str, dtype = arr_dtype).reshape(arr_shape)
    return arr

Answer 4

Try Plasma, as it avoids the serialization/deserialization overhead.

Install plasma using pip install pyarrow

Documentation: https://arrow.apache.org/docs/python/plasma.html

First, launch plasma with 1 GB of memory [terminal]:

plasma_store -m 1000000000 -s /tmp/plasma

import pyarrow.plasma as pa
import numpy as np
client = pa.connect("/tmp/plasma")
temp = np.random.rand(80,80)

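Write time: 130 µs vs 782 µs (Redis implementation: Mark Setchell's answer)

Write time can be improved by using plasma huge pages, but these are only available on Linux machines: https://arrow.apache.org/docs/python/plasma.html#using-plasma-with-huge-pages

Fetch time: 31.2 µs vs 99.5 µs (Redis implementation: Mark Setchell's answer)

PS: The code was run on a MacPro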

Answer 5

The tobytes() function is not very storage efficient. To reduce the storage that has to be written to the redis server, you can use the base64 package:

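import base64
import numpy as np

def encode_vector(ar):
    return base64.encodebytes(ar.tobytes()).decode('ascii')

def decode_vector(ar):
    # Accepts the str from encode_vector or the bytes Redis returns
    raw = ar.encode('ascii') if isinstance(ar, str) else ar
    return np.frombuffer(base64.decodebytes(raw), dtype='uint16')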

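@EDIT: OK, since Redis stores values as byte strings, it is more storage efficient to store the byte string directly. However, if you convert it to a string, print it to the console, or store it in a text file, it makes sense to do the encoding.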
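To make the point of the edit concrete, here is a small size comparison for an 80x80 uint16 array. It is a sketch rather than part of the answer: it uses base64.b64encode instead of the answer's encoding call so that the output contains no line breaks, and the variable names are illustrative.

```python
import base64
import numpy as np

a = np.arange(6400, dtype=np.uint16).reshape(80, 80)

raw = a.tobytes()                 # what r.set() would receive directly
encoded = base64.b64encode(raw)   # ASCII-safe, but 4 output bytes per 3 input bytes

raw_size = len(raw)
encoded_size = len(encoded)
print(raw_size, encoded_size)

# Decoding restores the original bytes exactly
restored = np.frombuffer(base64.b64decode(encoded), dtype=np.uint16).reshape(80, 80)
np.testing.assert_array_equal(a, restored)
```

The base64 form is about a third larger than the raw bytes, so it only pays off when the value has to pass through a text-only channel.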

5 answers

Answer 1: 24, accepted, 2019-03-23 11:38:25
Answer 2: 7, 2020-03-05 02:14:06
Answer 3: 4, 2020-02-28 18:59:07
Answer 4: 3, 2020-09-24 14:04:31
Answer 5: 1, 2019-09-05 06:43:37
