Efficiently remove every 4th byte from the data bytes of a numpy.int32 array

Question

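I have a large numpy.int32 array that can take 4 GB or more. It is actually an array of 24-bit integers (quite common in audio applications), but since numpy.int24 doesn't exist, I used int32.

I want to write this array's data to a file as 24-bit, i.e. 3 bytes per number.

This works (I found this "recipe" somewhere a while ago, but I can't find it anymore):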

import numpy as np
x = np.array([[-33772, -2193], [13313, -1314], [20965, -1540], [10706, -5995], [-37719, -5871]], dtype=np.int32)
data = ((x.reshape(x.shape + (1,)) >> np.array([0, 8, 16])) & 255).astype(np.uint8)
print(data.tobytes())  # tobytes() replaces the deprecated tostring()
# b'\x14|\xffo\xf7\xff\x014\x00\xde\xfa\xff\xe5Q\x00\xfc\xf9\xff\xd2)\x00\x95\xe8\xff\xa9l\xff\x11\xe9\xff'

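But the reshape and the intermediate arrays make it inefficient when x is a few GB: it needs a huge amount of unneeded RAM.

Another solution is to remove every 4th byte: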
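To see where the extra RAM goes, the intermediate arrays of the recipe above can be measured on a small input. This is only an illustrative sketch: the names `shifted` and `masked` are introduced here, and the exact dtype (and therefore size) of the intermediates depends on the platform and NumPy version.

```python
import numpy as np

x = np.array([[-33772, -2193], [13313, -1314]], dtype=np.int32)

# Broadcasting against the three shift amounts creates a new array with
# three elements per sample, each at least as wide as an int32.
shifted = x.reshape(x.shape + (1,)) >> np.array([0, 8, 16])
masked = shifted & 255            # a second full-size temporary
data = masked.astype(np.uint8)    # only this array has 3 bytes per sample

# Compare the sizes in bytes of the input and of each intermediate
print(x.nbytes, shifted.nbytes, masked.nbytes, data.nbytes)
```

The reshape itself returns a view; the cost comes from the temporaries created by `>>` and `&`, each of which is at least three times the size of `x`.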
```
s = bytes([c for i, c in enumerate(x.tobytes()) if i % 4 != 3])
# b'\x14|\xffo\xf7\xff\x014\x00\xde\xfa\xff\xe5Q\x00\xfc\xf9\xff\xd2)\x00\x95\xe8\xff\xa9l\xff\x11\xe9\xff'
```
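It works, but I suspect that if x takes 4 GB of RAM, this line would take at least 8 GB of RAM for s and x together (and maybe more for the temporary bytes copy of x?).

TL;DR: How can I efficiently (without using twice the actual data size in RAM) write an int32 array to disk as a 24-bit array, by removing every 4th byte?

Note: this is possible because the integers are in fact 24-bit, i.e. the absolute value of each value is < 2^23-1.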
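Dropping the top byte is only lossless if that assumption really holds, so it may be worth checking before writing. The helper below is a hypothetical sketch (the name `fits_int24` is not from the original post); it tests the full signed 24-bit range, which is slightly wider than the bound stated in the note.

```python
import numpy as np

def fits_int24(a):
    """Return True if every value of `a` fits in a signed 24-bit integer."""
    # min() and max() are reductions, so no full-size temporary is created
    return bool(a.min() >= -2**23 and a.max() <= 2**23 - 1)

x = np.array([[-33772, -2193], [13313, -1314]], dtype=np.int32)
print(fits_int24(x))                                   # True
print(fits_int24(np.array([2**23], dtype=np.int32)))   # False
```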

Answer 1

Assuming x is C-contiguous and your platform is little-endian (otherwise a few small adjustments are needed), you can do it like this:

import numpy as np

# Input data
x = np.array([[-33772, -2193], [13313, -1314], [20965, -1540],
              [10706, -5995], [-37719, -5871]], dtype=np.int32)
# Make 24-bit uint8 view
x2 = np.ndarray(shape=x.shape + (3,), dtype=np.uint8, buffer=x, offset=0,
                strides=x.strides + (1,))
print(x2.tobytes())
# b'\x14|\xffo\xf7\xff\x014\x00\xde\xfa\xff\xe5Q\x00\xfc\xf9\xff\xd2)\x00\x95...
np.save('data.npy', x2)  # Save to disk

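In this example, note that:

We add one dimension: x.shape + (3,) is (5, 2, 3).
x2 is essentially a view of x, that is, it uses the same data.
The trick is in the strides. x.strides + (1,) is (8, 4, 1) here. Each new row of x advances 8 bytes with respect to the previous row, and each new column advances 4 bytes. In x2, I append 1 to the strides, so each item in the new innermost dimension advances 1 byte with respect to the previous one. If the shape of x2 were (5, 2, 4) (that is, using + (4,) instead of + (3,)), it would contain the same bytes as x, but since it is (5, 2, 3), the last byte of each value is simply "skipped".

You can recover it with: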
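The points above can be checked directly on the small example. This is a minimal sketch that assumes a little-endian platform, as the answer does; `np.shares_memory` is used here only to confirm that no sample data is copied.

```python
import numpy as np

x = np.array([[-33772, -2193], [13313, -1314], [20965, -1540],
              [10706, -5995], [-37719, -5871]], dtype=np.int32)

x2 = np.ndarray(shape=x.shape + (3,), dtype=np.uint8, buffer=x,
                strides=x.strides + (1,))

print(x.strides)                 # (8, 4)
print(x2.shape, x2.strides)      # (5, 2, 3) (8, 4, 1)
print(np.shares_memory(x, x2))   # True: x2 reads the bytes of x in place

# Little-endian layout: the three kept bytes of the first sample are the
# low, middle and high byte of -33772 in two's complement (0xFF7C14).
print(x2[0, 0])                  # [ 20 124 255]
```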


x2 = np.load('data.npy', mmap_mode='r')  # Use mmap to avoid using extra memory
x3 = np.zeros(x2.shape[:-1] + (4,), np.uint8)
x3[..., :3] = x2
del x2  # Release mmap
# Fix negative sign in last byte (could do this in a loop
# or in "batches" if you want to avoid the intermediate
# array from the "&" operation, or with Numba)
x3[..., 3] = np.where(x3[..., 2] & 128, 255, 0)
# Make int32 view
x4 = np.ndarray(x3.shape[:-1], np.int32, buffer=x3, offset=0, strides=x3.strides[:-1])
print(x4)
# [[-33772  -2193]
#  [ 13313  -1314]
#  [ 20965  -1540]
#  [ 10706  -5995]
#  [-37719  -5871]]
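As a variation on the sign fix above, the three bytes can be placed in the upper part of each 4-byte slot and then shifted right in place, which lets the arithmetic shift do the sign extension. This is a sketch of an alternative, not the answer's method; the names `packed`, `padded` and `restored` are introduced here, and a little-endian platform is assumed.

```python
import numpy as np

x = np.array([[-33772, -2193], [13313, -1314], [20965, -1540],
              [10706, -5995], [-37719, -5871]], dtype=np.int32)
# 3-byte little-endian samples, standing in for data read back from disk
packed = x.view(np.uint8).reshape(x.shape + (4,))[..., :3].copy()

# Put the 3 bytes into the upper 3 bytes of a zeroed 4-byte slot ...
padded = np.zeros(packed.shape[:-1] + (4,), dtype=np.uint8)
padded[..., 1:] = packed
# ... then view each slot as an int32 and shift right by 8 bits in place.
# The right shift on signed integers copies the sign bit into the top byte.
restored = padded.view(np.int32)[..., 0]
restored >>= 8

print(np.array_equal(restored, x))   # True
```

Because the shift is done in place on a view of `padded`, no additional full-size temporary is needed for the sign fix.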

Answer 2

After some more fiddling, I found that this works:

import numpy as np
x = np.array([[-33772,-2193],[13313,-1314],[20965,-1540],[10706,-5995],[-37719,-5871]], dtype=np.int32)
x2 = x.view(np.uint8).reshape(-1,4)[:,:3]
print(x2.tobytes())
# b'\x14|\xffo\xf7\xff\x014\x00\xde\xfa\xff\xe5Q\x00\xfc\xf9\xff\xd2)\x00\x95\xe8\xff\xa9l\xff\x11\xe9\xff'

Here is a time + memory benchmark:

import numpy as np, time
t0 = time.time()
x = np.random.randint(10000, size=(125_000_000, 2), dtype=np.int32)  # 125M * 2 * 4 bytes ~ 1GB of RAM
print('Random array generated in %.1f sec.' % (time.time() - t0))
time.sleep(5)  
# you can check the RAM usage in the task manager in the meantime...
t0 = time.time()
x2 = x.view(np.uint8).reshape(-1,4)[:,:3]
x2.tofile('test')
print('24-bit output file written in %.1f sec.' % (time.time() - t0))

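Results:

Random array generated in 4.6 sec.
24-bit output file written in 35.9 sec.

In addition, only ~1 GB was used during the whole processing (monitored with the Windows Task Manager).

@jdehesa's method gives similar results, i.e. if we use this line instead: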

x2 = np.ndarray(shape=x.shape + (3,), dtype=np.uint8, buffer=x, offset=0, strides=x.strides + (1,))

The RAM usage of the process also peaked at 1 GB, and the time spent on x2.tofile(...) was about 37 seconds.
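The similar memory figures are consistent with both expressions being views over the same buffer. The sketch below compares them on the small example; `a` and `b` are names introduced here, and `tobytes()` is used only because the test array is tiny (it makes a copy).

```python
import numpy as np

x = np.array([[-33772, -2193], [13313, -1314], [20965, -1540],
              [10706, -5995], [-37719, -5871]], dtype=np.int32)

# View from this answer: reinterpret as bytes, group by 4, keep the first 3
a = x.view(np.uint8).reshape(-1, 4)[:, :3]
# View from Answer 1: extra axis of length 3 with a 1-byte stride
b = np.ndarray(shape=x.shape + (3,), dtype=np.uint8, buffer=x,
               strides=x.strides + (1,))

print(np.shares_memory(a, x), np.shares_memory(b, x))   # True True
print(a.flags['C_CONTIGUOUS'])                          # False
print(a.tobytes() == b.tobytes())                       # True
```

Both views are non-contiguous, which may be part of why a single tofile call on them is slow, although that is not verified here.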

Answer 3

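I ran your code and got timings similar to your 35 seconds, but that seems far too slow for 750 MB when my SSD can do 2 GB/s. I can't imagine why it is so slow. So I decided to use OpenCV's highly optimized SIMD code, which reduces RGBA8888 images to RGB888 by stripping the alpha/transparency byte out of every 4 bytes. That is equivalent to converting 32-bit to 24-bit.

In order not to use too much extra memory, I process the data in chunks of 1,000,000 stereo samples (6 MB) at a time and append each chunk to the output file. It runs in under 1 second, and the file compares identical to the one created by your code.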

#!/usr/bin/env python3

import numpy as np
import cv2

def orig(x):
    x2 = x.view(np.uint8).reshape(-1,4)[:,:3]
    x2.tofile('orig.dat')

def chunked(x):
    BATCHSIZE = 1_000_000
    l = len(x)
    with open('test.dat', 'wb') as file:  # binary mode, since raw bytes are written
        for b in range(0,l,BATCHSIZE):
            s = min(BATCHSIZE,l-b)
            y = x[b:b+s,:].view(np.uint8).reshape(s*2,1,4) 
            z = cv2.cvtColor(y,cv2.COLOR_BGRA2BGR)
            # Append to file
            z.tofile(file)
            if b+s == l:
                break


# Repeatable randomness
np.random.seed(42)                                                                                         
# Create array of stereo samples
NSAMPLES = 125_000_000
x = np.random.randint(10000, size=(NSAMPLES, 2), dtype=np.int32)

# orig(x)
chunked(x)
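If OpenCV is not available, the same batching idea can be sketched with NumPy alone: copy one batch at a time into a small contiguous buffer and write that. This is a hedged sketch rather than a benchmarked replacement; `write_int24_chunked` and its `batch` parameter are hypothetical names, and the code assumes `x` is a C-contiguous 2-D int32 array on a little-endian platform.

```python
import os
import tempfile
import numpy as np

def write_int24_chunked(x, path, batch=1_000_000):
    """Write int32 array `x` to `path` as packed 24-bit samples, batch by batch."""
    with open(path, 'wb') as f:
        for start in range(0, len(x), batch):
            chunk = x[start:start + batch]
            # Strided view keeping 3 of every 4 bytes (no copy yet)
            view = chunk.view(np.uint8).reshape(-1, 4)[:, :3]
            # Copy only this batch into a contiguous buffer and append it
            np.ascontiguousarray(view).tofile(f)

# Usage on the small example, with a tiny batch to exercise the loop
x = np.array([[-33772, -2193], [13313, -1314], [20965, -1540],
              [10706, -5995], [-37719, -5871]], dtype=np.int32)
path = os.path.join(tempfile.mkdtemp(), 'test24.dat')
write_int24_chunked(x, path, batch=2)

with open(path, 'rb') as f:
    written = f.read()
print(written == x.view(np.uint8).reshape(-1, 4)[:, :3].tobytes())   # True
```

The extra memory is bounded by one batch (3 bytes per sample), so `batch` trades peak RAM against the number of write calls.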

3 answers

Answer 1
2 2020-07-29 15:53:37

Answer 2
2 2020-07-29 16:29:42

Answer 3
1 2020-08-02 23:56:38
