
Lossy compression of numpy array (image, uint8) in memory

I am trying to load a data set of 1,000,000 images into memory. As standard numpy arrays (uint8) all images combined fill around 100 GB of RAM, but I need to get this down to < 50 GB while still being able to quickly read the images back into numpy (that's the whole point of keeping everything in memory). Lossless compression like blosc only reduces file size by around 10%, so I went to JPEG compression. Minimal example:

import io
import numpy as np
from PIL import Image

numpy_array = (255 * np.random.rand(256, 256, 3)).astype(np.uint8)
image = Image.fromarray(numpy_array)
output = io.BytesIO()
image.save(output, format='JPEG')

At runtime I am reading the images with:

[np.array(Image.open(output)) for _ in range(1000)]

JPEG compression is very effective (< 10 GB), but the time it takes to read 1000 images back into numpy arrays is around 2.3 seconds, which seriously hurts the performance of my experiments. I am searching for suggestions that give a better trade-off between compression and read-speed.
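For reference, the setup in the question can be reproduced self-contained by keeping one JPEG buffer per image and decoding them back (the variable names here are illustrative, not from the question):

```python
import io
import numpy as np
from PIL import Image

# Build a few random RGB test images (uint8, 256x256x3).
images = [(255 * np.random.rand(256, 256, 3)).astype(np.uint8) for _ in range(10)]

# Compress each image into its own in-memory JPEG buffer.
buffers = []
for arr in images:
    buf = io.BytesIO()
    Image.fromarray(arr).save(buf, format='JPEG', quality=50)
    buffers.append(buf)

# Decode all buffers back into numpy arrays.
decoded = []
for buf in buffers:
    buf.seek(0)  # rewind before decoding
    decoded.append(np.array(Image.open(buf)))

raw_bytes = sum(a.nbytes for a in images)
jpeg_bytes = sum(b.getbuffer().nbytes for b in buffers)
print("raw: {} bytes, jpeg: {} bytes".format(raw_bytes, jpeg_bytes))
```

Keeping one `BytesIO` per image avoids mixing all compressed files in a single stream, so each image can be decoded independently.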

I am still not certain I understand what you are trying to do, but I created some dummy images and did some tests as follows. I'll show how I did that in case other folks feel like trying other methods and want a data set.

First, I created 1,000 images using GNU Parallel and ImageMagick like this:

parallel convert -depth 8 -size 256x256 xc:red +noise random -fill white -gravity center -pointsize 72 -annotate 0 "{}" -alpha off s_{}.png ::: {0..999}

That gives me 1,000 images called s_0.png through s_999.png and image 663 looks like this:
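If ImageMagick is not available, a roughly similar set of numbered noise images can be generated with PIL alone (this is a sketch, not an exact equivalent of the command above):

```python
import numpy as np
from PIL import Image, ImageDraw

# Generate numbered noise images, similar in spirit to the ImageMagick output.
images = []
for i in range(10):  # use range(1000) for the full set
    # Random RGB noise as the background.
    noise = (255 * np.random.rand(256, 256, 3)).astype(np.uint8)
    im = Image.fromarray(noise)
    draw = ImageDraw.Draw(im)
    # Draw the image number near the centre (default bitmap font).
    draw.text((120, 124), str(i), fill='white')
    images.append(im)

# im.save("s_{}.png".format(i)) inside the loop would write them to disk
# with the same names as the ImageMagick command.
```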

[sample image: s_663.png, red noise with the number 663]

Then I did what I think you are trying to do - though it is hard to tell from your code:

#!/usr/local/bin/python3

import io
import time
import numpy as np
from PIL import Image

# Create BytesIO object
output = io.BytesIO()

# Load all 1,000 images and write into BytesIO object
for i in range(1000):
    name = "s_{}.png".format(i)
    print("Opening image: {}".format(name))
    im = Image.open(name)
    im.save(output, format='JPEG', quality=50)
    nbytes = output.getbuffer().nbytes
    print("BytesIO size: {}".format(nbytes))

# Read back images from BytesIO into list
start = time.perf_counter()
l = [np.array(Image.open(output)) for _ in range(1000)]
diff = time.perf_counter() - start
print("Time: {}".format(diff))

And that takes 2.4 seconds to read all 1,000 images from the BytesIO object and turn them into numpy arrays.

Then, I palettised the images by reducing them to 256 colours (which I agree is lossy - just like your method) and saved a list of palettised image objects, which I can readily convert back to numpy arrays later by simply calling:

np.array(ImageList[i].convert('RGB'))

Storing the data as a palettised image saves 66% of the space because you only store one byte of palette index per pixel rather than 3 bytes of RGB, so it is better than the 50% compression you seek.
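The 66% figure can be checked directly: a palettised ('P' mode) image stores one byte of palette index per pixel, versus three bytes for RGB (a minimal sketch, variable names illustrative):

```python
import numpy as np
from PIL import Image

# One random RGB test image and its 256-colour palettised version.
rgb = Image.fromarray((255 * np.random.rand(256, 256, 3)).astype(np.uint8))
pal = rgb.quantize(colors=256, method=2)  # method 2 = fast octree

rgb_bytes = np.array(rgb).nbytes  # 3 bytes per pixel
pal_bytes = np.array(pal).nbytes  # 1 byte per pixel (palette indices)
print(rgb_bytes, pal_bytes)       # 196608 vs 65536
```

The small fixed-size palette (at most 256 RGB entries, i.e. 768 bytes) is negligible next to the per-pixel saving.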

#!/usr/local/bin/python3

import io
import time
import numpy as np
from PIL import Image

# Empty list of images
ImageList = []

# Load all 1,000 images
for i in range(1000):
    name = "s_{}.png".format(i)
    print("Opening image: {}".format(name))
    im = Image.open(name)
    # Add palettised image to list
    ImageList.append(im.quantize(colors=256, method=2))

# Read back images into numpy arrays
start = time.perf_counter()
l = [np.array(ImageList[i].convert('RGB')) for i in range(1000)]
diff = time.perf_counter() - start
print("Time: {}".format(diff))

# Quick test
# Image.fromarray(l[999]).save("result.png")

That now takes 0.2s instead of 2.4s - let's hope the loss of colour accuracy is acceptable to your unstated application :-)
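Whether the colour loss is acceptable can be estimated numerically, for example with the mean absolute per-channel error between the original and its palettised round trip (a sketch, not part of the original answer):

```python
import numpy as np
from PIL import Image

# Random test image; replace with a real image from the data set.
original = (255 * np.random.rand(64, 64, 3)).astype(np.uint8)
im = Image.fromarray(original)

# Quantize to 256 colours and convert back to RGB, as in the answer.
restored = np.array(im.quantize(colors=256, method=2).convert('RGB'))

# Mean absolute error per channel value, on a 0-255 scale.
mae = np.mean(np.abs(original.astype(np.int16) - restored.astype(np.int16)))
print("mean absolute error: {:.1f}".format(mae))
```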
