How to make OpenCV cv2.imread faster in Python after a reboot

Question

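I have ~650,000 image files that I convert to NumPy arrays with cv2. The images are arranged in subfolders, with ~10k images in each. Each image is tiny: about 600 bytes (2x100 pixels, RGB).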

When I read them all using:

cv2.imread()

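it takes half a second per 10k images, about a minute for all 650k... unless I reboot the machine. Then, the first time I run the script after a reboot, it takes 20-50 seconds per 10k images; half an hour or so for the full read.

Why?

How can I access them quickly after a reboot, without the extremely slow initial read?

The historic image database grows daily; older images are never rewritten.

Code: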

print 'Building historic database...'
elapsed = elapsed2 = time.time()
def get_immediate_subdirectories(a_dir):
    return [name for name in os.listdir(a_dir)
            if os.path.isdir(os.path.join(a_dir, name))]
compare = get_immediate_subdirectories('images_old')
compare.sort()

images = []
for j in compare:
    begin = 1417024800
    end =  1500000000
    if ASSET == j:
        end = int(time.time()-86400*30)
    tally = 0
    for i in range (begin, end, 7200):
        try:
            im = cv2.imread("images_old/%s/%s_%s.png" % (j,j,i))
            # imread returns None for a missing or unreadable file,
            # so check before flattening
            if im is not None:
                im = im.flatten()
                images.append([j,i,im])
                tally+=1
        except: pass
    print  j.ljust(5), ('cv2 imread elapsed: %.2f items: %s' % ((time.time()-elapsed),tally))
    elapsed = time.time()
print '%.2f cv2 imread big data: %s X %s items' % ((time.time()-elapsed2),len(images),len(a1))
elapsed = time.time()

AMD FM2+, 16 GB RAM, Linux Mint 17.3, Python 2.7

Answer 1

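I would like to suggest an approach based on Redis, which is like a database but is really a "data structure server", where the data structures are your 600-byte images. I am not suggesting that you rely on Redis as a permanent storage system; rather, keep using your 650,000 files but cache them in Redis, which is free and available for Linux, macOS and Windows.

So, basically, at any time of day you like, you can copy your images into Redis, ready for the next reboot.

I don't speak Python, but here is a Perl script that I used to generate 650,000 images of 600 random bytes each and insert them into a Redis hash. The corresponding Python is easy to write: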

#!/usr/bin/perl
################################################################################
# generator <number of images> <image size in bytes>
# Mark Setchell
# Generates and sends "images" of specified size to REDIS
################################################################################
use strict;
use warnings FATAL => 'all';
use Redis;
use Time::HiRes qw(time);

my $Debug=1;    # set to 1 for debug messages

my $nargs = $#ARGV + 1;
if ($nargs != 2) {
    print "Usage: generator <number of images> <image size in bytes>\n";
    exit 1;
}

my $nimages=$ARGV[0];
my $imsize=$ARGV[1];
my @bytes=(q(a)..q(z),q(A)..q(Z),q(0)..q(9));
my $bl = scalar @bytes - 1;

printf "DEBUG: images: $nimages, size: $imsize\n" if $Debug;

# Connection to REDIS
my $redis = Redis->new;
my $start=time;

for(my $i=0;$i<$nimages;$i++){
   # Generate our 600 byte "image"
   my $image;
   for(my $j=0;$j<$imsize;$j++){
      $image .= $bytes[rand @bytes];
   }
   # Load it into a REDIS hash called 'im' indexed by an integer number
   $redis->hset('im',$i,$image);
   print "DEBUG: Sending key:images, field:$i, value:$image\n" if $Debug;
}
my $elapsed=time-$start;
printf "DEBUG: Sent $nimages images of $imsize bytes in %.3f seconds, %d images/s\n",$elapsed,int($nimages/$elapsed)

So you can insert 650,000 images of 600 bytes each into a Redis hash called "im", indexed by a simple integer [0..649999].
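A rough Python counterpart of the same hset/hget idea might look like the sketch below. It assumes the redis-py package and a server on the default port, neither of which appears in the answer. The names DictHashStore, cache_images and fetch_images are hypothetical. The two functions only need an object with hset(name, key, value) and hget(name, key), so a small in-memory stand-in is included to keep the example runnable without a server.

```python
class DictHashStore(object):
    """In-memory stand-in exposing the two hash calls used below.

    Hypothetical helper for trying the sketch without a server. With
    redis-py installed, a redis.Redis() client could be passed instead
    (an assumption; the answer itself only shows Perl).
    """

    def __init__(self):
        self._hashes = {}

    def hset(self, name, key, value):
        self._hashes.setdefault(name, {})[key] = value

    def hget(self, name, key):
        return self._hashes.get(name, {}).get(key)


def cache_images(client, images, hash_name="im"):
    """Store each (key, raw_bytes) pair in the hash, one hset per image."""
    count = 0
    for key, raw in images:
        client.hset(hash_name, key, raw)
        count += 1
    return count


def fetch_images(client, keys, hash_name="im"):
    """Return the stored bytes for each key, in order (None if missing)."""
    return [client.hget(hash_name, key) for key in keys]


# Five stand-in "images" of 600 zero bytes each, keyed by integer
store = DictHashStore()
fake_images = [(i, bytes(bytearray(600))) for i in range(5)]
stored = cache_images(store, fake_images)
restored = fetch_images(store, range(5))
```

Using file names instead of integers as keys would only change the first element of each pair.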

Now, if you stop Redis and check the size of the database, it is 376 MB:

ls -lhrt dump.rdb

-rw-r--r--  1 mark  admin   376M 29 May 20:00 dump.rdb

If you now kill Redis and restart it, it takes 2.862 seconds to start up and load the 650,000-image database:

redis-server /usr/local/etc/redis.conf

                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 3.2.9 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 33802
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

33802:M 29 May 20:00:57.698 # Server started, Redis version 3.2.9
33802:M 29 May 20:01:00.560 * DB loaded from disk: 2.862 seconds
33802:M 29 May 20:01:00.560 * The server is now ready to accept connections on port 6379

So you can have Redis up within 3 seconds of a reboot. Then you can query and load the 650,000 images like this:

#!/usr/bin/perl
################################################################################
# reader
# Mark Setchell
# Reads specified number of images from Redis
################################################################################
use strict;
use warnings FATAL => 'all';
use Redis;
use Time::HiRes qw(time);

my $Debug=0;    # set to 1 for debug messages
my $nargs = $#ARGV + 1;
if ($nargs != 1) {
    print "Usage: reader <number of images>\n";
    exit 1;
}

my $nimages=$ARGV[0];

# Connection to REDIS
my $redis = Redis->new;
my $start=time;

for(my $i=0;$i<$nimages;$i++){
   # Retrieve image from hash named "im" with key=$i
   my $image = $redis->hget('im',$i);
   print "DEBUG: Received image $i\n" if $Debug;
}
my $elapsed=time-$start;
printf "DEBUG: Received $nimages images in %.3f seconds, %d images/s\n",$elapsed,int($nimages/$elapsed)

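On my Mac, this reads 650,000 images of 600 bytes each in 61 seconds, so your total startup time would be 64 seconds.

Sorry, I don't yet know enough Python to do this in Python, but I suspect the timings would be very similar.

I am basically using a Redis hash called "im", with hset and hget, and indexing the images by a simple integer. However, Redis keys are binary safe, so you could use file names as keys instead of integers. You can also interact with Redis from the command line (without Python or Perl), so you can get a list of the 650,000 keys (file names) with: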

redis-cli <<< "hkeys im"

or retrieve a single image (with key/filename="1"):

 redis-cli <<< "hget 'im' 1"

If you don't have bash, you can do:

echo "hget 'im' 1" | redis-cli

or

echo "hkeys im" | redis-cli

I have just been reading about persisting/serializing NumPy arrays, which may be an even simpler option than Redis.
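One way to try that idea is sketched below. It assumes every image decodes to the same 2x100x3 uint8 shape (the size stated in the question), so that all of them stack into a single array. np.save and np.load are standard NumPy calls; the file name and the helper names save_stack and load_stack are hypothetical.

```python
import os
import tempfile

import numpy as np


def save_stack(images, path):
    """Stack equally shaped image arrays and write them as one .npy file."""
    stack = np.stack(images)  # shape: (n_images, height, width, channels)
    np.save(path, stack)
    return stack.shape


def load_stack(path):
    """Read the whole stack back from the single file."""
    return np.load(path)


# Stand-in data: 1,000 random "images" shaped like the 2x100 RGB files
rng = np.random.RandomState(0)
fake = [rng.randint(0, 256, size=(2, 100, 3)).astype(np.uint8)
        for _ in range(1000)]

npy_path = os.path.join(tempfile.mkdtemp(), "images.npy")
saved_shape = save_stack(fake, npy_path)
loaded = load_stack(npy_path)
```

The folder name and timestamp that identify each row are not stored here; they would need to be saved alongside, for example in a second array written in the same order.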

Answer 2

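I was thinking overnight, and there is an even simpler, faster solution...

Basically, at any time of day you like, you walk the filesystem of your existing image files and produce a flattened representation of them in two files. Then, at startup, you just read the flattened representation, which is a single 300 MB contiguous file on disk that can be read in 2-3 seconds.

So, the first file is called "flat.txt" and contains one line per file, like this, but actually 650,000 lines long: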

filename:width:height:size
filename:width:height:size
...
filename:width:height:size

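The second file is just a binary file with the contents of each listed file appended to it, so it is a contiguous 360 MB binary file called "flat.bin".

Here is how I create the two files in Perl, using a script called flattener.pl: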

#!/usr/bin/perl
use strict;
use warnings;

use File::Find;

# Names of the index and bin files
my $idxname="flat.txt";
my $binname="flat.bin";

# Open index file, which will have format:
#    fullpath:width:height:size
#    fullpath:width:height:size
open(my $idx,'>',$idxname) or die "Cannot open $idxname: $!";

# Open binary file - simply all images concatenated
open(my $bin,'>',$binname) or die "Cannot open $binname: $!";
binmode($bin);

# Save time we started parsing filesystem
my $atime = my $mtime = time;

find(sub {
  # Only parse actual files (not directories) with extension "png"
  if (-f and /\.png$/) {
    # Get full path filename, filesize in bytes
    my $path   = $File::Find::name;
    my $nbytes = -s;
    # Write name and vital statistics to index file
    print $idx "$path:100:2:$nbytes\n";
    # Slurp entire file and append to binary file
    my $image = do {
       local $/ = undef;
       open my $fh, "<", $path;
       <$fh>;
    };
    print $bin $image;
  }
}, '/path/to/top/directory');

close($idx);
close($bin);

# Set atime and mtime of index file to match time we started parsing
utime($atime, $mtime, $idxname) or warn "Couldn't touch $idxname: $!";

Then, when you want to start up, you run loader.pl, which looks like this:

#!/usr/bin/perl
use strict;
use warnings;

# Open index file, which will have format:
#    fullpath:width:height:size
#    fullpath:width:height:size
open(my $idx, '<', 'flat.txt');

# Open binary file - simply all images concatenated
open(my $bin, '<', 'flat.bin');

# Read index file, one line at a time
my $total=0;
my $nfiles=0;
while ( my $line = <$idx> ) {
    # Remove CR or LF from end of line
    chomp $line;

    # Parse line into: filename, width, height and size
    my ($name,$width,$height,$size) = split(":",$line);

    print "Reading file: $name, $width x $height, bytes:$size\n";
    my $bytes_read = read $bin, my $bytes, $size;
    if($bytes_read != $size){
       print "ERROR: File=$name, expected size=$size, actually read=$bytes_read\n"
    }
    $total += $bytes_read;
    $nfiles++;
}
print "Read $nfiles files, and $total bytes\n";

close($idx);
close($bin);

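This takes under 3 seconds with 497,000 files of 600 bytes each.

So, what about files that have changed since you ran the flattener.pl script? Well, at the start of flattener.pl, I get the system time in seconds since the epoch. Then, at the end, when I have finished parsing the 650,000 files and written out the flattened files, I set the index file's modification time back to just before I started parsing. Then in your code, all you need to do is load the files with the loader.pl script, do a quick find of all image files newer than the index file, and load those few extra files with your existing method.

In bash, that would be: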

find . -newer flat.txt -print
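The same check could be done from inside the Python script, roughly as sketched here. The function name newer_than is hypothetical, and unlike the find command above it only reports files with the given suffix rather than every newer entry.

```python
import os


def newer_than(index_path, top_dir, suffix=".png"):
    """Return files under top_dir modified after index_path was."""
    cutoff = os.path.getmtime(index_path)
    newer = []
    for dirpath, _dirnames, filenames in os.walk(top_dir):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Keep only image files whose mtime is later than the index file's
            if name.endswith(suffix) and os.path.getmtime(full) > cutoff:
                newer.append(full)
    return sorted(newer)
```

The returned paths are the ones still to be loaded with the existing cv2.imread loop.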

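As you are reading images with OpenCV, you will need to run imdecode() on the raw file data, so I would benchmark whether to do that when flattening or when loading.

Again, sorry it is in Perl, but I am sure it can be done just the same in Python.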
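A Python version of the same two-file scheme might look like the following sketch. The function names flatten and load_flat are hypothetical, the width and height are hard-coded as in the Perl script, and the step that resets the index file's modification time is left out.

```python
import os


def flatten(top_dir, idx_path="flat.txt", bin_path="flat.bin",
            width=100, height=2):
    """Concatenate every .png under top_dir into bin_path, indexed by idx_path."""
    count = 0
    with open(idx_path, "w") as idx, open(bin_path, "wb") as out:
        for dirpath, dirnames, filenames in os.walk(top_dir):
            dirnames.sort()  # visit subfolders in a repeatable order
            for name in sorted(filenames):
                if not name.endswith(".png"):
                    continue
                path = os.path.join(dirpath, name)
                with open(path, "rb") as fh:
                    data = fh.read()
                # One index line per file: fullpath:width:height:size
                idx.write("%s:%d:%d:%d\n" % (path, width, height, len(data)))
                out.write(data)
                count += 1
    return count


def load_flat(idx_path="flat.txt", bin_path="flat.bin"):
    """Return a list of (path, raw_bytes) read sequentially from the flat files."""
    records = []
    with open(idx_path) as idx, open(bin_path, "rb") as blob:
        for line in idx:
            # Split from the right so a ":" inside the path does not break parsing
            path, _width, _height, size = line.rstrip("\n").rsplit(":", 3)
            data = blob.read(int(size))
            if len(data) != int(size):
                raise IOError("short read for %s" % path)
            records.append((path, data))
    return records
```

Each raw_bytes value is still an encoded PNG, so it would need a call such as cv2.imdecode(np.frombuffer(data, np.uint8), cv2.IMREAD_COLOR) to become a pixel array.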

Answer 3

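Have you checked that the disk is not the bottleneck? After the first read, the OS can cache the image files and then serve them from memory. If your files add up to enough data (10-20 GB), a slow HDD can take several minutes to read them.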
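If the caching explanation holds, one inexpensive experiment is to read every file once shortly after boot (for example from a background job), so that the later imread calls are served from memory. The sketch below does only that. The function name warm_cache is hypothetical, and it will not help if the files do not fit in RAM alongside everything else.

```python
import os
import time


def warm_cache(top_dir, suffix=".png"):
    """Read every matching file once so the OS can keep it in its cache."""
    n_files = 0
    n_bytes = 0
    start = time.time()
    for dirpath, _dirnames, filenames in os.walk(top_dir):
        for name in filenames:
            if name.endswith(suffix):
                with open(os.path.join(dirpath, name), "rb") as fh:
                    n_bytes += len(fh.read())
                n_files += 1
    # Returns (files read, bytes read, elapsed seconds)
    return n_files, n_bytes, time.time() - start
```

Calling it twice in a row and comparing the elapsed times gives a rough picture of cold versus cached reads on a given machine.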

Answer 4

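Have you tried data parallelism on the for j in compare: loop to relieve the HDD access bottleneck? multiprocessing can be used to run one task per CPU core (or hardware thread). For an example, see using-multiprocessing-queue-pool-and-locking.

If your Intel i7 has 8 virtual cores, the elapsed time could theoretically drop to 1/8. The actual reduction also depends on the access time of your HDD or SSD, the SATA interface type, and so on.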
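A minimal sketch of splitting the work by subfolder is shown below. It uses ThreadPool from multiprocessing.pool, which runs threads rather than separate processes, on the assumption that the work is mostly waiting on disk; swapping in multiprocessing.Pool would give one process per worker as described above. The function names are hypothetical, and on a single spinning disk parallel reads may not help at all, so it is worth timing both.

```python
import os
from multiprocessing.pool import ThreadPool


def read_folder(folder):
    """Read every .png in one subfolder; returns (folder, list of raw bytes)."""
    blobs = []
    for name in sorted(os.listdir(folder)):
        if name.endswith(".png"):
            with open(os.path.join(folder, name), "rb") as fh:
                blobs.append(fh.read())
    return folder, blobs


def read_all(top_dir, workers=8):
    """Hand one subfolder at a time to a pool of worker threads."""
    folders = sorted(
        os.path.join(top_dir, d)
        for d in os.listdir(top_dir)
        if os.path.isdir(os.path.join(top_dir, d))
    )
    pool = ThreadPool(workers)
    try:
        # map() keeps the input order, so results line up with folders
        return dict(pool.map(read_folder, folders))
    finally:
        pool.close()
        pool.join()
```

The result maps each subfolder path to the raw contents of its files, which can then be decoded in the main thread or inside read_folder itself.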
