
How to do faster opencv cv2 imread in python after reboot

I have ~650,000 image files that I convert to numpy arrays with cv2. The images are arranged into subfolders with ~10k images in each. Each image is tiny; about 600 bytes (2x100 pixels RGB).

When I read them all using:

cv2.imread()

It takes half a second per 10k images, under a minute for all 650k... except after I restart my machine. Then it takes 20-50 seconds per 10k images the first time I run my script after reboot; half an hour or so for the full read.

Why?

How can I keep them rapidly accessible after restart without the arduously slow initial read?

The database of historic images grows daily; older ones do not get rewritten.

Code:

import os
import time

import cv2
import numpy as np

print 'Building historic database...'
elapsed = elapsed2 = time.time()

def get_immediate_subdirectories(a_dir):
    return [name for name in os.listdir(a_dir)
            if os.path.isdir(os.path.join(a_dir, name))]

compare = get_immediate_subdirectories('images_old')
compare.sort()

images = []
for j in compare:
    begin = 1417024800
    end = 1500000000
    if ASSET == j:  # ASSET is defined elsewhere in the full script
        end = int(time.time() - 86400*30)
    tally = 0
    for i in range(begin, end, 7200):
        try:
            im = cv2.imread("images_old/%s/%s_%s.png" % (j, j, i))
            if im is not None:  # imread returns None on failure; test before flattening
                im = np.ndarray.flatten(im)
                images.append([j, i, im])
                tally += 1
        except:  # note: a bare except silently hides real errors
            pass
    print j.ljust(5), ('cv2 imread elapsed: %.2f items: %s' % ((time.time()-elapsed), tally))
    elapsed = time.time()
print '%.2f cv2 imread big data: %s X %s items' % ((time.time()-elapsed2), len(images), len(a1))  # a1 is defined elsewhere
elapsed = time.time()

AMD FM2+, 16GB RAM, Linux Mint 17.3, Python 2.7

I would like to suggest a concept based on REDIS, which is like a database but actually a "data structure server" wherein the data structures are your 600 byte images. I am not suggesting for a minute that you rely on REDIS as a permanent storage system; rather, continue to use your 650,000 files but cache them in REDIS, which is free and available for Linux, macOS and Windows.

So, basically, at any point in the day, you could copy your images into REDIS ready for the next restart.

I don't speak Python, but here is a Perl script that I used to generate 650,000 images of 600 random bytes each and insert them into a REDIS hash. The corresponding Python would be pretty easy to write:

#!/usr/bin/perl
################################################################################
# generator <number of images> <image size in bytes>
# Mark Setchell
# Generates and sends "images" of specified size to REDIS
################################################################################
use strict;
use warnings FATAL => 'all';
use Redis;
use Time::HiRes qw(time);

my $Debug=1;    # set to 1 for debug messages

my $nargs = $#ARGV + 1;
if ($nargs != 2) {
    print "Usage: generator <number of images> <image size in bytes>\n";
    exit 1;
}

my $nimages=$ARGV[0];
my $imsize=$ARGV[1];
my @bytes=(q(a)..q(z),q(A)..q(Z),q(0)..q(9));
my $bl = scalar @bytes - 1;

printf "DEBUG: images: $nimages, size: $imsize\n" if $Debug;

# Connection to REDIS
my $redis = Redis->new;
my $start=time;

for(my $i=0;$i<$nimages;$i++){
   # Generate our 600 byte "image"
   my $image;
   for(my $j=0;$j<$imsize;$j++){
      $image .= $bytes[rand $bl];
   }
   # Load it into a REDIS hash called 'im' indexed by an integer number
   $redis->hset('im',$i,$image);
   print "DEBUG: Sending key:images, field:$i, value:$image\n" if $Debug;
}
my $elapsed=time-$start;
printf "DEBUG: Sent $nimages images of $imsize bytes in %.3f seconds, %d images/s\n",$elapsed,int($nimages/$elapsed)
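
For reference, here is a rough Python equivalent of that generator (a sketch only, in Python 2 style to match the question, assuming the redis-py client package installed with pip install redis):

#!/usr/bin/env python
# generator.py <number of images> <image size in bytes>
# Sketch of a Python equivalent of the Perl generator above
import random
import string
import sys
import time

import redis

if len(sys.argv) != 3:
    print 'Usage: generator <number of images> <image size in bytes>'
    sys.exit(1)

nimages = int(sys.argv[1])
imsize = int(sys.argv[2])

r = redis.Redis()  # connects to localhost:6379 by default
start = time.time()
for i in range(nimages):
    # Generate a random alphanumeric "image" of imsize bytes
    image = ''.join(random.choice(string.ascii_letters + string.digits)
                    for _ in range(imsize))
    # Store it in a REDIS hash called 'im', indexed by an integer
    r.hset('im', i, image)
elapsed = time.time() - start
print 'Sent %d images of %d bytes in %.3f seconds, %d images/s' % (
    nimages, imsize, elapsed, int(nimages / elapsed))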

So, you can insert the 650,000 images of 600 bytes each into a REDIS hash called "im", indexed by a simple number [1..650000].

Now, if you stop REDIS and check the size of the database, it is 376MB:

ls -lhrt dump.rdb

-rw-r--r--  1 mark  admin   376M 29 May 20:00 dump.rdb

If you now kill REDIS and restart it, it takes 2.862 seconds to start and load the 650,000 image database:

redis-server /usr/local/etc/redis.conf

                _._                                                  
           _.-``__ ''-._                                             
      _.-``    `.  `_.  ''-._           Redis 3.2.9 (00000000/0) 64 bit
  .-`` .-```.  ```\/    _.,_ ''-._                                   
 (    '      ,       .-`  | `,    )     Running in standalone mode
 |`-._`-...-` __...-.``-._|'` _.-'|     Port: 6379
 |    `-._   `._    /     _.-'    |     PID: 33802
  `-._    `-._  `-./  _.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |           http://redis.io        
  `-._    `-._`-.__.-'_.-'    _.-'                                   
 |`-._`-._    `-.__.-'    _.-'_.-'|                                  
 |    `-._`-._        _.-'_.-'    |                                  
  `-._    `-._`-.__.-'_.-'    _.-'                                   
      `-._    `-.__.-'    _.-'                                       
          `-._        _.-'                                           
              `-.__.-'                                               

33802:M 29 May 20:00:57.698 # Server started, Redis version 3.2.9
33802:M 29 May 20:01:00.560 * DB loaded from disk: 2.862 seconds
33802:M 29 May 20:01:00.560 * The server is now ready to accept connections on port 6379

So, you could start REDIS in under 3 seconds after reboot. Then you can query and load the 650,000 images like this:

#!/usr/bin/perl
################################################################################
# reader
# Mark Setchell
# Reads specified number of images from Redis
################################################################################
use strict;
use warnings FATAL => 'all';
use Redis;
use Time::HiRes qw(time);

my $Debug=0;    # set to 1 for debug messages
my $nargs = $#ARGV + 1;
if ($nargs != 1) {
    print "Usage: reader <number of images>\n";
    exit 1;
}

my $nimages=$ARGV[0];

# Connection to REDIS
my $redis = Redis->new;
my $start=time;

for(my $i=0;$i<$nimages;$i++){
   # Retrieve image from hash named "im" with key=$i
   my $image = $redis->hget('im',$i);
   print "DEBUG: Received image $i\n" if $Debug;
}
my $elapsed=time-$start;
printf "DEBUG: Received $nimages images in %.3f seconds, %d images/s\n",$elapsed,int($nimages/$elapsed)
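
And a similarly rough Python sketch of the reader, again assuming redis-py:

#!/usr/bin/env python
# reader.py <number of images>
# Sketch of a Python equivalent of the Perl reader above
import sys
import time

import redis

if len(sys.argv) != 2:
    print 'Usage: reader <number of images>'
    sys.exit(1)

nimages = int(sys.argv[1])

r = redis.Redis()
start = time.time()
for i in range(nimages):
    # Retrieve image i from the REDIS hash named 'im'
    image = r.hget('im', i)
elapsed = time.time() - start
print 'Received %d images in %.3f seconds, %d images/s' % (
    nimages, elapsed, int(nimages / elapsed))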

And that reads 650,000 images of 600 bytes each in 61 seconds on my Mac, so your total startup time would be 64 seconds.

Sorry, I don't know enough Python yet to rewrite it myself, but I suspect the times would be pretty similar.

I am basically using a REDIS hash called "im", with hset and hget, and am indexing the images by a simple integer. However, REDIS keys are binary safe, so you could use filenames as keys instead of integers. You can also interact with REDIS at the command line (without Python or Perl), so you can get a list of the 650,000 keys (filenames) at the command line with:

redis-cli <<< "hkeys im"

or retrieve a single image (with key/filename="1") with:

 redis-cli <<< "hget 'im' 1"

If you don't have bash, you could do:

echo "hget 'im' 1" | redis-cli

or

echo "hkeys im" | redis-cli

I was just reading about persisting/serializing Numpy arrays, so that may be an even simpler option than involving REDIS... see here.

I was thinking overnight and have an even simpler, faster solution...

Basically, at any point you like during the day, you parse the file system of your existing image files and make a flattened representation of them in two files. Then, when you start up, you just read the flattened representation, which is a single 300MB contiguous file on disk that can be read in 2-3 seconds.

So, the first file is called "flat.txt" and it contains a single line for each file, like this but actually 650,000 lines long:

filename:width:height:size
filename:width:height:size
...
filename:width:height:size

The second file is just a binary file with the contents of each of the listed files appended to it - so it is a contiguous 360MB binary file called "flat.bin".

Here is how I create the two files in Perl, using this script called flattener.pl:

#!/usr/bin/perl
use strict;
use warnings;

use File::Find;

# Names of the index and bin files
my $idxname="flat.txt";
my $binname="flat.bin";

# Open index file, which will have format:
#    fullpath:width:height:size
#    fullpath:width:height:size
open(my $idx,'>',$idxname);

# Open binary file - simply all images concatenated
open(my $bin,'>',$binname);

# Save time we started parsing filesystem
my $atime = my $mtime = time;

find(sub {
  # Only parse actual files (not directories) with extension "png"
  if (-f and /\.png$/) {
    # Get full path filename, filesize in bytes
    my $path   = $File::Find::name;
    my $nbytes = -s;
    # Write name and vital statistics to index file
    print $idx "$path:100:2:$nbytes\n";
    # Slurp entire file and append to binary file
    my $image = do {
       local $/ = undef;
       open my $fh, "<", $path;
       <$fh>;
    };
    print $bin $image;
  }
}, '/path/to/top/directory');

close($idx);
close($bin);

# Set atime and mtime of index file to match time we started parsing
utime($atime, $mtime, $idxname) or warn "Couldn't touch $idxname: $!";
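
A rough Python equivalent of the flattener might look like this (a sketch only; the images_old top directory and the 100x2 dimensions are assumptions carried over from the question):

#!/usr/bin/env python
# flattener.py
# Sketch of a Python equivalent of flattener.pl above
import os
import time

idxname = 'flat.txt'
binname = 'flat.bin'

# Save time we started parsing the filesystem
atime = mtime = time.time()

with open(idxname, 'w') as idx, open(binname, 'wb') as bin_out:
    for dirpath, dirnames, filenames in os.walk('images_old'):
        for name in sorted(filenames):
            # Only parse actual files with extension "png"
            if not name.endswith('.png'):
                continue
            path = os.path.join(dirpath, name)
            nbytes = os.path.getsize(path)
            # Write name and vital statistics to the index file
            idx.write('%s:100:2:%d\n' % (path, nbytes))
            # Slurp the entire file and append it to the binary file
            with open(path, 'rb') as f:
                bin_out.write(f.read())

# Set atime and mtime of the index file back to when we started parsing
os.utime(idxname, (atime, mtime))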

Then, when you want to start up, you run loader.pl, which looks like this:

#!/usr/bin/perl
use strict;
use warnings;

# Open index file, which will have format:
#    fullpath:width:height:size
#    fullpath:width:height:size
open(my $idx, '<', 'flat.txt');

# Open binary file - simply all images concatenated
open(my $bin, '<', 'flat.bin');

# Read index file, one line at a time
my $total=0;
my $nfiles=0;
while ( my $line = <$idx> ) {
    # Remove CR or LF from end of line
    chomp $line;

    # Parse line into: filename, width, height and size
    my ($name,$width,$height,$size) = split(":",$line);

    print "Reading file: $name, $width x $height, bytes:$size\n";
    my $bytes_read = read $bin, my $bytes, $size;
    if($bytes_read != $size){
       print "ERROR: File=$name, expected size=$size, actually read=$bytes_read\n"
    }
    $total += $bytes_read;
    $nfiles++;
}
print "Read $nfiles files, and $total bytes\n";

close($idx);
close($bin);
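
And a rough Python sketch of the loader:

#!/usr/bin/env python
# loader.py
# Sketch of a Python equivalent of loader.pl above
total = 0
nfiles = 0
with open('flat.txt') as idx, open('flat.bin', 'rb') as bin_in:
    for line in idx:
        # Parse line into: filename, width, height and size
        name, width, height, size = line.rstrip('\n').rsplit(':', 3)
        size = int(size)
        data = bin_in.read(size)
        if len(data) != size:
            print 'ERROR: File=%s, expected size=%d, actually read=%d' % (
                name, size, len(data))
        total += len(data)
        nfiles += 1
print 'Read %d files, and %d bytes' % (nfiles, total)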

And that takes under 3 seconds with 497,000 files of 600 bytes each.


So, what about files that have changed since you ran the flattener.pl script? Well, at the start of the flattener.pl script, I get the system time in seconds since the epoch. Then, at the end, when I have finished parsing the 650,000 files and have written the flattened files out, I set their modification time back to just before I started parsing. Then in your code, all you need to do is load the files using the loader.pl script, then do a quick find of all image files newer than the index file, and load those few extra files using your existing method.

In bash, that would be:

find . -newer flat.txt -print

As you are reading images with OpenCV, you will need to do an imdecode() on the raw file data, so I would benchmark whether you want to do that whilst flattening or whilst loading.
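
For example, decoding one image's raw bytes might look like this minimal sketch (where data holds one file's raw PNG bytes, e.g. as read by the loader sketch above):

import cv2
import numpy as np

# data: one file's raw PNG bytes, e.g. as read from flat.bin
buf = np.frombuffer(data, dtype=np.uint8)
im = cv2.imdecode(buf, cv2.IMREAD_COLOR)  # im is None if decoding fails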


Again, sorry it is in Perl, but I am sure it can be done just the same in Python.

Did you check that the disk is not the bottleneck? Image files can be cached by the OS after the first read and then served from memory. If all your files together are large enough (10-20GB), it could take several minutes for a slow HDD to read them.

Have you tried data parallelism on your for j in compare: loop to mitigate the HDD access bottleneck? multiprocessing can be used to perform one task per CPU core (or hardware thread). See using-multiprocessing-queue-pool-and-locking for an example.
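
For instance, a minimal sketch of parallelising the per-subfolder reads with a multiprocessing Pool (read_subfolder is a hypothetical helper wrapping the question's inner loop; the begin/end constants are taken from the question):

from multiprocessing import Pool
import os

import cv2
import numpy as np

BEGIN = 1417024800
END = 1500000000

def read_subfolder(j):
    # Hypothetical helper: run the question's inner imread loop for
    # one subfolder and return its [subfolder, timestamp, image] rows
    rows = []
    for i in range(BEGIN, END, 7200):
        im = cv2.imread("images_old/%s/%s_%s.png" % (j, j, i))
        if im is not None:
            rows.append([j, i, np.ndarray.flatten(im)])
    return rows

if __name__ == '__main__':
    compare = sorted(name for name in os.listdir('images_old')
                     if os.path.isdir(os.path.join('images_old', name)))
    pool = Pool()  # defaults to one worker process per CPU core
    results = pool.map(read_subfolder, compare)  # one task per subfolder
    pool.close()
    pool.join()
    images = [row for rows in results for row in rows]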

If you have an Intel i7 with 8 virtual cores, the elapsed time may theoretically reduce to 1/8. The actual speedup will also depend on the access time of your HDD or SSD, the SATA interface type, etc.
