Data structure to store huge amount of data?

In my application, I have to load volume data from a set of images (MRC images) and keep the pixel data in memory (the images are grayscale, so one byte per pixel).

My development environment is the Qt framework, with MinGW on Windows and GCC on Linux.

At the moment, I use a simple data structure to store the volume data:

unsigned char *volumeData;

and do one huge allocation as follows:

volumeData=new unsigned char[imageXsize * imageYsize * numofImages];

The important methods for accessing image data in a given plane are:

unsigned char* getXYPlaneSlice(int z_value);
unsigned char* getYZPlaneSlice(int x_value);
unsigned char* getZXPlaneSlice(int y_value);

With my simple data structure it was easy to implement the above methods.
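
For reference, a minimal sketch of how these accessors can be implemented over the flat array, assuming voxel (x, y, z) is stored at index x + y*imageXsize + z*imageXsize*imageYsize and using the variables from the question; the strided YZ case has to be gathered into a caller-owned buffer, unlike the contiguous XY case:

#include <cstddef>

// Uses volumeData, imageXsize, imageYsize and numofImages from the question.
unsigned char* getXYPlaneSlice(int z_value)
{
    // An XY plane is contiguous in this layout, so it is just a pointer offset.
    return volumeData + (std::size_t)z_value * imageXsize * imageYsize;
}

unsigned char* getYZPlaneSlice(int x_value)
{
    // A YZ plane is strided, so copy it into a buffer the caller must delete[].
    unsigned char* slice = new unsigned char[(std::size_t)imageYsize * numofImages];
    for (int z = 0; z < numofImages; ++z)
        for (int y = 0; y < imageYsize; ++y)
            slice[(std::size_t)z * imageYsize + y] =
                volumeData[(std::size_t)x_value
                           + (std::size_t)y * imageXsize
                           + (std::size_t)z * imageXsize * imageYsize];
    return slice;
}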

But we might need to support volume sizes of 2000x2000x1000 (~3.7 GB) in the future, and the current data structure will not be able to handle data that large.

  1. How can I avoid fragmentation? Even now, with 1000x1000x200 data, the application crashes with bad_alloc. What is the best way to change the data structure to handle this? Should I use something like a linked list where each chunk is about 100 MB?

  2. Also, the user should be able to perform some image-processing filters on the volume data and should also be able to reset it to the original pixel values. That means I have to keep two copies of the volume data. With the current implementation that looks like:

    unsigned char *volumeDataOriginal;

    unsigned char *volumeDataCurrent;

So with a 2000x2000x1000 data range it is going to use about 8 GB (4 GB for each copy of the volume). But in Win32 the address space is 4 GB. How do I tackle this? Should I go with a 64-bit application?

EDIT: Here is a snapshot of my application: [screenshot of the XY-plane viewer]

Basically, I load the volume data (from a set of images, from the MRC format, etc.) and display it in different plane viewers (XY, YZ, ZX; the screenshot shows the XY-plane viewer). I need the above three data-access methods to show an image in a particular plane, and with a slider bar the user can change which image is shown in the selected plane.

Thanks in advance.

I think you should take a look at HDF5. This is a binary format for storing huge amounts of data collected from things like telescopes, physics experiments, and gene-sequencing machines. The benefits of using something like this are many, but three immediate thoughts are: (1) it is well tested, (2) it supports hyperslab selection, and (3) you get compression for free.

There are C/C++, Java, Python, and MATLAB libraries available.
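
To give a flavour of hyperslab selection, here is a minimal sketch using the HDF5 C API (callable from C++) that reads one XY slice at a given z from a 3D dataset of unsigned bytes; the file and dataset names are parameters you would supply, and error checking is omitted:

#include "hdf5.h"

// Reads the XY plane at depth 'z' from a dataset laid out as [z][y][x] of unsigned chars.
// The caller owns the returned buffer of y_size * x_size bytes.
unsigned char* readXYSlice(const char* fileName, const char* datasetName,
                           hsize_t y_size, hsize_t x_size, hsize_t z)
{
    hid_t file   = H5Fopen(fileName, H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset   = H5Dopen2(file, datasetName, H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    hsize_t start[3] = { z, 0, 0 };            // one full XY plane at depth z
    hsize_t count[3] = { 1, y_size, x_size };
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hsize_t memDims[2] = { y_size, x_size };   // in-memory layout of the slice
    hid_t mspace = H5Screate_simple(2, memDims, NULL);

    unsigned char* slice = new unsigned char[y_size * x_size];
    H5Dread(dset, H5T_NATIVE_UCHAR, mspace, fspace, H5P_DEFAULT, slice);

    H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset); H5Fclose(file);
    return slice;
}

With chunked storage in the file, slab reads like this stay reasonably efficient regardless of the slicing direction, and compression is applied per chunk.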

64-bit is probably the easiest way to handle this... let the OS fault in the pages as you use them. Otherwise, it's hard to suggest much without knowing your access patterns through the data. If you're regularly scanning through the images to find the value at the same pixel coordinates, then it's pointless to talk about having pointers to images that are saved and reloaded on demand.

For undo data, you could keep a full backup copy as you suggest, or you could have an undo operation that looks at the change made and is responsible for finding an efficient way to reverse it. For example, if you just flipped the bits, then that's non-destructive and you just need a functor for the same bit-flip operation to undo the change. If setting all the pixels to the same tone is a common operation (e.g. filling, clearing), then you could encode that image state with a boolean and a single pixel value, and keep the full buffer for undos.
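
A small sketch of that operation-based undo idea (the names are illustrative, not from the answer):

#include <cstddef>
#include <functional>
#include <vector>

// Each entry on the stack is a callable that reverses one edit.
struct UndoStack {
    std::vector<std::function<void()>> undos;

    void push(std::function<void()> undo) { undos.push_back(undo); }
    void undoLast() {
        if (!undos.empty()) { undos.back()(); undos.pop_back(); }
    }
};

// Example: inverting the grayscale values is self-inverse, so the undo is the same operation.
void invertVolume(unsigned char* data, std::size_t count, UndoStack& stack)
{
    for (std::size_t i = 0; i < count; ++i) data[i] = 255 - data[i];
    stack.push([data, count]() {
        for (std::size_t i = 0; i < count; ++i) data[i] = 255 - data[i];
    });
}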

The simplest solution to your problem would be to use a 64-bit address space - modern Macs support this out of the box; on Windows and Linux you will need to install the 64-bit version of the OS. I believe Qt can be used to build 64-bit apps quite nicely. 32-bit systems won't be able to support single allocations of the size you're talking about - even a Mac with 4 GB of address space available to applications won't be able to make a single 3.7 GB allocation, as there will not be a contiguous free region of that size available.

For undo I would look at using memory-mapped files and copy-on-write to copy the block:

http://en.wikipedia.org/wiki/Copy-on-write

This means you don't actually have to copy all the original data; the system will make copies of pages as they are written to. This will greatly aid performance if your images are significantly bigger than real memory and you're not changing every part of the image. It looks like boost::mapped_file with "private" access might be helpful for this.
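
A rough sketch of that idea with Boost.Iostreams, assuming the volume has already been written out to a single raw file ("volume.raw" is a hypothetical name): a private mapping gives copy-on-write semantics, so writes land in process-private pages while the file on disk stays untouched.

#include <boost/iostreams/device/mapped_file.hpp>

int main()
{
    // Open the file as a private (copy-on-write) mapping: reads come from the file,
    // writes only touch this process's private copy of the affected pages.
    boost::iostreams::mapped_file_params params;
    params.path  = "volume.raw";                          // hypothetical file name
    params.flags = boost::iostreams::mapped_file::priv;
    boost::iostreams::mapped_file volume(params);

    unsigned char* current = reinterpret_cast<unsigned char*>(volume.data());
    current[0] = 255;   // modifies the in-memory copy only, not the file on disk
    return 0;
}

Reverting to the original data is then just a matter of closing and reopening the mapping.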

If you really, really need to support 32-bit systems, your only alternative is to break those big blocks down somehow, typically into planes or sub-volumes. Both are horrible to work with when it comes to applying 3D filters etc., though, so I really would avoid this if you can.

If you do go the sub-volume route, one trick is to keep all the sub-volumes in memory-mapped files, and map them into your address space only when you need them. When unmapped from the address space they should stay around in the unified buffer cache until purged; effectively this means you can utilise more RAM than you have address space (particularly on Windows, where 32-bit applications only get 2 GB of address space by default).

Finally, on 32-bit Windows you can also look at the /3GB switch in boot.ini. This allows applications to use 3 GB of address space rather than the normal 2 GB. From the problem you describe I don't think this will give you enough address space, but it may help you with some smaller volumes. Note that the /3GB switch can cause problems with some drivers because it reduces the amount of address space available to the kernel.

You can use memory-mapped files to manage large datasets with limited memory. However, if your file sizes are going to be 4 GB, then going to 64-bit is recommended. The Boost project has a good multi-platform memory-mapping library that does something very close to what you are looking for.

See http://en.wikipedia.org/wiki/Memory-mapped_file and http://www.boost.org/doc/libs/1_44_0/libs/iostreams/doc/classes/mapped_file.html to get you started. Some sample code below:

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>
#include <string>
int main(int argc, char* argv[]) {
    if (argc < 2) return 1;                    // expect the data file path as argv[1]
    boost::iostreams::mapped_file_source input_source;
    input_source.open(std::string(argv[1]));   // map the whole file read-only
    const char* data = input_source.data();    // pointer to the mapped bytes, paged in on demand
    std::size_t size = input_source.size();    // length of the mapping in bytes
    input_source.close();                      // 'data' is invalid after this point
    return 0;
}

Thanks, Nathan

One option I would consider is memory mapping: instead of mapping all the images, maintain a linked list of images that are lazily loaded. As your filter works through the image list, load images as needed. In the loading phase, map an anonymous block (or one backed by some fixed temporary file) of the same size and copy the image there as your backup; as you apply filters, you just back up to this copy. As @Tony said above, 64-bit is your best option, and for multi-platform memory-mapped files, look at Boost.Interprocess.

Use STXXL: the Standard Template Library for Extra Large Data Sets.

I first heard about it on SO :)

You could use a two-level structure: an array of pointers to the single images or (much better) to bunches of images. So you could keep, say, 20 images in one memory block and put the pointers to those 20-image blocks into the array. This is still fast (compared to a linked list) when doing random access.

You can then implement a simple paging algorithm: at first all pointers in the array are NULL. When you first access an image block you load the 20 images of that block into memory and write the pointer into the array. Subsequent accesses to those images don't load anything.

If your memory gets low because you have loaded many image blocks, you can drop the image block that has been used least (add a second field beside each pointer holding the value of a counter that you increment each time you load an image block). The image block with the lowest counter is the least used one and can be dropped (its memory is reused for the new block and its pointer is set back to NULL).
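
A compact sketch of that scheme (class and member names are illustrative; the "least used" bookkeeping here is a per-block use counter, a slight variation on the load counter described above, and the disk loader is a placeholder):

#include <cstddef>
#include <cstring>
#include <vector>

// Each block holds 'imagesPerBlock' consecutive slices; a block is loaded on first access
// and the block with the smallest use counter is evicted when too many are resident.
class BlockedVolume {
public:
    BlockedVolume(std::size_t sliceBytes, std::size_t numSlices,
                  std::size_t imagesPerBlock, std::size_t maxResidentBlocks)
        : sliceBytes_(sliceBytes),
          imagesPerBlock_(imagesPerBlock),
          maxResident_(maxResidentBlocks),
          clock_(0),
          blocks_((numSlices + imagesPerBlock - 1) / imagesPerBlock, (unsigned char*)0),
          lastUse_(blocks_.size(), 0) {}

    unsigned char* slice(std::size_t z) {
        std::size_t b = z / imagesPerBlock_;
        if (!blocks_[b]) {
            evictIfNeeded();
            blocks_[b] = new unsigned char[sliceBytes_ * imagesPerBlock_];
            loadBlockFromDisk(b, blocks_[b]);          // bring this block's images in
        }
        lastUse_[b] = ++clock_;                        // mark the block as recently used
        return blocks_[b] + (z % imagesPerBlock_) * sliceBytes_;
    }

private:
    void evictIfNeeded() {
        std::size_t resident = 0, victim = 0, oldest = (std::size_t)-1;
        for (std::size_t i = 0; i < blocks_.size(); ++i) {
            if (!blocks_[i]) continue;
            ++resident;
            if (lastUse_[i] < oldest) { oldest = lastUse_[i]; victim = i; }
        }
        if (resident >= maxResident_) { delete[] blocks_[victim]; blocks_[victim] = 0; }
    }

    // Placeholder: the real application would read this block's images from the MRC files.
    void loadBlockFromDisk(std::size_t /*block*/, unsigned char* dst) {
        std::memset(dst, 0, sliceBytes_ * imagesPerBlock_);
    }

    std::size_t sliceBytes_, imagesPerBlock_, maxResident_;
    std::size_t clock_;
    std::vector<unsigned char*> blocks_;
    std::vector<std::size_t> lastUse_;
};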

The trend these days in working with very large volume data is to break the data up into smaller bricks of, say, 64x64x64 voxels. If you want to do volume rendering with lighting, then you should have a 1-voxel overlap between neighboring bricks so that individual bricks can be rendered without needing their neighbors. If you want to do more complex image processing on the bricks, you can increase the overlap (at the expense of storage).

The advantage of this approach is that you only need to load the bricks that are necessary into memory. The rendering/processing time for a bricked volume is not significantly slower than for a non-bricked volume.
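
To make the brick addressing concrete, a tiny sketch (the 64-voxel edge length comes from the paragraph above; the names, and ignoring the overlap, are simplifying assumptions):

const int BRICK = 64;   // assumed brick edge length, overlap voxels not counted here

// Maps a global voxel coordinate to the brick that contains it and the offset inside that brick.
struct BrickAddress {
    int brickX, brickY, brickZ;   // which brick
    int offX, offY, offZ;         // position inside the brick
};

BrickAddress locateVoxel(int x, int y, int z)
{
    BrickAddress a;
    a.brickX = x / BRICK;  a.offX = x % BRICK;
    a.brickY = y / BRICK;  a.offY = y % BRICK;
    a.brickZ = z / BRICK;  a.offZ = z % BRICK;
    return a;
}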

For a more involved discussion of this from the volume-rendering side, check out papers on the Octreemizer. Here is a link to one on citeseer.

The main problem is probably whether you want totally random access to your data.

The best approach would be to think about the algorithms you want to use, and whether they can be written so that they mainly stride through the data in one direction only. OK, that's not always possible.

If you want to code a middle-weight solution yourself, you could do it like this:

  • use mmap() to map slices of your data structure into memory
  • encapsulate the data in a class, so you can catch accesses to currently unmapped data
  • then mmap() the required region on demand (see the sketch below).

(Actually, this is what the OS is doing anyway if you mmap() the whole file at once, but by taking a bit of control you might make the on-demand algorithm smarter over time and better fit your requirements.)
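
A rough, POSIX-only sketch of that idea (the class name, raw-file layout, and lack of error handling are simplifying assumptions); it maps a single XY slice of a raw volume file on demand and takes care of the page-alignment requirement of mmap():

#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

class MappedSlice {
public:
    MappedSlice(const char* path, std::size_t sliceBytes, std::size_t z)
        : size_(sliceBytes)
    {
        fd_ = open(path, O_RDONLY);
        // mmap() offsets must be page aligned; align down and remember the correction.
        off_t offset   = (off_t)(z * sliceBytes);
        off_t pageMask = sysconf(_SC_PAGESIZE) - 1;
        off_t aligned  = offset & ~pageMask;
        shift_ = (std::size_t)(offset - aligned);
        base_  = mmap(NULL, size_ + shift_, PROT_READ, MAP_PRIVATE, fd_, aligned);
    }
    ~MappedSlice() {
        if (base_ != MAP_FAILED) munmap(base_, size_ + shift_);
        if (fd_ >= 0) close(fd_);
    }
    const unsigned char* data() const {
        return static_cast<const unsigned char*>(base_) + shift_;
    }
private:
    int         fd_;
    void*       base_;
    std::size_t size_, shift_;
};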

Again, this is no fun if you jump around on those image voxels. Your algorithm must fit the data access pattern - that holds for whatever solution you choose for storing your data. Totally random access will "break" everything if your data is larger than your physical memory.

If the hardware and OS allow it, I would go 64-bit and map the file into memory (see CreateFileMapping on Windows and mmap on Linux).

On Windows, you can create a view over the mapped file that allows copy-on-write. I'm sure you can get that functionality on Linux as well. Anyway, if you create a read-only view over the source file, that will be your "original data". Then you create a copy-on-write view over the source file - this will be the "current data".

When you modify the current data, the modified underlying pages will be copied and allocated for you, and the pages of the source data will remain intact. If you make sure that you do not write identical data back to your "current data", you will also get optimal memory usage, because current data and original data will share memory pages. You do have to take page alignment into consideration, though, because copy-on-write works on a per-page basis.

Also, reverting from current to original data is a simple job: all you need to do is recreate the mapping for the "current data".
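
A minimal Win32 sketch of the two views (error handling omitted; "volume.raw" is a hypothetical file name, and PAGE_READONLY with FILE_MAP_COPY is the combination I would start from - check the MapViewOfFile documentation for the exact protection requirements):

#include <windows.h>

int main()
{
    HANDLE file = CreateFileA("volume.raw", GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);

    // Read-only view of the file: the untouched "original data".
    const unsigned char* original =
        (const unsigned char*)MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);

    // Copy-on-write view of the same file: the editable "current data".
    unsigned char* current =
        (unsigned char*)MapViewOfFile(mapping, FILE_MAP_COPY, 0, 0, 0);

    current[0] = 255;   // only the touched page is copied; 'original' stays intact

    // Revert to the original data by simply recreating the copy-on-write view.
    UnmapViewOfFile(current);
    current = (unsigned char*)MapViewOfFile(mapping, FILE_MAP_COPY, 0, 0, 0);

    UnmapViewOfFile(current);
    UnmapViewOfFile(original);
    CloseHandle(mapping);
    CloseHandle(file);
    return 0;
}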

By using file mapping, the tedious work of managing memory is handled by the OS. It will be able to use all available memory in a very efficient way - far more efficiently than you could ever accomplish with normal heap allocations.

I would start by researching CreateFileMapping() and MapViewOfFile() for use on Windows. For Linux you have mmap(), but that's as far as my knowledge goes. I haven't touched anything *nix since 2000...

Have a look at SciDB. I am no expert on it, but from its sample use cases and a paper describing it, it appears to let you naturally map your data onto a 3D (+1D for time/versioning) array like this:

CREATE ARRAY Pixels [
    x INT,
    y INT,
    z INT,
    version INT
] (
    pixel INT
);

And to implement your query getXYPlaneSlice:

Slice (Pixels, z = 3, version = 1);

To avoid duplicating data when only part of it has changed, you do not need to fill the whole array for version 1, since SciDB supports sparse arrays. Then, when you need to load the newest data, you can load with version = 0 to get the old version and update the result with another load with version = 1.
