
When writing a large array directly to disk in MATLAB, is there any need to preallocate?

I need to write an array that is too large to fit into memory to a .mat binary file. This can be accomplished with the matfile function, which allows random access to a .mat file on disk.
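For context, here is a minimal sketch of that matfile workflow (the file name, variable name and sizes are illustrative, not taken from the question):

m = matfile('bigdata.mat', 'Writable', true);  % random-access handle to a .mat file on disk
m.A(1, 1:5) = 1:5;                             % writes go straight to disk ...
m.A(2, 1:5) = 6:10;                            % ... growing the on-disk array as needed
part = m.A(1:2, 2:3);                          % random access: read back a sub-block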

Normally, the accepted advice is to preallocate arrays, because expanding them on every iteration of a loop is slow. However, when I was asking how to do this, it occurred to me that this may not be good advice when writing to disk rather than to RAM.

Will the same performance hit from growing the array apply, and if so, will it be significant when compared to the time it takes to write to disk anyway?

(Assume that the whole file will be written in one session, so the risk of serious file fragmentation is low.)

Q: Will the same performance hit from growing the array apply, and if so, will it be significant when compared to the time it takes to write to disk anyway?

A: Yes, performance will suffer if you significantly grow a file on disk without preallocating. The performance hit will be a consequence of fragmentation. As you mentioned, fragmentation is less of a risk if the file is written in one session, but it will cause problems if the file grows significantly.

A related question was raised on the MathWorks website, and the accepted answer was to preallocate when possible.
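As a sketch of what "preallocate when possible" looks like with matfile (the name and dimensions below are illustrative): assigning a value to the last element first sizes the variable, and hence the file, in a single step. How much physical disk space this actually reserves is left to the underlying HDF5 layer of the version 7.3 MAT-file format.

m = matfile('bigdata.mat', 'Writable', true);
m.A(10000, 1000) = 0;   % A now has its full 10000-by-1000 extent on disk up front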

If you don't preallocate, then the extent of your performance problems will depend on:

  • your filesystem (how data are stored on disk, the cluster size),
  • your hardware (HDD seek time, or SSD access times),
  • the size of your .mat file (whether it moves into non-contiguous space),
  • and the current state of your storage (existing fragmentation / free space).

Let's pretend that you're running a recent Windows OS, and so are using the NTFS file system. Let's further assume that it has been set up with the default 4 kB cluster size. So, space on disk gets allocated in 4 kB chunks, and the locations of these are indexed in the Master File Table (MFT). If the file grows and contiguous space is not available, then there are only two choices:

  1. Re-write the entire file to a new part of the disk, where there is sufficient free space.
  2. Fragment the file, storing the additional data at a different physical location on disk.

The file system chooses to do the least-bad option, #2, and updates the MFT record to indicate where the new clusters will be on disk.

[Illustration of a fragmented file on NTFS, from WindowsITPro]

Now, the hard disk needs to physically move the read head in order to read or write the new clusters, and this is a (relatively) slow process. In terms of moving the head and waiting for the right area of disk to spin underneath it... you're likely to be looking at a seek time of about 10 ms. So every time you hit a fragment, there will be an additional 10 ms delay whilst the HDD moves to access the new data. SSDs have much shorter seek times (no moving parts). For the sake of simplicity, we're ignoring multi-platter systems and RAID arrays!

If you keep growing the file at different times, then you may experience a lot of fragmentation. This really depends on when / how much the file is growing by, and how else you are using the hard disk. The performance hit that you experience will also depend on how often you are reading the file, and how frequently you encounter the fragments.

MATLAB stores data in column-major order, and from the comments it seems that you're interested in performing column-wise operations (sums, averages) on the dataset. If the columns become non-contiguous on disk, then you're going to hit lots of fragments on every operation!
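To make that access pattern concrete, here is a sketch of a column-wise reduction over a matfile-backed array (variable names and sizes are illustrative). Each iteration reads one full column, which is a contiguous run on disk when the file is unfragmented:

m = matfile('bigdata.mat');
[nRows, nCols] = size(m, 'A');           % query dimensions without loading A
colMeans = zeros(1, nCols);
for j = 1:nCols
    colMeans(j) = mean(m.A(1:nRows, j)); % one contiguous column read per iteration
end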

As mentioned in the comments, both read and write actions will be performed via a buffer. As @user3666197 points out, the OS can speculatively read ahead of the current data on disk, on the basis that you're likely to want that data next. This behaviour is especially useful if the hard disk would otherwise be sitting idle at times: keeping it operating at maximum capacity and working with small parts of the data in buffer memory can greatly improve read and write performance. However, from your question it sounds as though you want to perform large operations on a huge (too big for memory) .mat file. Given your use case, the hard disk is going to be working at capacity anyway, and the data file is too big to fit in the buffer, so these particular tricks won't solve your problem.

So... yes, you should preallocate. Yes, a performance hit from growing the array on disk will apply. Yes, it will probably be significant (it depends on specifics like the amount of growth, fragmentation, etc.). And if you're going to really get into the HPC spirit of things, then stop what you're doing, throw away MATLAB, shard your data and try something like Apache Spark! But that's another story.

Does that answer your question?

PS Corrections / amendments welcome! I was brought up on POSIX inodes, so sincere apologies if there are any inaccuracies in here...

Preallocating a variable in RAM and preallocating on the disk don't solve the same problem.

In RAM

To expand a matrix in RAM, MATLAB creates a new matrix with the new size, copies the values of the old matrix into the new one, and deletes the old one. This costs a lot of performance.

If you preallocate the matrix, its size does not change, so there is no reason for MATLAB to do this matrix copying anymore.
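A minimal illustration of that copying cost (sizes are arbitrary):

n = 1e5;

x = [];                 % grown each iteration: repeated reallocation and copying
for i = 1:n
    x(i) = i^2;
end

y = zeros(1, n);        % preallocated once: every write is in place
for i = 1:n
    y(i) = i^2;
end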

On the hard disk

The problem on the hard disk is fragmentation, as GnomeDePlume said. Fragmentation will still be a problem, even if the file is written in one session.

Here is why: the hard disk will generally already be a little fragmented. Imagine:

  • # to be memory blocks on the hard disk that are full,
  • M to be memory blocks on the hard disk that will be used to save the data of your matrix,
  • - to be free memory blocks on the hard disk.

Now the hard disk could look like this before you write the matrix onto it:

###--##----#--#---#--------------------##-#---------#---#----#------

When you write parts of the matrix (e.g. MMM blocks), you could imagine the process looking like this (in this example the file system simply goes from left to right and uses the first free space that is big enough; real file systems are different):

  1. First matrix part:
    ###--##MMM-#--#---#--------------------##-#---------#---#----#------
  2. Second matrix part:
    ###--##MMM-#--#MMM#--------------------##-#---------#---#----#------
  3. Third matrix part:
    ###--##MMM-#--#MMM#MMM-----------------##-#---------#---#----#------
  4. And so on ...

Clearly the matrix file on the hard disk is fragmented, even though we wrote it without doing anything else in the meantime.

This can be better if the matrix file is preallocated. In other words, we tell the file system how big our file will be, or in this example, how many memory blocks we want to reserve for it.

Imagine the matrix needed 12 blocks: MMMMMMMMMMMM. We tell the file system that we need this much by preallocating, and it will try to accommodate our needs as best it can. In this example, we are lucky: there is free space with >= 12 contiguous memory blocks.

  1. Preallocating (we need 12 memory blocks):
    ###--##----#--#---# (------------) --------##-#---------#---#----#------
    The file system reserves the space between the parentheses for our matrix and will write in there.
  2. First matrix part:
    ###--##----#--#---# (MMM---------) --------##-#---------#---#----#------
  3. Second matrix part:
    ###--##----#--#---# (MMMMMM------) --------##-#---------#---#----#------
  4. Third matrix part:
    ###--##----#--#---# (MMMMMMMMM---) --------##-#---------#---#----#------
  5. Fourth and last part of the matrix:
    ###--##----#--#---# (MMMMMMMMMMMM) --------##-#---------#---#----#------

Voilà, no fragmentation!


Analogy

Generally, you could imagine this process as buying cinema tickets for a large group. You would like to sit together as a group, but there are already some seats in the theatre reserved by other people. For the cashier to be able to accommodate your request (the large group wants to sit together), he/she needs to know how big your group is (preallocating).

A quick answer to the whole discussion (in case you do not have the time to follow it, or the technical understanding):

  • Preallocation in MATLAB is relevant for operations in RAM. MATLAB does not give low-level access to I/O operations, and thus we cannot talk about preallocating something on disk.
  • When writing a big amount of data to disk, it has been observed that the fewer the writes, the faster the execution of the task, and the smaller the fragmentation on disk.

Thus, if you cannot write in one go, split the writes into big chunks.
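A sketch of that "few big writes" strategy with matfile (all names and sizes are illustrative): rows are accumulated in a RAM buffer and flushed to disk in one large write per batch, instead of one small write per row.

m = matfile('bigdata.mat', 'Writable', true);
nCols = 1000;  batchRows = 500;  nBatches = 20;   % hypothetical sizes
buffer = zeros(batchRows, nCols);
rowsWritten = 0;
for k = 1:nBatches
    for r = 1:batchRows
        buffer(r, :) = rand(1, nCols);            % stand-in for acquired data
    end
    m.A(rowsWritten+1:rowsWritten+batchRows, 1:nCols) = buffer;  % one big write
    rowsWritten = rowsWritten + batchRows;
end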

Prologue

This answer is based on both the original post and the clarifications provided by the author during the recent week.

The question of the adverse performance hit(s) introduced by a low-level, physical-media-dependent "fragmentation", caused by both the file-system and file-access layers, is confronted below both in TimeDOMAIN magnitudes and in the ComputingDOMAIN repetitiveness with which the real-use problems of such an approach occur.

Finally, a state-of-the-art, principally fastest possible solution to the given task was proposed, so as to minimise the damage from both wasted effort and mis-interpretation errors arising from idealised or otherwise invalid assumptions, such as the assumption that the risk of "serious file fragmentation is low" because the whole file will be written in one session (something that is simply not possible in principle during the many multi-core / multi-process operations of a contemporary O/S acting in real time over the time-of-creation and over a sequence of extensive modifications (ref. the MATLAB size limits) of a TB-sized BLOB file-object inside contemporary COTS file systems).


One may hate the facts, however the facts remain true out there until a faster & better method moves in.


First, before considering performance, realise the gaps in the concept:

  1. The real adverse performance hit is not caused by HDD-IO, nor is it related to file fragmentation.

  2. RAM is not an alternative for the semi-permanent storage of the .mat file.

  3. Additional operating system limits and interventions + additional driver and hardware-based abstractions were ignored in the assumptions about unavoidable overheads.
  4. The said computational scheme was omitted from the review of what will have the biggest impact / influence on the resulting performance.

Given:

  • The whole processing is intended to be run just once: no optimisation / iterations, no continuous processing.

  • Data have 1E6 double float-values x 1E5 columns = about 0.8 TB (+ HDF5 overhead).

  • In spite of the original post, there is no random IO associated with the processing.

  • The data acquisition phase communicates with .NET to receive DataELEMENTs into MATLAB.

    That means, since v7.4,

    a 1.6 GB limit on the MATLAB WorkSpace in a 32bit Win ( 2.7 GB with a 3GB switch ),

    a 1.1 GB limit on MATLAB's biggest Matrix in wXP / 1.4 GB in wV / 1.5 GB,

    a somewhat "released" 2.6 GB limit on the MATLAB WorkSpace + a 2.3 GB limit on the biggest Matrix in a 32bit Linux O/S.

    Having a 64bit O/S will not help any kind of 32bit MATLAB 7.4 implementation, which will fail to work due to another limit, the maximum number of cells in an array, which will not cover the 1E12 requested here.

    The only chance is to have both a 64bit O/S and a 64bit MATLAB.

  • The data storage phase assumes block-writes of row-ordered data blocks (a collection of row-ordered data blocks) into a MAT-file on an HDD-device.

  • The data processing phase assumes re-processing the data in a MAT-file on an HDD-device, after all inputs have been acquired and marshalled to file-based off-RAM storage, but in a column-ordered manner.

  • Just column-wise mean()-s / max()-es need to be calculated (nothing more complex).

Facts:

  • MATLAB uses a "restricted" implementation of an HDF5 file-structure for binary files.

Review performance measurements on real data & real hardware (HDD + SSD) to get a feeling for the scale of their unavoidable weaknesses.

The Hierarchical Data Format (HDF) was born in 1987 at the National Center for Supercomputing Applications (NCSA), some 20 years ago. Yes, that old. The goal was to develop a file format combining flexibility and efficiency to deal with extremely large datasets. Somehow the HDF file was not used in the mainstream, as just a few industries were indeed able to really make use of its terrifying capacities, or simply did not need them.

FLEXIBILITY means that the file-structure bears some overhead, which one need not use if the content of the array is not changing (you pay the cost without consuming any benefit of using it); and the assumption that HDF5 limits on the overall size of the data it can contain somehow help and save the MATLAB side of the problem is not correct.

MAT-files are good in principle, as they avoid an otherwise persistent need to load a whole file into RAM to be able to work with it.

Nevertheless, MAT-files do not serve well the simple task as defined and clarified here. Attempting it will result in just poor performance, and HDD-IO file-fragmentation (adding a few tens of milliseconds during write-throughs, and somewhat less than that on read-aheads during the calculations) will not help at all in judging the core reason for the overall poor performance.


A professional solution approach

Rather than moving the whole gigantic set of 1E12 DataELEMENTs into a MATLAB in-memory proxy data array that is just scheduled for a next coming sequenced stream of HDF5 / MAT-file HDD-device IOs (write-throughs and O/S vs. hardware-device-chain conflicting / sub-optimised read-aheads), so as to have all the immense work "just [married] ready" for a few trivially simple calls of mean() / max() MATLAB functions (which will do their best to revamp each of the 1E12 DataELEMENTs in just another order (and even TWICE, yes, another circus right after the first job-processing nightmare gets all the way down through all the HDD-IO bottlenecks) back into MATLAB in-RAM objects), do redesign this very step into a pipe-lined BigDATA processing from the very beginning.

nCols = 1e5;                                        % number of columns expected
aRowCOUNT = 0;                                      % running row counter
anIncrementalSumInCOLUMN = zeros(1, nCols);         % running per-column sums
aMaxInCOLUMN = -inf(1, nCols);                      % running per-column maxima

while true                                          % ref. comment Simon W Oct 1 at 11:29
   [ isStillProcessingDotNET,   ...                 %      a FLAG from .NET reader function
     aDotNET_RowOfVALUEs ...                        %      a ROW  from .NET reader function
     ] = GetDataFromDotNET( aDtPT );                %                  .NET reader
   if ( isStillProcessingDotNET )                   % Yes, more rows are still to come ...
      aRowCOUNT = aRowCOUNT + 1;                    %      keep .INC for aRowCOUNT ( mean() )
      for i = 1:size( aDotNET_RowOfVALUEs, 2 )      %      stepping across each column
         aValue     = aDotNET_RowOfVALUEs(i);       %
         anIncrementalSumInCOLUMN(i) = ...
         anIncrementalSumInCOLUMN(i) + aValue;      %      keep .SUM for each column ( mean() )
         if ( aMaxInCOLUMN(i) < aValue )            %      retest for a "max.update()"
              aMaxInCOLUMN(i) = aValue;             %      .STO a just found "new" max
         end
      end
      continue                                      %      force re-loop
   else
      break
   end
end
%-------------------------------------------------------------------------------------------
% FINALLY:
% all results are pre-calculated right at the end of .NET reading phase:
%
% -------------------------------
% BILL OF ALL COMPUTATIONAL COSTS ( for given scales of 1E5 columns x 1E6 rows ):
% -------------------------------
% HDD.IO:          **ZERO**
% IN-RAM STORAGE:
%                  Attr Name                       Size                     Bytes  Class
%                  ==== ====                       ====                     =====  =====
%                       aMaxInCOLUMNs              1x100000                800000  double
%                       anIncrementalSumInCOLUMNs  1x100000                800000  double
%                       aRowCOUNT                  1x1                          8  double
%
% DATA PROCESSING:
%
% 1.000.000x .NET row-oriented reads ( same for both the OP and this, smarter BigDATA approach )
%         1x   INT   in aRowCOUNT,                 %%       1E6 .INC-s
%   100.000x FLOATs  in aMaxInCOLUMN[]             %% 1E5 * 1E6 .CMP-s
%   100.000x FLOATs  in anIncrementalSumInCOLUMN[] %% 1E5 * 1E6 .ADD-s
% -----------------
% about 15 sec per COLUMN of 1E6 rows
% -----------------
%                  --> mean()s are anIncrementalSumInCOLUMN./aRowCOUNT
%-------------------------------------------------------------------------------------------
% PIPE-LINE-d processing takes in TimeDOMAIN "nothing" more than the .NET-reader process
%-------------------------------------------------------------------------------------------

Your pipe-lined BigDATA computation strategy will, in a smart way, principally avoid interim storage buffering in MATLAB, as it progressively calculates the results in not more than about 3 x 1E6 ADD/CMP-registers, all with a static layout; it avoids proxy-storage into an HDF5 / MAT-file, absolutely avoids all HDD-IO related bottlenecks and low BigDATA sustained-read speeds (not speaking at all about interim / BigDATA sustained-writes...), and also avoids ill-performing memory-mapped use just for counting means and maxes.


Epilogue

Pipeline processing is nothing new under the Sun.

It re-uses what speed-oriented HPC solutions have already been using for decades

[ generations before the BigDATA tag was "invented" in Marketing Departments ]

Forget about zillions of HDD-IO blocking operations and go into a pipelined, distributed process-to-process solution.


There is nothing faster than this


If there were, all FX business and HFT Hedge Fund Monsters would already be there...
