
Merging sub parts of multiple files into a single file in Java

I have n files, each containing m blocks of data.

    File 0 Contents:
    file0.block1
    file0.block2
    file0.block3
    file0.block4
    ..
    file0.blockM
    File 1 Contents:
    file1.block1
    file1.block2
    file1.block3
    file1.block4
    ..
    file1.blockM

... ...

    File n Contents:
    fileN.block1
    fileN.block2
    fileN.block3
    fileN.block4
    ..
    fileN.blockM

The blocks are of variable size. Blocks having the same id can have different sizes across different files.

The merged file should look like this:

    Merged File Contents:
    file0.block1
    file1.block1
    ...
    fileN.block1
    
    file0.block2
    file1.block2
    ...
    fileN.block2
    
    ..
    
    file0.blockM
    file1.blockM
    ...
    fileN.blockM

Is N really so large that keeping the files open is not an option? At least on Linux, the hard limit on the number of open files is quite large: ulimit -Hn gives me 1048576 on Xubuntu 20.04. The soft limit is much smaller, 1024 by default, but it can be raised with ulimit -n N. I am not sure what a sensible value for N is, but you can try the maximum N you expect your application to encounter. Note: I do not know whether Java imposes limits beyond what the OS does, or whether keeping a million files open costs a lot of memory (I would expect the per-InputStream cost to be on the order of a few KB). Also, no idea how this works on Windows.
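If keeping all the streams open is acceptable, the merge itself is a single round-robin pass over the inputs. Here is a minimal sketch under one assumption that is not in the question: each block is framed with a hypothetical 4-byte length prefix, so the merger knows where blocks end. Adapt the framing to however your files actually delimit blocks.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

public class BlockMerger {

    // Merges m blocks from each input file in round-robin order:
    // block1 of every file, then block2 of every file, and so on.
    // Assumes each block is stored as a 4-byte big-endian length
    // followed by the payload (a hypothetical framing).
    public static void merge(List<File> inputs, File output, int m) throws IOException {
        List<DataInputStream> streams = new ArrayList<>();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(output)))) {
            for (File f : inputs) {
                streams.add(new DataInputStream(
                        new BufferedInputStream(new FileInputStream(f))));
            }
            byte[] buf = new byte[8192];
            for (int block = 0; block < m; block++) {
                for (DataInputStream in : streams) {
                    int len = in.readInt();   // block length prefix
                    out.writeInt(len);        // preserve the same framing
                    int remaining = len;
                    while (remaining > 0) {
                        int n = in.read(buf, 0, Math.min(buf.length, remaining));
                        if (n < 0) throw new EOFException("truncated block");
                        out.write(buf, 0, n);
                        remaining -= n;
                    }
                }
            }
        } finally {
            for (DataInputStream in : streams) {
                try { in.close(); } catch (IOException ignored) {}
            }
        }
    }
}
```

With this approach each input file is opened exactly once and each byte is read and written exactly once; the only cost that grows with N is the number of simultaneously open file descriptors.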

The only middle ground I can think of between opening/closing files all the time and keeping all files open all the time would be to process a number of files at a time, join them into temporary files, and then join the temp files to form the final result. Clearly, that avoids the opening/closing scenario, but it comes at the cost of rewriting the data more often, which can be slow on spinning disks and wears down SSDs if the files are of any significant size.
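That batched middle ground could be sketched as follows, again assuming the hypothetical 4-byte length-prefix framing. The parameter k caps how many streams are open at once; each pass merges k files at a time into temp files, and later passes treat the blocks already interleaved in a temp file as one group per round:

```java
import java.io.*;
import java.util.*;

public class BatchedMerger {

    // Copies one length-prefixed block (4-byte big-endian length + payload).
    private static void copyBlock(DataInputStream in, DataOutputStream out,
                                  byte[] buf) throws IOException {
        int len = in.readInt();
        out.writeInt(len);
        int remaining = len;
        while (remaining > 0) {
            int n = in.read(buf, 0, Math.min(buf.length, remaining));
            if (n < 0) throw new EOFException("truncated block");
            out.write(buf, 0, n);
            remaining -= n;
        }
    }

    // One merge pass: for each of m rounds, input i contributes sizes.get(i)
    // consecutive blocks (a "group" produced by an earlier pass).
    static void mergeGroups(List<File> inputs, List<Integer> sizes,
                            File output, int m) throws IOException {
        List<DataInputStream> streams = new ArrayList<>();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(output)))) {
            for (File f : inputs) {
                streams.add(new DataInputStream(
                        new BufferedInputStream(new FileInputStream(f))));
            }
            byte[] buf = new byte[8192];
            for (int round = 0; round < m; round++) {
                for (int i = 0; i < streams.size(); i++) {
                    for (int b = 0; b < sizes.get(i); b++) {
                        copyBlock(streams.get(i), out, buf);
                    }
                }
            }
        } finally {
            for (DataInputStream in : streams) {
                try { in.close(); } catch (IOException ignored) {}
            }
        }
    }

    // Batched driver: never holds more than k input streams open at once.
    static File mergeInBatches(List<File> files, int m, int k) throws IOException {
        List<File> level = new ArrayList<>(files);
        List<Integer> sizes = new ArrayList<>(Collections.nCopies(files.size(), 1));
        while (level.size() > 1) {
            List<File> nextFiles = new ArrayList<>();
            List<Integer> nextSizes = new ArrayList<>();
            for (int i = 0; i < level.size(); i += k) {
                int end = Math.min(i + k, level.size());
                File tmp = File.createTempFile("merge", ".tmp");
                tmp.deleteOnExit();
                mergeGroups(level.subList(i, end), sizes.subList(i, end), tmp, m);
                nextFiles.add(tmp);
                int sum = 0;
                for (int j = i; j < end; j++) sum += sizes.get(j);
                nextSizes.add(sum);
            }
            level = nextFiles;
            sizes = nextSizes;
        }
        return level.get(0);
    }
}
```

Each pass rewrites all the data once, so the total I/O is roughly the data size times the number of passes (about log base k of N), which is the rewrite cost mentioned above.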
