
Efficient compression of a file system directory tree with many identical files

We have multiple .NET web applications all sharing quite a few common libraries. None of them are in the GAC.

The deployment constraint is that all of these web applications have dedicated directories, which results in a large number of duplicated DLLs in the overall directory structure.

This directory structure is extracted from a single zip archive.

As a result, the zip archive contains many identical files in different directories.

This is huge redundancy, which I want to eliminate in the zip archive; I do not care much if redundant files are created on disk. I see two ways to optimize the zip:

  1. Use Windows symbolic links and junctions to reduce the number of physically identical files.
  2. Use smart compression that would not compress the same file data twice.

Method 1

I used zip and 7z to test compressing directory structures. I used junctions and file symbolic links as the means to reduce space on disk.
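
For reference, the links were created with something like the following (a sketch using cmd's built-in mklink, which needs an elevated prompt; the app1/app2 paths are illustrative):

    :: directory junction pointing one app's lib folder at another's
    mklink /J app2\bin\shared app1\bin\shared

    :: file symbolic link for a single shared DLL
    mklink app2\bin\Common.dll app1\bin\Common.dll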

Unfortunately, both zip and 7z compress junctions as if they were full-blown directories. A symbolic link is compressed as a zero-length file by 7z; its nature as a symbolic link is lost upon decompression. zip traverses the symbolic link and compresses the target data instead, which results in duplicate file content in the archive.

In short, I failed to eliminate the duplicate file data using the first method.

Method 2

What I want is described exactly by http://sourceforge.net/p/sevenzip/feature-requests/794/. However, it is nothing more than a feature request.

A comment on the feature request mentions lrzip as an efficient compressor for huge files. I still have to check it, but it does not seem to eliminate duplicate file data the way I would like it to.

Any help is welcome.

mark, how did you try lrzip? It can't detect duplicates inside a compressed archive (a default zip); it should be used with a non-compressing archiver (in the Unix world, with tar) or with a zip file created without compression (you will get an archive whose size is almost equal to the sum of the input sizes).
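
Concretely, that pipeline would look something like this (a sketch assuming GNU tar and lrzip with default settings; the archive names are illustrative):

    tar -cf webapps.tar webapps/    # plain uncompressed tar, duplicates stored verbatim
    lrzip webapps.tar               # long-range matching can then collapse the duplicates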

You can also try any multi-file compressor capable of solid mode (rar, 7z), but this may not work if your archive is huge and there is a big distance between duplicates. lrzip supports a greater distance.
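
For example (a sketch; -ms=on enables solid mode in 7z, -s makes a rar archive solid, and the archive/directory names are illustrative):

    7z a -t7z -ms=on -mx=9 webapps.7z webapps
    rar a -s webapps.rar webapps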

Tar (and PAX) on Unix supports hard and soft links: http://www.gnu.org/software/tar/manual/html_section/tar_71.html#SEC140
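
With GNU tar, hard-linked duplicates are stored only once, e.g. (a sketch; the file names are illustrative):

    # replace a duplicate DLL with a hard link to the first copy
    ln app1/bin/Common.dll app2/bin/Common.dll
    # GNU tar records the second occurrence as a hard-link entry, not a second copy
    tar -cf webapps.tar app1 app2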
