
How to add md5 sum to file metadata on a Linux file-system for the purpose of search and de-duplication?

I have a large number of files among which duplicates occasionally appear under different names, and I would like to add an fslint-like capability to the file system so that it can be de-duplicated, and so that any new files created in specified locations are checked against known md5 values. The intent is that after the initial summing of the entire file collection the ongoing overhead is small, since only each new file's md5 sum needs to be compared against the store of existing sums. This check could be a daily job, or part of a file-submission process with an interface like:

checkandsave -f newfile -d destination

Does this utility already exist? What is the best way of storing fileid-md5sum pairs so that the search on a new file's sum is as fast as possible?
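For illustration, here is a minimal sketch of what such a checkandsave wrapper could look like in shell. It is hypothetical, not an existing tool: it assumes a plain-text index file (~/.md5index, a name invented here) in ordinary md5sum manifest format, and refuses to copy a file whose sum is already recorded.

    #!/bin/sh
    # checkandsave: hypothetical sketch of the proposed utility.
    # Usage: checkandsave -f newfile -d destination
    # Index format: standard md5sum manifest lines ("<md5>  <path>").
    INDEX="${HOME}/.md5index"   # assumed location of the store of known sums
    while getopts f:d: opt; do
        case "$opt" in
            f) file=$OPTARG ;;
            d) dest=$OPTARG ;;
        esac
    done
    [ -f "$file" ] && [ -d "$dest" ] || { echo "usage: $0 -f file -d dir" >&2; exit 2; }
    sum=$(md5sum "$file" | cut -d' ' -f1)
    # A linear grep is fine for modest indexes; for millions of entries, keep
    # the index sorted and use look(1), or store the sums in an indexed
    # sqlite table so the lookup stays fast.
    if grep -q "^$sum " "$INDEX" 2>/dev/null; then
        echo "duplicate: $file matches a known md5 sum" >&2
        exit 1
    fi
    cp "$file" "$dest/" && md5sum "$dest/$(basename "$file")" >> "$INDEX"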

Re: using rmlint:

Where does rmlint store the checksums, or is that work repeated on every run? I want to add the checksum to the file metadata (or some form of store that optimises search speed) so that when I have a new file I can generate its sum and check it against the existing pre-calculated sums for all files of the same size.

Yes, rmlint can do this via the --xattr-read and --xattr-write options.

The cron job would be something like:

/usr/bin/rmlint -T df -o sh:/home/foo/dupes.sh -c sh:link --xattr-read --xattr-write /path/to/files

-T df means just look for duplicate files

-o sh:/home/foo/dupes.sh specifies where to put the output report / shell script (if you want one)

-c sh:link specifies that the shell script should replace duplicates with hardlinks or symlinks (or reflinks on btrfs)
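Put together as an actual crontab entry, it might look like the following (the 02:30 schedule and the paths are just placeholders):

    # run nightly at 02:30; the generated script replaces duplicates with links
    30 2 * * * /usr/bin/rmlint -T df -o sh:/home/foo/dupes.sh -c sh:link --xattr-read --xattr-write /path/to/files

The generated /home/foo/dupes.sh can then be reviewed and run by hand (or by a second job) to actually carry out the linking.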

Note that rmlint only calculates file checksums when necessary; for example, if there is only one file with a given size then there is no chance of a duplicate, so no checksum is calculated.

Edit: the checksums are stored in the file's extended-attributes metadata. The default is SHA1, but you can switch to md5 via -a md5.
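To see what has been stored, you can dump a file's user-namespace extended attributes with getfattr from the attr package; the exact attribute names rmlint uses may vary between versions, so this simply lists everything (the file path here is illustrative):

    # -d dumps all user.* extended attributes of the file
    getfattr -d /path/to/files/somefile

    # a new file's sum can then be computed with the matching algorithm, e.g.
    md5sum /path/to/newfile

Note that this relies on the underlying filesystem supporting extended attributes (ext4, XFS and btrfs all do).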
