diff files comparing only first n characters of each line

I have two files; let us call them md5s1.txt and md5s2.txt. Both contain the output of a

find -type f -print0 | xargs -0 md5sum | sort > md5s.txt

command, run in different directories. Many files were renamed, but their content stayed the same, so they should have the same md5sum. I want to generate a diff like

diff md5s1.txt md5s2.txt

but it should compare only the first 32 characters of each line, i.e. only the md5sum, not the filename. Lines with equal md5sums should be considered equal. The output should be in normal diff format.

Easy starter:

diff <(cut -d' ' -f1 md5s1.txt)  <(cut -d' ' -f1 md5s2.txt)
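
To see what this starter reports, here is a minimal sketch on made-up sample data (the file names and the temporary directory are assumptions for illustration only). It uses cut -c -32 instead of cut -d' ' -f1, which selects the same column for md5sum output, and temporary files instead of process substitution so that it also runs under a plain POSIX sh. Note that the result lists only the checksums, not the file names.

```shell
# Hypothetical sample data: one file renamed, one file added in the second tree.
tmp=$(mktemp -d)
printf '%s  %s\n' \
    0cc175b9c0f1b6a831c399e269772661 ./a.txt \
    92eb5ffee6ae2fec3ad71c777531578f ./b.txt > "$tmp/md5s1.txt"
printf '%s  %s\n' \
    0cc175b9c0f1b6a831c399e269772661 ./renamed.txt \
    4a8a08f09d37b73795649038408b5f33 ./c.txt \
    92eb5ffee6ae2fec3ad71c777531578f ./b.txt > "$tmp/md5s2.txt"

# Keep only the first 32 characters (the checksum) of every line.
cut -c -32 "$tmp/md5s1.txt" > "$tmp/hashes1"
cut -c -32 "$tmp/md5s2.txt" > "$tmp/hashes2"

# diff exits with status 1 when the inputs differ, hence the "|| true".
hash_diff=$(diff "$tmp/hashes1" "$tmp/hashes2" || true)
printf '%s\n' "$hash_diff"

rm -rf "$tmp"
```

The renamed file produces no difference, because only the checksum column is compared.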

Also, consider just

diff -EwburqN folder1/ folder2/

Compare only the md5 column by running diff on <(cut -c -32 md5sums.sort.XXX), and tell diff to print just the line numbers of added or removed lines, using --old-line-format='%dn'$'\n' or --new-line-format='%dn'$'\n'. Pipe this into ed md5sums.sort.XXX so that it prints only those lines from the md5sums.sort.XXX file.

diff \
    --new-line-format='%dn'$'\n' \
    --old-line-format='' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
    | ed -s md5sums.sort.new \
    > files-added
diff \
    --new-line-format='' \
    --old-line-format='%dn'$'\n' \
    --unchanged-line-format='' \
    <(cut -c -32 md5sums.sort.old) \
    <(cut -c -32 md5sums.sort.new) \
    | ed -s md5sums.sort.old \
    > files-removed
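
If a set difference on the checksum column is enough, a single awk pass may be simpler. The following is only a sketch of a different technique, not the diff-based method above: the function name list_new_hashes and the sample data are hypothetical, and unlike diff it ignores how often a checksum occurs, so an extra copy of an already known file is not reported.

```shell
# list_new_hashes OLD NEW (hypothetical helper): print the lines of NEW whose
# first 32 characters do not start any line of OLD.
list_new_hashes() {
    awk 'FILENAME == ARGV[1] { seen[substr($0, 1, 32)] = 1; next }
         !(substr($0, 1, 32) in seen)' "$1" "$2"
}

# Made-up sample data: a.txt was renamed, d.txt was removed, c.txt was added.
tmp=$(mktemp -d)
printf '%s  %s\n' \
    0cc175b9c0f1b6a831c399e269772661 ./a.txt \
    8277e0910d750195b448797616e091ad ./d.txt \
    92eb5ffee6ae2fec3ad71c777531578f ./b.txt > "$tmp/md5sums.sort.old"
printf '%s  %s\n' \
    0cc175b9c0f1b6a831c399e269772661 ./renamed.txt \
    4a8a08f09d37b73795649038408b5f33 ./c.txt \
    92eb5ffee6ae2fec3ad71c777531578f ./b.txt > "$tmp/md5sums.sort.new"

# Swapping the arguments turns "added" into "removed".
added=$(list_new_hashes "$tmp/md5sums.sort.old" "$tmp/md5sums.sort.new")
removed=$(list_new_hashes "$tmp/md5sums.sort.new" "$tmp/md5sums.sort.old")
printf 'added:   %s\nremoved: %s\n' "$added" "$removed"

rm -rf "$tmp"
```

This keeps the checksums of the first file in memory, so it trades memory for not needing sorted input.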

The problem with ed is that it loads the entire file into memory, which can be a problem if you have a lot of checksums. Instead of piping the output of diff into ed, pipe it into the following command, which uses much less memory.

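diff … | (
    lnum=0
    # Each line read from diff is a line number to print.
    while read -r lprint; do
        # Advance through the checksum file (fd 3) up to that line.
        while [ "$lnum" -lt "$lprint" ]; do IFS= read -r line <&3; ((lnum++)); done
        # Quote the line so the spacing inside it is preserved.
        printf '%s\n' "$line"
    done
) 3<md5sums.sort.XXX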
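
The same loop can be wrapped in a function and tried on a small file. This is a sketch under assumptions: the function name print_lines and the sample file are hypothetical, and the line numbers on standard input must be in ascending order, as they are when they come from diff.

```shell
# print_lines FILE (hypothetical helper): read ascending line numbers from
# standard input and print the matching lines of FILE, holding one line at a time.
print_lines() {
    lnum=0
    while read -r want; do
        while [ "$lnum" -lt "$want" ]; do
            IFS= read -r line <&3 || return 1   # FILE is shorter than requested
            lnum=$((lnum + 1))
        done
        printf '%s\n' "$line"
    done 3< "$1"
}

# Made-up four-line sample file; the double space mimics md5sum output.
tmp=$(mktemp -d)
printf '%s\n' 'aaa  ./one' 'bbb  ./two' 'ccc  ./three' 'ddd  ./four' > "$tmp/sample.txt"

# Ask for lines 2 and 4.
picked=$(printf '%s\n' 2 4 | print_lines "$tmp/sample.txt")
printf '%s\n' "$picked"

rm -rf "$tmp"
```

Because each line is read with IFS= and -r and printed with a quoted printf, the spacing between checksum and file name is kept as it is in the file.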

If you are looking for duplicate files, fdupes can do this for you:

$ fdupes --recurse .

On Ubuntu you can install it with

$ apt-get install fdupes
