
How to remove only the duplicate files under some directory (with the same cksum)

I built the following script in order to remove files that have the same cksum (or content).

The problem is that the script can remove files twice, as in the following example (output).

My target is to remove only the duplicate file and not the source file.

SCRIPT OUTPUT:

  Starting:
  Same: /tmp/File_inventury.out /tmp/File_inventury.out.1
  Remove: /tmp/File_inventury.out.1
  Same: /tmp/File_inventury.out.1 /tmp/File_inventury.out
  Remove: /tmp/File_inventury.out
  Same: /tmp/File_inventury.out.2 /tmp/File_inventury.out.3
  Remove: /tmp/File_inventury.out.3
  Same: /tmp/File_inventury.out.3 /tmp/File_inventury.out.2
  Remove: /tmp/File_inventury.out.2
  Same: /tmp/File_inventury.out.4 /tmp/File_inventury.out
  Remove: /tmp/File_inventury.out
  Done.


MY SCRIPT:

#!/bin/bash
DIR="/tmp"
echo "Starting:"
for file1 in ${DIR}/File_inventury.out*; do
    for file2 in ${DIR}/File_inventury.out*; do
        if [ "$file1" != "$file2" ]; then
            diff "$file1" "$file2" 1>/dev/null
            STAT=$?
            if [ $STAT -eq 0 ]; then
                echo "Same: $file1 $file2"
                echo "Remove: $file2"
                rm "$file1"
                break
            fi
        fi
    done
done
echo "Done."


In any case, I would like to hear other options for removing files that have the same content or cksum (I actually need to remove only the duplicate file, not the primary file).

Please advise how we can do that under the Solaris OS (options, for example: a find one-liner, awk, sed, etc.).
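One option in that spirit is a single pass over cksum output, keeping the first file seen for each checksum and removing the rest. The sketch below builds its own demo files in a cksum_demo directory (the names and contents are invented for illustration) and assumes filenames without whitespace, since cksum prints checksum, size, and name separated by spaces:

```shell
# Create a small demo directory (hypothetical names, for illustration only).
dir=cksum_demo
mkdir -p "$dir"
printf 'aaa\n' > "$dir/File_inventury.out"      # original
printf 'aaa\n' > "$dir/File_inventury.out.1"    # duplicate of .out
printf 'bbb\n' > "$dir/File_inventury.out.2"    # unique content

# One pass: awk prints every file whose checksum was already seen,
# so only the first occurrence of each checksum is kept.
cksum "$dir"/File_inventury.out* |
    awk 'seen[$1]++ { print $3 }' |
    while read -r dup; do
        echo "Remove: $dup"
        rm "$dup"
    done
```

Because the glob expands in sorted order, File_inventury.out precedes File_inventury.out.1, so the unnumbered original is the copy that survives.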

This version should be more efficient. I was nervous about paste matching the correct rows, but it looks like POSIX specifies that globbing is sorted by default.

for i in *; do
    date -u +%Y-%m-%dT%TZ -r "$i";
done > .stat;         #store the last modification time in a sortable format
cksum * > .cksum;     #store the cksum, size, and filename
paste .stat .cksum |  #data for each file, 1 per row
    sort |            #sort by mtime so original comes first
    awk '{
        if($2 in f)
            system("rm -v " $4); #rm if we have seen an occurrence of this cksum
        else
            f[$2]++              #count the first occurrence
    }'

This should run in O(n * log(n)) time, reading each file only once.

You can put this in a shell script as:

#!/bin/sh

for i in *; do
    date -u +%Y-%m-%dT%TZ -r "$i";
done > .stat;
cksum * > .cksum;
paste .stat .cksum | sort | awk '{if($2 in f) system("rm -v " $4); else f[$2]++}';
rm .stat .cksum;
exit 0;

Or do it as a one-liner:

for i in *; do date -u +%Y-%m-%dT%TZ -r "$i"; done > .stat; cksum * > .cksum; paste .stat .cksum | sort | awk '{if($2 in f) system("rm -v " $4); else f[$2]++}'; rm .stat .cksum;
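For reference, here is a self-contained run of the recipe above; the paste_demo directory and file names are invented, and touch -t backdates the "original" so that sort keeps it first. The -v flag is dropped from rm here, since some rm implementations do not support it:

```shell
#!/bin/sh
# Demo: of two identical files, the one with the older mtime survives.
dir=paste_demo
mkdir -p "$dir"
printf 'same\n' > "$dir/orig"
printf 'same\n' > "$dir/copy"
touch -t 202001010000 "$dir/orig"   # backdate the original

for i in "$dir"/*; do
    date -u +%Y-%m-%dT%TZ -r "$i"
done > "$dir/.stat"                 # one mtime per file, in glob order
cksum "$dir"/* > "$dir/.cksum"      # checksums in the same glob order

# Pair each mtime with its cksum row, sort so the oldest copy comes
# first, then remove every later file whose checksum was already seen.
paste "$dir/.stat" "$dir/.cksum" |
    sort |
    awk '{ if ($2 in f) system("rm " $4); else f[$2]++ }'
rm "$dir/.stat" "$dir/.cksum"
```

The dotfiles .stat and .cksum are excluded from the glob automatically, so they never pair with themselves.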

I used an array as an index map, so I think it is just O(n)?

#!/bin/bash

arr=()
dels=()
for f in $1; do                          # $1 is a glob pattern, expanded here
  read -r ck x fn <<< "$(cksum "$f")"    # checksum, size, filename
  if [[ -z ${arr[$ck]} ]]; then
    arr[$ck]=$fn                         # first file with this cksum: keep it
  else
    echo "Same: ${arr[$ck]} $fn"
    echo "Remove: $fn"
    rm "$fn"
  fi
done
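The script above indexes a plain bash array by the checksum value, which works but allocates sparse numeric indices. A minimal self-contained sketch of the same idea with an associative array (declare -A, bash 4+) follows; the arr_demo directory and its file contents are invented for the demo:

```shell
#!/bin/bash
# Checksum-keyed dedupe using an associative array: one cksum per file.
declare -A seen
dir=arr_demo
mkdir -p "$dir"
printf 'xx\n' > "$dir/a"
printf 'xx\n' > "$dir/b"    # duplicate of a
printf 'yy\n' > "$dir/c"    # unique

for f in "$dir"/*; do
    read -r ck _ <<< "$(cksum "$f")"   # first field is the checksum
    if [[ -z ${seen[$ck]} ]]; then
        seen[$ck]=$f                   # first occurrence: remember and keep
    else
        echo "Same: ${seen[$ck]} $f"
        rm "$f"                        # later occurrence: remove
    fi
done
```

Since the glob expands in sorted order, the earliest-sorting name with each checksum is the one that is kept.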
