I wrote the following script to remove files that have the same cksum (or content).
The problem is that the script can report, and remove, both files of a pair, as the following example output shows.
My goal is to remove only the duplicate file and not the source file.
SCRIPT OUTPUT:
Starting:
Same: /tmp/File_inventury.out /tmp/File_inventury.out.1
Remove: /tmp/File_inventury.out.1
Same: /tmp/File_inventury.out.1 /tmp/File_inventury.out
Remove: /tmp/File_inventury.out
Same: /tmp/File_inventury.out.2 /tmp/File_inventury.out.3
Remove: /tmp/File_inventury.out.3
Same: /tmp/File_inventury.out.3 /tmp/File_inventury.out.2
Remove: /tmp/File_inventury.out.2
Same: /tmp/File_inventury.out.4 /tmp/File_inventury.out
Remove: /tmp/File_inventury.out
Done.
MY SCRIPT:
#!/bin/bash
DIR="/tmp"
echo "Starting:"
for file1 in ${DIR}/File_inventury.out*; do
    for file2 in ${DIR}/File_inventury.out*; do
        if [ $file1 != $file2 ]; then
            diff "$file1" "$file2" 1>/dev/null
            STAT=$?
            if [ $STAT -eq 0 ]
            then
                echo "Same: $file1 $file2"
                echo "Remove: $file2"
                rm "$file1"
                break
            fi
        fi
    done
done
echo "Done."
In any case I would like to hear about other ways to remove files that have the same content or cksum (I only need to remove the duplicate file, not the primary file).
Please advise how this can be done under Solaris (for example with a find one-liner, awk, sed, etc.).
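One way to repair the nested loop itself, as a minimal sketch rather than a definitive fix: always delete `file2` (the copy), skip any file that an earlier pass has already removed, and drop the `break` so every copy of `file1` is caught. The function name `dedupe_pairwise` is made up for illustration, and `cmp -s` is swapped in for `diff` because only the exit status is needed. It still compares every pair, so it is O(n^2) in the number of files.

```shell
# Sketch of a corrected nested loop; dedupe_pairwise is a hypothetical name.
# The first file in glob order is treated as the original of each group.
dedupe_pairwise() {
    dir=${1:-/tmp}
    for file1 in "$dir"/File_inventury.out*; do
        [ -f "$file1" ] || continue              # already removed as a duplicate
        for file2 in "$dir"/File_inventury.out*; do
            [ "$file1" = "$file2" ] && continue  # never compare a file with itself
            [ -f "$file2" ] || continue          # removed earlier in this pass
            if cmp -s "$file1" "$file2"; then
                echo "Same: $file1 $file2"
                echo "Remove: $file2"
                rm "$file2"                      # delete the copy, never file1
            fi
        done
    done
}
```

Called as `dedupe_pairwise /tmp`, each pair is reported once, and a file that has served as the original for a pair is never deleted later.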
This version should be more efficient. I was nervous about paste matching the correct rows, but it looks like POSIX specifies that globbing is sorted by default, so `for i in *` and `cksum *` visit the files in the same order.
for i in *; do
    # date -r FILE prints the file's mtime; this is a GNU extension, not POSIX
    date -u +%Y-%m-%dT%TZ -r "$i"
done > .stat            # store the last modification time in a sortable format
cksum * > .cksum        # store the cksum, size, and filename
paste .stat .cksum |    # data for each file, one per row: mtime, cksum, size, name
sort |                  # sort by mtime so the original comes first
awk '{
    if ($2 in f)
        system("rm -v " $4)   # rm if we have seen an occurrence of this cksum
    else
        f[$2]++               # count the first occurrence
}'
This should run in O(n * log(n)) time, reading each file only once.
You can put this in a shell script as:
#!/bin/sh
for i in *; do
date -u +%Y-%m-%dT%TZ -r "$i";
done > .stat;
cksum * > .cksum;
paste .stat .cksum | sort | awk '{if($2 in f) system("rm -v " $4); else f[$2]++}';
rm .stat .cksum;
exit 0;
Or do it as a one-liner:
for i in *; do date -u +%Y-%m-%dT%TZ -r "$i"; done > .stat; cksum * > .cksum; paste .stat .cksum | sort | awk '{if($2 in f) system("rm -v " $4); else f[$2]++}'; rm .stat .cksum;
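Since `date -r FILE` is a GNU extension and may not be available on Solaris, here is a rough sketch of the same idea using only `ls -tr` for the oldest-first ordering. The function name `dedupe_keep_oldest` is made up for illustration. It assumes no file name contains a newline and that duplicates have distinct modification times; files with equal mtimes are kept or removed in whatever order `ls` lists them. The key is the cksum plus the size, and the name is taken as the rest of the line so that names with spaces survive.

```shell
# Hypothetical helper: keep the oldest file of each checksum group in a directory.
dedupe_keep_oldest() {
    (
        cd "$1" || exit 1
        # ls -tr lists oldest first; names containing newlines are not supported.
        ls -tr | while IFS= read -r name; do
            [ -f "$name" ] || continue
            # cksum reading stdin prints "<crc> <size>" without a file name
            printf '%s %s\n' "$(cksum < "$name")" "$name"
        done |
        awk '{
            key = $1 " " $2                      # crc and size together
            name = substr($0, length(key) + 2)   # rest of the line is the name
            if (key in seen) print name; else seen[key] = 1
        }' |
        while IFS= read -r dup; do
            echo "Remove: $dup"
            rm -- "$dup"
        done
    )
}
```

For example, `dedupe_keep_oldest /tmp/inventory` would process every regular file in that (hypothetical) directory, so point it at a directory that holds only the candidate files.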
I used an array as an index map, so I think it is just O(n)?
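#!/bin/bash
# Assumed usage: pass the glob pattern quoted, e.g. '/tmp/File_inventury.out*'
arr=()                                   # sparse indexed array: cksum -> first file seen
for f in $1; do                          # $1 is left unquoted so the pattern expands here
    read -r ck x fn <<< "$(cksum "$f")"  # cksum prints: checksum, size, filename
    if [[ -z ${arr[$ck]} ]]; then
        arr[$ck]=$fn
    else
        echo "Same: ${arr[$ck]} $fn"
        echo "Remove: $fn"
        rm "$fn"
    fi
done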
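A matching cksum does not strictly prove identical content, so a more cautious variant of the index-map idea could confirm each candidate with `cmp -s` before deleting it. The following is only a sketch in plain `sh` (an awk array plays the role of the map); the function name `remove_confirmed_dups` is made up, and file names containing tabs or newlines are assumed not to occur.

```shell
# Hypothetical helper: remove files that duplicate an earlier argument.
# cksum narrows the candidates; cmp -s confirms before anything is deleted.
remove_confirmed_dups() {
    tab=$(printf '\t')
    for f in "$@"; do
        [ -f "$f" ] || continue
        # cksum reading stdin prints only "<crc> <size>", used here as the key
        printf '%s\t%s\n' "$(cksum < "$f")" "$f"
    done |
    awk -F '\t' '
        ($1 in first) { print first[$1] "\t" $2; next }  # candidate: keeper, copy
        { first[$1] = $2 }                               # first file with this key
    ' |
    while IFS="$tab" read -r keep dup; do
        if cmp -s "$keep" "$dup"; then
            echo "Same: $keep $dup"
            echo "Remove: $dup"
            rm -- "$dup"
        fi
    done
}
```

Here the shell expands the pattern before the call, e.g. `remove_confirmed_dups /tmp/File_inventury.out*`, and the first argument of each identical group is the one that is kept.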