I am a chemist and an average Python programmer. I am trying to compare different molecules that are saved as .xyz files in a folder. There is a script available on our computing cluster (comparestructures) that can compare any two molecules and tell whether they are similar/identical. I need to compare all the molecules with each other to identify the duplicates/similar ones so I can remove them from the study.
I have tried the following shell script, which runs over all the molecules passed to it as arguments (comp1 is short for compound1):
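#!/bin/sh
for comp1; do
    shift    # drop comp1 so the inner loop only sees the files after it
    for comp2; do
        echo "Comparing '$comp1' with '$comp2'"
        # if/else instead of "a && b || c", which would also run c when b fails
        if comparestructures "$comp1" "$comp2"; then
            echo "${comp1%.*} is-identical-to ${comp2%.*}" >> identical.txt
        else
            echo "$comp1 is-different-than $comp2" >> different.txt
        fi
    done
done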
The problem is that I get a list from which I cannot easily tell which molecules to delete, because some molecules occur on both the left and the right side of the output. Is there any way to get a list containing only one molecule from each similar pair, so that I can delete those and still keep the unique ones? I need this for my research work, and help would be much appreciated.
If A is identical to B and B is identical to C, I think you want B and C to be removed and A to be kept. Now, what you can do is the following:
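#!/bin/bash
# Run inside the folder that holds the .xyz files.
for A in *.xyz; do
    # Skip files that were already moved away as duplicates of an earlier one.
    [[ -e $A ]] || continue
    mkdir identical
    for B in *.xyz; do
        [[ -e $B ]] || continue
        # Move every other file that matches A into the identical directory.
        [ "$A" != "$B" ] && comparestructures "$A" "$B" && mv "$B" identical
    done
    # Deletes the duplicates of A for good.
    rm -r identical
done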
This is clearly not the best and fastest solution but I am too tired to imagine a better one. If you want to test this script, I'd suggest that you put the mkdir identical expression outside the loop and remove the rm -r identical line, then just see if there are any problems with it.
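If you would rather get a list of the duplicates than move or delete anything, the same idea can be sketched as a function that only prints names. This is a minimal sketch, not a tested solution for your cluster: list_duplicates is a name made up here, and it assumes (as the script in the question already does) that comparestructures exits with status 0 when two structures match. It also assumes that "identical" is transitive, so a file already flagged as a duplicate does not need to be compared again.

```shell
#!/bin/sh
# Sketch: print one line per duplicate file. The first file of each group of
# identical structures is kept and never printed.
# Assumption: comparestructures exits with status 0 when the two files match.
# list_duplicates is a hypothetical helper name, not part of any existing tool.
list_duplicates() (
    nl='
'
    dups=$nl    # newline-delimited list of names already flagged as duplicates
    for a in "$@"; do
        shift   # "$@" now holds only the files that come after $a
        # A flagged file is covered by the file it duplicates, so skip it.
        case $dups in *"$nl$a$nl"*) continue ;; esac
        for b in "$@"; do
            case $dups in *"$nl$b$nl"*) continue ;; esac
            if comparestructures "$a" "$b" >/dev/null 2>&1; then
                printf '%s\n' "$b"
                dups=$dups$b$nl
            fi
        done
    done
)

# Example call (nothing is deleted, the names are only written to a file):
#   list_duplicates *.xyz > duplicates.txt
```

Each molecule then appears at most once in duplicates.txt, and never on the "keep" side, so you can check the list by eye before deleting anything. The body is wrapped in parentheses so that the helper variables stay inside a subshell and do not leak into your session.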