
Comparing a list of Molecules for Similarity

I am a chemist and an average Python programmer. I am trying to compare different molecules that are saved as .xyz files in a folder. There is a script available on our computing cluster (comparestructures) that can compare any two molecules and tell whether they are similar/identical. I need to compare all the molecules with each other to identify the duplicates/similar ones so I can remove them from the study.

I have tried the following bash script to run on all the molecules: (comp1 is short for compound1)

#!/bin/sh
for comp1; do
  shift
  for comp2; do
    echo "Comparing '$comp1' with '$comp2'"
    comparestructures "$comp1" "$comp2" && echo "${comp1%.*}" "is-identical-to" "${comp2%.*}" >> identical.txt || echo "$comp1" "is-different-than" "$comp2" >> different.txt
  done
done

The problem is that I get a list from which I cannot easily tell which molecules to delete, because some molecules occur on both the left and the right side of the output. Is there any way to get a list containing only one molecule from each similar pair, so that I can delete those and still keep the unique ones? I need this for my research work, and help would be much appreciated.

If A is identical to B and B is identical to C, I think you want B and C to be removed and A to be kept. In that case, you can do the following:

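for A in *.xyz; do
  # Skip files already moved away as duplicates of an earlier molecule
  [ -e "$A" ] || continue
  mkdir -p identical
  for B in *.xyz; do
    # Move every other file that matches A into identical/
    [ "$A" != "$B" ] && comparestructures "$A" "$B" && mv "$B" identical
  done
  # Permanently deletes the duplicates found for A
  rm -r identical
done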

This is clearly not the best or fastest solution, but I am too tired to think of a better one. If you want to test this script first, I'd suggest that you put the mkdir identical command before the outer loop and remove the rm -r identical line. The duplicates then stay in the identical directory, where you can check them, instead of being deleted.
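For a dry run that moves or deletes nothing, the same idea can be sketched so that it only writes the names of the duplicates to a file. This is a minimal sketch, not a tested solution for your cluster: it assumes comparestructures exits with status 0 when two structures match (as the question's script implies), and the names find_duplicates and COMPARE are made up for illustration.

```shell
#!/bin/sh
# COMPARE is a hypothetical hook so a different comparison command can be
# swapped in; by default it is the cluster's comparestructures, which is
# assumed to exit with status 0 when the two structures match.
COMPARE=${COMPARE:-comparestructures}

# find_duplicates OUTFILE FILE...
# Writes to OUTFILE every file that matches an earlier file in the list.
# The first member of each group of matching files is never listed.
find_duplicates() {
  out=$1
  shift
  : > "$out"                     # start with an empty duplicate list
  for a in "$@"; do
    shift                        # "$@" now holds only the files after $a
    # A file already listed as a duplicate needs no comparisons of its own
    grep -Fxq -- "$a" "$out" && continue
    for b in "$@"; do
      grep -Fxq -- "$b" "$out" && continue
      "$COMPARE" "$a" "$b" >/dev/null && printf '%s\n' "$b" >> "$out"
    done
  done
  return 0
}
```

Calling find_duplicates duplicates.txt *.xyz in the folder with the molecules should leave one file name per line in duplicates.txt. Every file not listed there is the kept representative of its group, so the list can be checked by hand before anything is deleted. Like the loop above, it treats similarity as transitive: whatever matches A is dropped in favour of A.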

