
Counting element symbols in Python

I'm a biologist and quite new to programming, but I'm trying to improve; my background is not in computer science.

I'm quite stuck on a problem.

We have some information about molecules; each line that begins with ATOM represents one atom of the entire molecule. For example, the first two lines:

ATOM      1  N   ARG A   1       0.609  18.920  11.647  1.00 18.79           N

ATOM      2  CA  ARG A   1       0.149  17.722  10.984  1.00 13.68           C

We are supposed to count the number of distinct atoms, or more precisely the last item of every line (C or N in the example).

We already have a function that reads the lines and extracts the last item, but I'm stuck at this point, because we should write the code as if we didn't already know which atoms we will find (although we do know, because we have the entire list: N, C, O and S).

Code we have:

import os

def count_atoms(molecule):
    number_atoms = dict()
    lines = molecule.split(os.linesep)
    for line in lines:
        if line.startswith('ATOM'):
            atom = line[77].strip()
            print(atom)
    # the counting itself is the part we are missing
    return number_atoms

results = count_atoms(molecule)

molecule represents the entire list.

I hope I understand you correctly: you want to count the occurrences of the last item on each line?

molecule = '''ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 Se
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 Pu
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 Pu
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C'''

def count_atoms(molecule):
    number_atoms = dict()
    # splitlines() handles '\n' and '\r\n' alike and needs no import
    for line in molecule.splitlines():
        if line.startswith('ATOM'):
            # the last whitespace-separated field is the element symbol
            atom = line.split()[-1]
            # start at 0 for a symbol seen for the first time
            number_atoms[atom] = number_atoms.get(atom, 0) + 1
    return number_atoms

print(count_atoms(molecule))

Output:

{'Se': 1, 'Pu': 2, 'N': 1, 'C': 2}
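
The bookkeeping for symbols seen for the first time can also be left to the standard library. This is only a minimal sketch, assuming the same space-separated lines as above; count_atoms_defaultdict and sample are names made up for this illustration:

```python
from collections import defaultdict

def count_atoms_defaultdict(molecule):
    """Count the last whitespace-separated field of every ATOM line."""
    number_atoms = defaultdict(int)  # missing keys start at 0
    for line in molecule.splitlines():
        if line.startswith('ATOM'):
            number_atoms[line.split()[-1]] += 1
    return dict(number_atoms)  # hand back a plain dict

# Made-up sample in the same shape as the lines in the question
sample = '''ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C
ATOM 3 CB ARG A 1 0.149 17.722 10.984 1.00 13.68 C'''
result = count_atoms_defaultdict(sample)
print(result)
```

No element symbol appears anywhere in the function, so it works without knowing in advance which atoms the molecule contains.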

Welcome to Python!

Python has lots of useful modules that take care of common problems.

To solve your problem you can import Counter from collections:

>>> from collections import Counter
>>> molecule = '''ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N
... ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C
... ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C'''
>>> Counter(line.split()[-1] for line in molecule.splitlines())
Counter({'C': 2, 'N': 1})

line.split()[-1] gets the last word of the line, so it also works for elements with longer chemical symbols; splitlines() separates the lines from each other.

Counters can be added to and subtracted from each other, which might be useful for you:

>>> mycount = Counter(line.split()[-1] for line in molecule.splitlines())
>>> mycount + mycount
Counter({'C': 4, 'N': 2})

This gives you not only the number of distinct atoms but also how often each one appears in the entire molecule. The number of distinct atoms can be retrieved by taking the len of the Counter:

>>> len(Counter(line.split()[-1] for line in molecule.splitlines()))
2
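
The one-liners above count the last word of every line, so in a complete file other record types would be counted as well, and an empty line would raise an IndexError. A minimal sketch of the same idea restricted to ATOM lines; the input text here is made up for illustration:

```python
from collections import Counter

# Hypothetical input mixing ATOM records with other lines
text = '''HEADER example
ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N
ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C

ATOM 3 C ARG A 1 0.149 17.722 10.984 1.00 13.68 C
END'''

element_counts = Counter(
    line.split()[-1]
    for line in text.splitlines()
    if line.startswith('ATOM')  # skips HEADER, END and the blank line
)
print(element_counts.most_common(1))  # most frequent element with its count
print(len(element_counts))            # number of distinct elements
```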

More elaborate example:

>>> molecule = '''ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N
... ATOM 2 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C
... ATOM 3 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 Se
... ATOM 4 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 Pu
... ATOM 5 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 Pu
... ATOM 6 CA ARG A 1 0.149 17.722 10.984 1.00 13.68 C'''
>>> Counter(line.split()[-1] for line in molecule.splitlines())
Counter({'C': 2, 'Pu': 2, 'N': 1, 'Se': 1})
>>> len(Counter(line.split()[-1] for line in molecule.splitlines()))
4

Although all the answers are correct in terms of Python, we have lines from a PDB file:

Record Format

COLUMNS        DATA  TYPE    FIELD        DEFINITION
-------------------------------------------------------------------------------------
 1 -  6        Record name   "ATOM  "
[...]
77 - 78        LString(2)    element      Element symbol, right-justified.
[...]

For elements such as selenium (SE), which occurs in plenty of protein structures, both characters in columns 77-78 need to be taken into account; otherwise the symbol is read as S (sulfur) or as E.
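
Reading the element by column could look like the sketch below. It assumes fixed-width lines that follow the record format quoted above; element_of, count_elements and the shortened, padded sample records are made up for this illustration:

```python
from collections import Counter

def element_of(line):
    """Return the element symbol from columns 77-78 of a PDB ATOM line."""
    # PDB columns are 1-based; Python slices are 0-based and end-exclusive,
    # so columns 77-78 correspond to line[76:78].
    return line[76:78].strip()

def count_elements(pdb_text):
    """Count element symbols over all ATOM records in the text."""
    return Counter(
        element_of(line)
        for line in pdb_text.splitlines()
        if line.startswith("ATOM")
    )

# Hypothetical records: the coordinates are left out and each line is
# padded to 76 characters so the symbol lands in columns 77-78.
records = [
    "ATOM      1  N   ARG A   1".ljust(76) + " N",
    "ATOM      2  CA  ARG A   1".ljust(76) + " C",
    "ATOM      3 SE   MSE A   2".ljust(76) + "SE",
    "REMARK this line is ignored",
]
counts = count_elements("\n".join(records))
print(counts)
```

The slice takes both columns and strip() removes the padding of one-letter symbols, so SE stays SE while " N" becomes N.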

If you don't want to deal with the whole parsing issue yourself, you can use BioPython's PDB module in combination with any of the solutions above.

from Bio.PDB import PDBParser
from collections import Counter
parser = PDBParser()
structure = parser.get_structure('PHA-L', '1fat.pdb')

atoms = list()
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                atoms.append(atom.element)

print(Counter(atoms))

Counter({'C': 4570, 'O': 1463, 'N': 1207, 'MN': 4, 'CA': 4})

Because the lines in your example don't all have the same length, accessing the data by a fixed index, as you do in atom = line[77].strip(), is a bad idea.

As you said, the information that distinguishes the atoms is the last character. So you can access just the last character with negative indexing, which works on strings the same way as on lists. Note that this only works for one-letter symbols.

>>> data = "ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N"
>>> print(data[-1])
N
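
For symbols with two letters, the last character is not enough. A small sketch of the difference between the last character and the last whitespace-separated word; the second record is made up for illustration:

```python
sample_lines = [
    "ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N",
    "ATOM 3 SE MSE A 2 0.149 17.722 10.984 1.00 13.68 Se",  # made-up record
]

# Last character only: cuts a two-letter symbol in half
last_chars = [line[-1] for line in sample_lines]
# Last whitespace-separated word: keeps the whole symbol
last_words = [line.split()[-1] for line in sample_lines]

print(last_chars)  # ['N', 'e']
print(last_words)  # ['N', 'Se']
```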

lines = ['ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N',
         'ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 C',
         'ATOM 1 N ARG A 1 0.609 18.920 11.647 1.00 18.79 N']

all_elements = {l.split()[-1] for l in lines}
counts = {element: 0 for element in all_elements}
for line in lines:
    counts[line.split()[-1]] += 1
counts
{'C': 1, 'N': 2}

This is how you count the number of atoms of each element. If you only need the number of distinct elements, you can just use len(counts).
