I'm trying to create a function (in Python) that takes its input (a chemical formula) and splits it into a list. For example, if the input was "HC2H3O2", it would turn it into:
molecule_list = ['H', 1, 'C', 2, 'H', 3, 'O', 2]
This works well so far, but if I input an element with two letters in it, for example sodium (Na), it splits it into:
['N', 'a']
I'm searching for a way to make my function look through the string for keys found in a dictionary called elements. I'm also considering using regex for this, but I'm not sure how to implement it. This is what my function is right now:
def split_molecule(inputted_molecule):
    """Take the input and split it into a list
    eg: CO2 => ['C', 1, 'O', 2]
    """
    # step 1: convert inputted_molecule to a list
    # step 2a: if there are two periodic elements next to each other, insert a '1'
    # step 2b: if the last element is an element, append a '1'
    # step 3: convert all numbers in list to ints

    # step 1:
    # problem: it splits Na into 'N', 'a'
    # it needs to split by periodic elements
    molecule_list = list(inputted_molecule)

    # because at most, the list can double when "1" is inserted
    max_length_of_molecule_list = 2*len(molecule_list)

    # step 2a:
    for i in range(0, max_length_of_molecule_list):
        try:
            if (molecule_list[i] in elements) and (molecule_list[i+1] in elements):
                molecule_list.insert(i+1, "1")
        except IndexError:
            break

    # step 2b:
    if (molecule_list[-1] in elements):
        molecule_list.append("1")

    # step 3:
    for i in range(0, len(molecule_list)):
        if molecule_list[i].isdigit():
            molecule_list[i] = int(molecule_list[i])

    return molecule_list
How about:
import re
print(re.findall(r'[A-Z][a-z]?|[0-9]+', 'Na2SO4MnO4'))
Result:
['Na', '2', 'S', 'O', '4', 'Mn', 'O', '4']
Regex explained:
Find everything that is either
[A-Z] # A,B,...Z, ie. an uppercase letter
[a-z] # followed by a,b,...z, ie. a lowercase letter
? # which is optional
| # or
[0-9] # 0,1,2...9, ie a digit
+ # and perhaps some more of them
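For what it's worth, the same breakdown can be kept inside the pattern itself with `re.VERBOSE`. This is only a sketch of the expression explained above; the name `TOKEN` is made up for the example.

```python
import re

# Same pattern as above; re.VERBOSE lets whitespace and comments live in the regex.
TOKEN = re.compile(r"""
    [A-Z]      # an uppercase letter
    [a-z]?     # optionally followed by one lowercase letter
    |          # or
    [0-9]+     # one or more digits
""", re.VERBOSE)

tokens = TOKEN.findall('Na2SO4MnO4')
print(tokens)
```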
This expression is pretty dumb since it accepts arbitrary "elements", like "Xy". You can improve it by replacing the [A-Z][a-z]? part with the actual list of elements' names, separated by |, like Ba|Na|Mn...|C|O
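A rough sketch of that idea, assuming the symbols live in a plain list. `SYMBOLS` here is a small illustrative subset, not the full periodic table, and `tokenize` is a hypothetical helper name. Two-letter symbols are sorted to the front so that `Na` is tried before `N`.

```python
import re

# Illustrative subset only (an assumption); a real table would list every symbol.
SYMBOLS = ['Ba', 'Na', 'Mn', 'C', 'O', 'S', 'H', 'N']

# Longest symbols first, so 'Na' is tried before 'N'.
alternation = '|'.join(sorted(SYMBOLS, key=len, reverse=True))
ELEMENT_OR_NUMBER = re.compile('(?:%s)|[0-9]+' % alternation)

def tokenize(formula):
    """Return the known element symbols and numbers found in formula."""
    return ELEMENT_OR_NUMBER.findall(formula)

print(tokenize('Na2SO4MnO4'))
```

Note that `re.findall` silently skips text it cannot match, so an unknown symbol such as "Xy" is dropped rather than reported; rejecting it outright would need an extra check.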
Of course, regular expressions can only handle very simple formulas. To parse something like
8(NH4)3P4Mo12O40 + 64NaNO3 + 149NH4NO3 + 135H2O
you're going to need a real parser, e.g. pyparsing (be sure to check "chemical formulas" under "Examples"). Good luck!
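To give a feel for what the extra machinery has to do, here is a small stack-based sketch using only the standard library. It is not pyparsing and not a full parser: it assumes a single species with balanced parentheses, and it ignores leading coefficients and the " + " between terms. The names `TOKEN` and `count_atoms` are invented for the example.

```python
import re
from collections import Counter

# symbol + optional count, or '(', or ')' + optional group multiplier
TOKEN = re.compile(r'([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)')

def count_atoms(formula):
    """Count atoms in one species, e.g. '(NH4)3P4Mo12O40' (balanced parens assumed)."""
    stack = [Counter()]  # one Counter per open parenthesis level
    for symbol, count, open_paren, close_paren, group_count in TOKEN.findall(formula):
        if symbol:
            stack[-1][symbol] += int(count or 1)
        elif open_paren:
            stack.append(Counter())
        elif close_paren:
            # Fold the finished group into the enclosing level, scaled by its multiplier.
            group = stack.pop()
            multiplier = int(group_count or 1)
            for sym, n in group.items():
                stack[-1][sym] += n * multiplier
    return dict(stack[0])

print(count_atoms('(NH4)3P4Mo12O40'))
```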
An expression like this will match all parts of interest:
[A-Z][a-z]*|\d+
You can use it with re.findall and then add the quantifier for atoms that have none.
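A minimal sketch of that two-step approach, reusing the question's function name `split_molecule` and also converting the counts to ints as the question asked:

```python
import re

def split_molecule(formula):
    """Split a simple formula into [symbol, count, symbol, count, ...]."""
    tokens = re.findall(r'[A-Z][a-z]*|\d+', formula)
    result = []
    for i, token in enumerate(tokens):
        if token.isdigit():
            result.append(int(token))
        else:
            result.append(token)
            # An atom with no number after it gets an implicit count of 1.
            next_is_count = i + 1 < len(tokens) and tokens[i + 1].isdigit()
            if not next_is_count:
                result.append(1)
    return result

print(split_molecule('HC2H3O2'))
```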
Or you could use a regex for that as well:
import re
molecule = 'NaHC2H3O2'
print(re.findall(r'[A-Z][a-z]*|\d+', re.sub(r'[A-Z][a-z]*(?![\da-z])', r'\g<0>1', molecule)))
Output:
['Na', '1', 'H', '1', 'C', '2', 'H', '3', 'O', '2']
The sub adds a 1 after all atoms not followed by a number.
The non-regex approach, which is a bit hackish and probably not the best, but it works:
formula = 'HC2H3O2Na'
m_list = []
for x in formula:
    if x.islower():
        # a lowercase letter belongs to the symbol that was just appended
        m_list[-1] += x
    else:
        m_list.append(x)
print(m_list)
Output:
['H', 'C', '2', 'H', '3', 'O', '2', 'Na']