Parsing chemical formulas with regular expressions

Question

I'm trying to do the following: given a single-column pandas.DataFrame of chemical formulas like

    formula
0   Hg0.7Cd0.3Te
1   CuBr
2   Lu
...

I would like to return a pandas.Series like

0         [(Hg, 0.7), (Cd, 0.3), (Te, 1)]
1                [(Cu, 1), (Br, 1)]
2                [(Lu, 1)]
...

So this is the desired output.

I've already tried something with a regular expression:

counts = pd.Series(formulae.values.flatten()).str.findall(r"([a-z]+)([0-9]+)", re.I)

but unfortunately my output is the following:

0         [(Hg, 0), (Cd, 0)]
1                         []
2                         []
3       [(Cu, 3), (SbSe, 4)]

so in some cases it does not recognize the different elements in the chemical formula.

Answer 1

There are a few things to be improved:

The number pattern does not allow floating-point numbers yet. Here, you can use ([0-9]+(?:[.][0-9]+)?) instead.
The number might not be present at all, so the number group needs to be made optional with a trailing ?.
The elements all start with an uppercase letter, followed by zero or more (zero or one?) lowercase letters. So the element name pattern would be [A-Z][a-z]*. That's important for distinguishing different elements with no number in between, e.g. 'CuBr' (so ignore-case wouldn't work here).
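The three points above can be checked directly with re.findall. The following is a minimal sketch, not part of the original answer: the variable names are illustrative, and the sample 'Cu3SbSe4' is an assumption inferred from row 3 of the question's output, since the question does not show that input.

```python
import re

# 'Cu3SbSe4' is an assumed input, inferred from the question's output row 3.
samples = ['Hg0.7Cd0.3Te', 'CuBr', 'Cu3SbSe4']

# Pattern from the question: case-insensitive letters, mandatory integer.
original = re.compile(r'([a-z]+)([0-9]+)', re.I)
# Corrected pattern: capitalised element name, optional integer or decimal count.
corrected = re.compile(r'([A-Z][a-z]*)([0-9]+(?:[.][0-9]+)?)?')

for s in samples:
    print(s)
    print('  original: ', original.findall(s))
    print('  corrected:', corrected.findall(s))
```

With the original pattern, 'CuBr' yields no match at all because a digit is mandatory, and 'SbSe' is swallowed as a single name because the ignore-case flag lets [a-z]+ run across the element boundary.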

Putting it all together:

from pprint import pprint
import re

formulae = ['Hg0.7Cd0.3Te', 'CuBr', 'Lu']

pattern = re.compile('([A-Z][a-z]*)([0-9]+(?:[.][0-9]+)?)?')

pprint([pattern.findall(f) for f in formulae])

This prints the following:

[[('Hg', '0.7'), ('Cd', '0.3'), ('Te', '')],
 [('Cu', ''), ('Br', '')],
 [('Lu', '')]]

As you can see, missing numbers are denoted by empty strings, which you need to postprocess manually. For example:

result = [pattern.findall(f) for f in formulae]
result = [[(e, float(n or 1)) for e, n in f] for f in result]

Answer 2

You can use

import pandas as pd
df = pd.DataFrame({'formula':['Hg0.7Cd0.3Te', 'CuBr', 'Lu']})
df['counts'] = df['formula'].str.findall(r'([A-Z][a-z]*)(\d+(?:\.\d+)?)?')
df['counts'] = df['counts'].apply(lambda x: [(a,b) if b else (a,1) for a,b in x])

Output:

>>> df['counts']
0    [(Hg, 0.7), (Cd, 0.3), (Te, 1)]
1                 [(Cu, 1), (Br, 1)]
2                          [(Lu, 1)]

Details:

([A-Z][a-z]*) - Group 1: an uppercase letter followed by zero or more lowercase letters
(\d+(?:\.\d+)?)? - an optional Group 2: one or more digits followed by an optional occurrence of a dot and one or more digits.

The df['counts'].apply(lambda x: [(a,b) if b else (a,1) for a,b in x]) call puts 1 in as the second item of each tuple where that item is empty.
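One detail the printed Series hides: the lambda leaves matched counts as strings ('0.7') while the default is the integer 1. Below is a minimal sketch of a numeric variant. It uses plain re.findall on a single formula, which is what Series.str.findall applies to each element, and to_numeric is a hypothetical helper name that is not part of the answer.

```python
import re

pattern = r'([A-Z][a-z]*)(\d+(?:\.\d+)?)?'
matches = re.findall(pattern, 'Hg0.7Cd0.3Te')

# Behaviour of the lambda in the answer: counts stay strings, default is int 1.
as_in_answer = [(a, b) if b else (a, 1) for a, b in matches]

def to_numeric(pairs):
    # Hypothetical helper: convert every count to float, defaulting to 1.0.
    return [(a, float(b) if b else 1.0) for a, b in pairs]

print(as_in_answer)         # [('Hg', '0.7'), ('Cd', '0.3'), ('Te', 1)]
print(to_numeric(matches))  # [('Hg', 0.7), ('Cd', 0.3), ('Te', 1.0)]
```

In the pandas version, passing to_numeric to .apply in place of the lambda, directly on the str.findall result, should give numeric counts throughout.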

Answer 3

I would use multiple replacements to introduce separators, split on the introduced separators, explode, and then filter. Code below:

import pandas as pd

df1 = pd.DataFrame({'formula': ['Hg0.7Cd0.3Te', 'CuBr', 'Lu']})

repl2 = lambda g: f'{g.group(1)}<'
repl3 = lambda g: f'{g.group(1)}>'
df1 = (df1.assign(formula1=df1['formula'].str.replace(r'((?<=[A-Z])\w)', repl3, regex=True)  # Introduce a > separator where an alphanumeric character follows a capital letter
                 .str.replace(r'(\d(?=[A-Z]))', repl2, regex=True))  # Introduce a < separator where a digit is followed by a capital letter
       .replace(regex={r'\>(?=0)': ',', r'\>': ',1 '})  # Replace > with ',' before a count starting with 0, otherwise with ',1 '
      )

df1 = df1.assign(formula1=df1['formula1'].str.split(r'\<|\s')).explode('formula1')  # Split on < or whitespace, then explode the dataframe

new = df1[df1['formula1'].str.contains(r'\w')]  # Keep only the rows that have details



    formula      formula1
0  Hg0.7Cd0.3Te   Hg,0.7
0  Hg0.7Cd0.3Te   Cd,0.3
0  Hg0.7Cd0.3Te     Te,1
1          CuBr     Cu,1
1          CuBr     Br,1
2            Lu     Lu,1

Answer 1
1 2022-02-03 12:19:40

Answer 2
1 2022-02-03 20:13:35

Answer 3
0 2022-02-03 14:17:31
