How to create a new column based on substrings in another column of a pandas DataFrame?

Question

I have a dataframe of the following structure:

import pandas as pd

df = pd.DataFrame({
    'Substance': ['(NPK) 20/10/6', '(NPK) Guayacan 10/20/30', '46%N / O%P2O5 (Urea)', '46%N / O%P2O5 (Urea)', '(NPK) DAP Diammonphosphat; 18/46/0'],
    'value': [0.2, 0.4, 0.6, 0.8, 0.9]
})

    Substance                           value
0   (NPK) 20/10/6                       0.2
1   (NPK) Guayacan 10/20/30             0.4
2   46%N / O%P2O5 (Urea)                0.6
3   46%N / O%P2O5 (Urea)                0.8
4   (NPK) DAP Diammonphosphat; 18/46/0  0.9

Now I want to create a new column with the short names of the substances:

test['Short Name'] = test['Substance'].apply(lambda x: 'Urea' if 
                                         any(i in x for i in 'Urea') else '(NPK)')

There are two issues with the last line of code. First of all, the output looks like this:

    Substance                           value   Short Name
0   (NPK) 20/10/6                       0.2     (NPK)
1   (NPK) Guayacan 10/20/30             0.4     Urea
2   46%N / O%P2O5 (Urea)                0.6     Urea
3   46%N / O%P2O5 (Urea)                0.8     Urea
4   (NPK) DAP Diammonphosphat; 18/46/0  0.9     (NPK)

So the second entry was also labeled with Urea although it should be NPK.

Furthermore, my actual data also produces the following warning, which I interestingly / annoyingly can't reproduce with the dummy data despite using the original substance names.

/var/folders/tf/hzv31v4x42q4_mnw4n8ldhsm0000gn/T/ipykernel_10743/136042259.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

Note: Since I have further substances, I will have to add more conditions to the if/else expression.

Edit: The substance names need to be mapped to the following list of short names:

Urea if Substance includes Urea
Calcium ammonium nitrate (CAN) if Substance includes CAN
Di-ammonium phosphate (DAP) if Substance includes DAP
Other complex NK, NPK fertilizer for all other cases

Expected output for the sample data would be

    Substance                           value   Short Name
0   (NPK) 20/10/6                       0.2     (NPK)
1   (NPK) Guayacan 10/20/30             0.4     (NPK)
2   46%N / O%P2O5 (Urea)                0.6     Urea
3   46%N / O%P2O5 (Urea)                0.8     Urea
4   (NPK) DAP Diammonphosphat; 18/46/0  0.9     (NPK)

Edit2: I would then like to add a statement such that I receive the following output:

    Substance                           value   Short Name
0   (NPK) 20/10/6                       0.2     (NPK)
1   (NPK) Guayacan 10/20/30             0.4     (NPK)
2   46%N / O%P2O5 (Urea)                0.6     Urea
3   46%N / O%P2O5 (Urea)                0.8     Urea
4   (NPK) DAP Diammonphosphat; 18/46/0  0.9     DAP

Answer 1

Try this:

df['Short Name'] = df['Substance'].str.extract(r'\((.+?)\)')

Output:

>>> df
                 Substance  value Short Name
0            (NPK) 20/10/6    0.2        NPK
1  (NPK) Guayacan 10/20/30    0.4        NPK
2     46%N / O%P2O5 (Urea)    0.6       Urea
3     46%N / O%P2O5 (Urea)    0.8       Urea
4            (NPK) 20/10/6    0.9        NPK
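Note that the pattern above captures whatever sits inside the first pair of parentheses, so the DAP row from the question's second edit would still come out as NPK. A possible variation, sketched here under the assumption that the keywords Urea, CAN and DAP appear as whole, case-sensitive words, extracts the keyword directly and maps it to the long names listed in the question. The names `keywords`, `default`, `pattern` and `found` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'Substance': ['(NPK) 20/10/6', '(NPK) Guayacan 10/20/30',
                  '46%N / O%P2O5 (Urea)', '(NPK) DAP Diammonphosphat; 18/46/0'],
    'value': [0.2, 0.4, 0.6, 0.9],
})

# Hypothetical lookup table: keyword -> short name from the question's list.
keywords = {
    'Urea': 'Urea',
    'CAN': 'Calcium ammonium nitrate (CAN)',
    'DAP': 'Di-ammonium phosphate (DAP)',
}
default = 'Other complex NK, NPK fertilizer'

# \b keeps "CAN" from matching inside longer words; the match is also
# case-sensitive, so the lowercase "can" in "Guayacan" is not picked up.
pattern = r'\b(' + '|'.join(keywords) + r')\b'

# expand=False returns a Series; rows without a keyword become NaN.
found = df['Substance'].str.extract(pattern, expand=False)
df['Short Name'] = found.map(keywords).fillna(default)
```

Adding another substance then only requires a new entry in `keywords`.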

Answer 2

Works for me:

df['Short Name'] = df['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else '(NPK)')

>>> df
                 Substance  value   Short Name
0            (NPK) 20/10/6    0.2        (NPK)
1  (NPK) Guayacan 10/20/30    0.4        (NPK)
2     46%N / O%P2O5 (Urea)    0.6         Urea
3     46%N / O%P2O5 (Urea)    0.8         Urea
4            (NPK) 20/10/6    0.9        (NPK)
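For reference, the likely reason the version in the question mislabels rows: `for i in 'Urea'` iterates over the single characters `'U'`, `'r'`, `'e'`, `'a'`, so `any(...)` is true as soon as the substance contains any one of those letters. A minimal illustration on a plain string (the variable names are only for this example):

```python
name = '(NPK) Guayacan 10/20/30'

# Iterating over a string yields single characters, so this asks
# "does the name contain 'U' or 'r' or 'e' or 'a'?"
per_character = any(ch in name for ch in 'Urea')

# A plain substring test asks whether the whole word occurs.
whole_word = 'Urea' in name

# To test several keywords, iterate over a list of strings instead.
any_keyword = any(key in name for key in ['Urea', 'DAP'])

print(per_character, whole_word, any_keyword)  # True False False
```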

regex:

import re

# re.I already makes the search case-insensitive, so no .lower() is needed
short = re.compile(r"urea", re.I)
df['Short Name'] = df['Substance'].apply(lambda x: 'Urea' if short.search(x) else '(NPK)')
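Regarding the `SettingWithCopyWarning` mentioned in the question: it typically appears when the frame being assigned to was itself obtained by slicing or filtering another DataFrame, which would explain why freshly built dummy data does not trigger it. Whether that is what happens with the real data is an assumption, since the question does not show how `test` is created. A sketch of one common way to avoid it, using a hypothetical filter on `value`:

```python
import pandas as pd

df = pd.DataFrame({
    'Substance': ['(NPK) 20/10/6', '46%N / O%P2O5 (Urea)'],
    'value': [0.2, 0.6],
})

# Hypothetical filtering step; .copy() makes `test` an independent DataFrame,
# so adding a column to it no longer touches a slice of `df`.
test = df[df['value'] > 0.1].copy()

test['Short Name'] = test['Substance'].apply(
    lambda x: 'Urea' if 'Urea' in x else '(NPK)'
)
```

The original `df` is left unchanged; only `test` gains the new column.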

Answer 3

Not the neatest solution but at least a solution:

test['Short Name'] = test['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')
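As more substances are added, the chained conditional gets hard to read. One possible alternative, sketched here with the hypothetical names `RULES`, `DEFAULT` and `short_name`, keeps the keyword-to-name pairs in an ordered list and returns the first match. The short names are the ones listed in the question's first edit; the plain case-sensitive substring test is an assumption and may need tightening for real data.

```python
# Ordered (keyword, short name) pairs; the first keyword found wins.
RULES = [
    ('Urea', 'Urea'),
    ('CAN', 'Calcium ammonium nitrate (CAN)'),
    ('DAP', 'Di-ammonium phosphate (DAP)'),
]
DEFAULT = 'Other complex NK, NPK fertilizer'


def short_name(substance: str) -> str:
    """Return the short name of the first keyword contained in `substance`."""
    for keyword, name in RULES:
        if keyword in substance:
            return name
    return DEFAULT


print(short_name('(NPK) DAP Diammonphosphat; 18/46/0'))  # Di-ammonium phosphate (DAP)
print(short_name('(NPK) Guayacan 10/20/30'))  # Other complex NK, NPK fertilizer
```

It can then be applied to the column with `test['Short Name'] = test['Substance'].apply(short_name)`.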

3 answers

Answer 1
2 2021-11-23 19:19:19

Answer 2
2 2021-11-23 19:24:52

Answer 3
0 2021-11-23 20:13:51
