How to create a new column based on substrings in another column in a pandas DataFrame?

I have a dataframe of the following structure:

import pandas as pd

df = pd.DataFrame({
    'Substance': ['(NPK) 20/10/6', '(NPK) Guayacan 10/20/30', '46%N / O%P2O5 (Urea)', '46%N / O%P2O5 (Urea)', '(NPK) DAP Diammonphosphat; 18/46/0'],
    'value': [0.2, 0.4, 0.6, 0.8, 0.9]
})

    Substance                           value
0   (NPK) 20/10/6                       0.2
1   (NPK) Guayacan 10/20/30             0.4
2   46%N / O%P2O5 (Urea)                0.6
3   46%N / O%P2O5 (Urea)                0.8
4   (NPK) DAP Diammonphosphat; 18/46/0  0.9

Now I want to create a new column with the short names of the substances:

test['Short Name'] = test['Substance'].apply(lambda x: 'Urea' if 
                                         any(i in x for i in 'Urea') else '(NPK)')

There are two issues with the last line of code. First of all, the output looks like this:

    Substance                           value   Short Name
0   (NPK) 20/10/6                       0.2     (NPK)
1   (NPK) Guayacan 10/20/30             0.4     Urea
2   46%N / O%P2O5 (Urea)                0.6     Urea
3   46%N / O%P2O5 (Urea)                0.8     Urea
4   (NPK) DAP Diammonphosphat; 18/46/0  0.9     (NPK)

So the second entry was also labeled Urea, although it should be NPK.
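
A likely cause of the mislabel, sketched on the sample strings: iterating over the string 'Urea' yields its individual characters, so the check succeeds as soon as any one of 'U', 'r', 'e' or 'a' appears in the substance name. The helper names below are hypothetical and only serve the illustration.

```python
# Iterating over a string yields single characters, not the whole word.
chars = list('Urea')  # ['U', 'r', 'e', 'a']

def matches_any_char(name):
    # Same check as in the question: true if ANY single character occurs.
    return any(ch in name for ch in 'Urea')

def matches_word(name):
    # Plain substring test: true only if the whole word 'Urea' occurs.
    return 'Urea' in name

name = '(NPK) Guayacan 10/20/30'
print(matches_any_char(name))  # True, because 'Guayacan' contains 'a'
print(matches_word(name))      # False
```

Replacing the `any(...)` generator with the plain `'Urea' in x` test should therefore be enough to label the second row correctly.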

Furthermore, my actual data also produces the following warning, which I interestingly / annoyingly can't reproduce with the dummy data despite using the original substance names.

/var/folders/tf/hzv31v4x42q4_mnw4n8ldhsm0000gn/T/ipykernel_10743/136042259.py:5: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
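
One possible explanation, which is an assumption because the question does not show how `test` was created: in the real code `test` may have been produced by filtering or slicing another DataFrame, while the dummy `df` is built from scratch. Under that assumption, taking an explicit copy is a common way to avoid the warning. The parent frame `raw` and the filter condition below are hypothetical.

```python
import pandas as pd

# Hypothetical parent frame; the question does not show how `test` was built.
raw = pd.DataFrame({
    'Substance': ['(NPK) 20/10/6', '46%N / O%P2O5 (Urea)'],
    'value': [0.2, 0.6],
})

# Filtering yields an object that pandas may treat as a slice of `raw`.
# .copy() makes `test` an independent DataFrame, so adding a column to it
# is unambiguous and leaves `raw` untouched.
test = raw[raw['value'] > 0.1].copy()
test['Short Name'] = test['Substance'].apply(
    lambda x: 'Urea' if 'Urea' in x else '(NPK)'
)
print(test)
```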

Note: Since I have further substances, I will have to add more conditions to the if/else expression.

Edit: The substance names need to be mapped to the following list of short names:

  • Urea if Substance includes Urea
  • Calcium ammonium nitrate (CAN) if Substance includes CAN
  • Di-ammonium phosphate (DAP) if Substance includes DAP
  • Other complex NK, NPK fertilizer for all other cases

Expected output for the sample data would be:

    Substance                           value   Short Name
0   (NPK) 20/10/6                       0.2     (NPK)
1   (NPK) Guayacan 10/20/30             0.4     (NPK)
2   46%N / O%P2O5 (Urea)                0.6     Urea
3   46%N / O%P2O5 (Urea)                0.8     Urea
4   (NPK) DAP Diammonphosphat; 18/46/0  0.9     (NPK)

Edit2: I would then like to add a statement such that I receive the following output:

    Substance                           value   Short Name
0   (NPK) 20/10/6                       0.2     (NPK)
1   (NPK) Guayacan 10/20/30             0.4     (NPK)
2   46%N / O%P2O5 (Urea)                0.6     Urea
3   46%N / O%P2O5 (Urea)                0.8     Urea
4   (NPK) DAP Diammonphosphat; 18/46/0  0.9     DAP

Try this:

df['Short Name'] = df['Substance'].str.extract(r'\((.+?)\)')

Output:

>>> df
                 Substance  value Short Name
0            (NPK) 20/10/6    0.2        NPK
1  (NPK) Guayacan 10/20/30    0.4        NPK
2     46%N / O%P2O5 (Urea)    0.6       Urea
3     46%N / O%P2O5 (Urea)    0.8       Urea
4            (NPK) 20/10/6    0.9        NPK
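
Note that this pattern captures the text inside the first pair of parentheses only. For the last row of the question's data, '(NPK) DAP Diammonphosphat; 18/46/0', it would therefore return 'NPK' rather than 'DAP'. A small check on two of the sample strings, using `expand=False` to get a Series back:

```python
import pandas as pd

s = pd.Series(['46%N / O%P2O5 (Urea)', '(NPK) DAP Diammonphosphat; 18/46/0'])

# The non-greedy group captures the text inside the FIRST pair of parentheses.
short = s.str.extract(r'\((.+?)\)', expand=False)
print(short.tolist())  # ['Urea', 'NPK']
```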

Works for me:

df['Short Name'] = df['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else '(NPK)')
>>> df
                 Substance  value   Short Name
0            (NPK) 20/10/6    0.2        (NPK)
1  (NPK) Guayacan 10/20/30    0.4        (NPK)
2     46%N / O%P2O5 (Urea)    0.6         Urea
3     46%N / O%P2O5 (Urea)    0.8         Urea
4            (NPK) 20/10/6    0.9        (NPK)

Regex:

import re

# Case-insensitive search for "urea" anywhere in the string.
short = re.compile(r"urea", re.I)
df['Short Name'] = df['Substance'].apply(lambda x: 'Urea' if short.search(x) else '(NPK)')

Not the neatest solution but at least a solution:

test['Short Name'] = test['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')
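
Since the question mentions that more substances will follow, the chained conditional could be replaced by an ordered list of rules. This is only a sketch: the `rules` list and the `short_name` helper are hypothetical names, the 'CAN' entry is taken from the question's edit and is not exercised by the sample data, and the labels follow the expected output in Edit2 rather than the long names.

```python
import pandas as pd

df = pd.DataFrame({
    'Substance': ['(NPK) 20/10/6', '(NPK) Guayacan 10/20/30',
                  '46%N / O%P2O5 (Urea)', '46%N / O%P2O5 (Urea)',
                  '(NPK) DAP Diammonphosphat; 18/46/0'],
    'value': [0.2, 0.4, 0.6, 0.8, 0.9],
})

# Ordered (substring, short name) pairs; the first match wins.
# Matching is case-sensitive, so 'CAN' does not match 'Guayacan'.
rules = [('Urea', 'Urea'), ('DAP', 'DAP'), ('CAN', 'CAN')]

def short_name(substance, default='(NPK)'):
    # Return the label of the first rule whose substring occurs in the name.
    for key, label in rules:
        if key in substance:
            return label
    return default

df['Short Name'] = df['Substance'].apply(short_name)
print(df['Short Name'].tolist())
# ['(NPK)', '(NPK)', 'Urea', 'Urea', 'DAP']
```

Adding a substance then only requires a new pair in `rules`, and the order of the list decides which label wins when several substrings match.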

