Create a new pandas dataframe column based on other column of the dataframe
How to create new column based on substrings in other column in a pandas dataframe?
I have a dataframe structured as follows:
import pandas as pd

df = pd.DataFrame({
'Substance': ['(NPK) 20/10/6', '(NPK) Guayacan 10/20/30', '46%N / O%P2O5 (Urea)', '46%N / O%P2O5 (Urea)', '(NPK) DAP Diammonphosphat; 18/46/0'],
'value': [0.2, 0.4, 0.6, 0.8, .9]
})
Substance value
0 (NPK) 20/10/6 0.2
1 (NPK) Guayacan 10/20/30 0.4
2 46%N / O%P2O5 (Urea) 0.6
3 46%N / O%P2O5 (Urea) 0.8
4 (NPK) DAP Diammonphosphat; 18/46/0 0.9
Now I want to create a new column holding a short name for each substance:
test['Short Name'] = test['Substance'].apply(lambda x: 'Urea' if
any(i in x for i in 'Urea') else '(NPK)')
There are two problems with that last line of code. First, the output looks like this:
Substance value Short Name
0 (NPK) 20/10/6 0.2 (NPK)
1 (NPK) Guayacan 10/20/30 0.4 Urea
2 46%N / O%P2O5 (Urea) 0.6 Urea
3 46%N / O%P2O5 (Urea) 0.8 Urea
4 (NPK) DAP Diammonphosphat; 18/46/0 0.9 (NPK)
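So the second entry is also labelled Urea, although it should be NPK.
In addition, my real data produces the following warning, even though it contains the same original substance names. Interestingly (and annoyingly), I cannot reproduce the warning with the dummy data.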
/var/folders/tf/hzv31v4x42q4_mnw4n8ldhsm0000gn/T/ipykernel_10743/136042259.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
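This warning typically appears when the frame being assigned to was itself obtained by slicing or filtering another frame, which may be why freshly built dummy data does not trigger it. A minimal sketch, assuming `test` was created by filtering a larger frame (the name `raw` and the filter condition are hypothetical, not taken from the question); an explicit `.copy()` makes `test` an independent frame before the new column is added:

```python
import pandas as pd

# Hypothetical source frame standing in for the real data.
raw = pd.DataFrame({
    'Substance': ['(NPK) 20/10/6', '46%N / O%P2O5 (Urea)', 'other'],
    'value': [0.2, 0.6, 0.1],
})

# Hypothetical filter step; .copy() detaches the result from `raw`,
# so the assignment below targets an independent frame.
test = raw[raw['value'] > 0.15].copy()

test['Short Name'] = test['Substance'].apply(
    lambda x: 'Urea' if 'Urea' in x else '(NPK)'
)
```

Whether this matches the real pipeline depends on how `test` was actually built, which the question does not show.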
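Note: since I have more substances than shown here, I will have to add more conditions to the if/else chain.
Edit: the substance names need to be mapped to the following list of short names:
The expected output for the sample data would be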
Substance value Short Name
0 (NPK) 20/10/6 0.2 (NPK)
1 (NPK) Guayacan 10/20/30 0.4 (NPK)
2 46%N / O%P2O5 (Urea) 0.6 Urea
3 46%N / O%P2O5 (Urea) 0.8 Urea
4 (NPK) DAP Diammonphosphat; 18/46/0 0.9 (NPK)
Edit 2: I would then like to add a condition so that I get the following output:
Substance value Short Name
0 (NPK) 20/10/6 0.2 (NPK)
1 (NPK) Guayacan 10/20/30 0.4 (NPK)
2 46%N / O%P2O5 (Urea) 0.6 Urea
3 46%N / O%P2O5 (Urea) 0.8 Urea
4 (NPK) DAP Diammonphosphat; 18/46/0 0.9 DAP
Try this:
df['Short Name'] = df['Substance'].str.extract(r'\((.+?)\)')
Output:
>>> df
Substance value Short Name
0 (NPK) 20/10/6 0.2 NPK
1 (NPK) Guayacan 10/20/30 0.4 NPK
2 46%N / O%P2O5 (Urea) 0.6 Urea
3 46%N / O%P2O5 (Urea) 0.8 Urea
4 (NPK) DAP Diammonphosphat; 18/46/0 0.9 NPK
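Note that this pattern captures the first parenthesised token, so it returns NPK without the parentheses and cannot single out the DAP row requested in Edit 2. A short sketch of that behaviour on three of the sample names; `expand=False` is an addition here so that the result is a Series rather than a one-column DataFrame:

```python
import pandas as pd

names = pd.Series([
    '(NPK) 20/10/6',
    '46%N / O%P2O5 (Urea)',
    '(NPK) DAP Diammonphosphat; 18/46/0',
])

# Non-greedy capture of the text inside the first pair of parentheses.
short = names.str.extract(r'\((.+?)\)', expand=False)
```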
This works for me:
df['Short Name'] = df['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else '(NPK)')
>>> df
Substance value Short Name
0 (NPK) 20/10/6 0.2 (NPK)
1 (NPK) Guayacan 10/20/30 0.4 (NPK)
2 46%N / O%P2O5 (Urea) 0.6 Urea
3 46%N / O%P2O5 (Urea) 0.8 Urea
4 (NPK) DAP Diammonphosphat; 18/46/0 0.9 (NPK)
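The difference from the code in the question is that `for i in 'Urea'` iterates over the string one character at a time, so `any(...)` is true as soon as a single letter of "Urea" occurs in the name. A minimal sketch of the two tests on the mislabelled row (variable names are illustrative only):

```python
name = '(NPK) Guayacan 10/20/30'

# Iterating over a string yields its characters: 'U', 'r', 'e', 'a'.
matched_chars = [ch for ch in 'Urea' if ch in name]

# Character-wise test: true if any single letter occurs in the name.
char_test = any(ch in name for ch in 'Urea')

# Whole-substring test: true only if 'Urea' occurs as a unit.
substring_test = 'Urea' in name
```

Here only the letter 'a' (from "Guayacan") matches, which is enough for the character-wise test to succeed.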
With a regular expression:
import re
# Case-insensitive search for "urea" anywhere in the substance name
short = re.compile(r"urea", re.I)
df['Short Name'] = df['Substance'].apply(lambda x: 'Urea' if short.search(x) else '(NPK)')
Not the neatest solution, but at least it is a solution:
test['Short Name'] = test['Substance'].apply(lambda x: 'Urea' if 'Urea' in x else 'DAP' if 'DAP' in x else '(NPK)')
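If the chain of conditions keeps growing, one alternative is to keep the keywords in an ordered list and let `numpy.select` pick the first match. This is only a sketch: the `rules` list below is an assumption covering the three short names that appear in the question, not the full mapping the asker refers to.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Substance': ['(NPK) 20/10/6', '(NPK) Guayacan 10/20/30',
                  '46%N / O%P2O5 (Urea)', '46%N / O%P2O5 (Urea)',
                  '(NPK) DAP Diammonphosphat; 18/46/0'],
    'value': [0.2, 0.4, 0.6, 0.8, 0.9],
})

# Ordered (keyword, short name) pairs; earlier entries win if several match.
# Hypothetical rule set: extend it with the real keywords as needed.
rules = [('Urea', 'Urea'), ('DAP', 'DAP')]

# One boolean mask per rule, using plain substring matching (no regex).
conditions = [df['Substance'].str.contains(key, regex=False) for key, _ in rules]
choices = [name for _, name in rules]

# Rows matching no rule fall back to '(NPK)'.
df['Short Name'] = np.select(conditions, choices, default='(NPK)')
```

Adding a new substance then means adding one pair to `rules` instead of nesting another conditional expression.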