Create pandas data frame column based on strings from two other columns
I have a data frame that looks like this:
boat_type boat_type_2
Not Known Not Known
Not Known kayak
ship Not Known
Not Known Not Known
ship Not Known
I want to create a third column, boat_type_final, which should look like this:
boat_type boat_type_2 boat_type_final
Not Known Not Known cruise
Not Known kayak kayak
ship Not Known ship
Not Known Not Known cruise
ship Not Known ship
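So basically, if both boat_type and boat_type_2 contain 'Not Known', the value should be 'cruise'. But if either of the first two columns holds a string other than 'Not Known', then boat_type_final should be filled with that string, i.e. 'kayak' or 'ship'.
What is the most elegant way to do this? I have seen several options, such as where, writing a function, and/or boolean logic, and I would like to know what a true pythonista would do.
Here is my code so far: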
import pandas as pd
import numpy as np
data = [{'boat_type': 'Not Known', 'boat_type_2': 'Not Known'},
{'boat_type': 'Not Known', 'boat_type_2': 'kayak'},
{'boat_type': 'ship', 'boat_type_2': 'Not Known'},
{'boat_type': 'Not Known', 'boat_type_2': 'Not Known'},
{'boat_type': 'ship', 'boat_type_2': 'Not Known'}]
df = pd.DataFrame(data)
df['boat_type_final'] = np.where(df.boat_type.str.contains('Not'))...
Use:
df['boat_type_final'] = (df.replace('Not Known',np.nan)
.ffill(axis=1)
.iloc[:, -1]
.fillna('cruise'))
print (df)
boat_type boat_type_2 boat_type_final
0 Not Known Not Known cruise
1 Not Known kayak kayak
2 ship Not Known ship
3 Not Known Not Known cruise
4 ship Not Known ship
Explanation:
First, use replace to turn Not Known into missing values:
print (df.replace('Not Known',np.nan))
boat_type boat_type_2
0 NaN NaN
1 NaN kayak
2 ship NaN
3 NaN NaN
4 ship NaN
Then replace the NaN values by forward filling along each row:
print (df.replace('Not Known',np.nan).ffill(axis=1))
boat_type boat_type_2
0 NaN NaN
1 NaN kayak
2 ship ship
3 NaN NaN
4 ship ship
Select the last column by position with iloc:
print (df.replace('Not Known',np.nan).ffill(axis=1).iloc[:, -1])
0 NaN
1 kayak
2 ship
3 NaN
4 ship
Name: boat_type_2, dtype: object
Finally, because some rows may still be NaN, add fillna:
print (df.replace('Not Known',np.nan).ffill(axis=1).iloc[:, -1].fillna('cruise'))
0 cruise
1 kayak
2 ship
3 cruise
4 ship
Name: boat_type_2, dtype: object
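Note that ffill(axis=1) followed by iloc[:, -1] returns the last known value in each row, so a row with real values in both columns would get the value from boat_type_2. Below is a minimal sketch of a variant that prefers the first column instead, by back filling and taking the first column. The helper name first_known and its parameters are hypothetical, and the fourth sample row (both columns known) is added purely for illustration.

```python
import pandas as pd

def first_known(df, cols, placeholder="Not Known", default="cruise"):
    """Return, per row, the first value in `cols` that is not `placeholder`.

    Hypothetical helper: falls back to `default` when every column
    in the row holds the placeholder.
    """
    # Turn the placeholder into NaN so the fill methods can skip it
    known = df[cols].mask(df[cols].eq(placeholder))
    # Back filling along each row pulls the first known value into column 0
    return known.bfill(axis=1).iloc[:, 0].fillna(default)

df = pd.DataFrame({
    "boat_type":   ["Not Known", "Not Known", "ship", "ship"],
    "boat_type_2": ["Not Known", "kayak", "Not Known", "kayak"],
})
df["boat_type_final"] = first_known(df, ["boat_type", "boat_type_2"])
print(df)
```

For the last row this yields the value from boat_type, whereas the forward-fill chain would yield the value from boat_type_2.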
Another solution, if there are only a few columns, is to use numpy.select:
m1 = df['boat_type'] == 'ship'
m2 = df['boat_type_2'] == 'kayak'
df['boat_type_final'] = np.select([m1, m2], ['ship','kayak'], default='cruise')
print (df)
boat_type boat_type_2 boat_type_final
0 Not Known Not Known cruise
1 Not Known kayak kayak
2 ship Not Known ship
3 Not Known Not Known cruise
4 ship Not Known ship
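The masks above hard-code 'ship' and 'kayak'. If the columns can hold other boat names, one possible generalisation is to test against the placeholder and pass the columns themselves as the choices. This is a sketch under that assumption, not part of the original answer; the placeholder variable is introduced only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "boat_type":   ["Not Known", "Not Known", "ship", "Not Known", "ship"],
    "boat_type_2": ["Not Known", "kayak", "Not Known", "Not Known", "Not Known"],
})

placeholder = "Not Known"  # assumed marker for a missing value

# Conditions are checked in order, so boat_type wins when both are known
conditions = [
    df["boat_type"] != placeholder,
    df["boat_type_2"] != placeholder,
]
# Each choice is the column itself, converted to a plain object array
choices = [
    df["boat_type"].to_numpy(dtype=object),
    df["boat_type_2"].to_numpy(dtype=object),
]

df["boat_type_final"] = np.select(conditions, choices, default="cruise")
print(df)
```

Because numpy.select uses the first condition that is true, the order of the two lists decides which column takes priority.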
Another solution is to define a function that holds the mapping:
def my_func(row):
    if row['boat_type']!='Not Known':
        return row['boat_type']
    elif row['boat_type_2']!='Not Known':
        return row['boat_type_2']
    else:
        return 'cruise'
[Note: you did not mention what should happen when neither column is 'Not Known'.]
Then simply apply the function:
df.loc[:,'boat_type_final'] = df.apply(my_func, axis=1)
print(df)
Output:
boat_type boat_type_2 boat_type_final
0 Not Known Not Known cruise
1 Not Known kayak kayak
2 ship Not Known ship
3 Not Known Not Known cruise
4 ship Not Known ship
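Since apply with axis=1 calls the function once per row, it can be slow on large frames. The same priority rule (boat_type first, then boat_type_2, then 'cruise') can also be sketched with Series.where, which the question mentions as an option. This is an illustrative alternative rather than the answerer's code, and the placeholder variable is an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    "boat_type":   ["Not Known", "Not Known", "ship", "Not Known", "ship"],
    "boat_type_2": ["Not Known", "kayak", "Not Known", "Not Known", "Not Known"],
})

placeholder = "Not Known"  # assumed marker for a missing value

# Keep boat_type where it is known, otherwise fall back to boat_type_2
final = df["boat_type"].where(df["boat_type"] != placeholder, df["boat_type_2"])
# Anything still equal to the placeholder was unknown in both columns
df["boat_type_final"] = final.replace(placeholder, "cruise")
print(df)
```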