Pandas: merge dataframes without creating new columns inside a for operation
I am trying to enrich a dataframe with data collected from an API. So I am doing this:
for i in df.index:
    if pd.isnull(df.cnpj[i]) == True:
        pass
    else:
        k=get_financials_hnwi(df.cnpj[i]) # this is my API requesting function, working fine
        df=df.merge(k,on=["cnpj"],how="left") # here is my problem <-------------------------------
Because I run that merge inside a for statement, it produces suffixed columns (_x, _y). So I found this alternative:
for i in df.index:
    if pd.isnull(df.cnpj[i]) == True:
        pass
    else:
        k=get_financials_hnwi(df.cnpj[i]) # this is my requesting function, working fine
        val = np.intersect1d(df.cnpj, k.cnpj)
        df_temp = pd.concat([df,k], ignore_index=True)
        df=df_temp[df_temp.cnpj.isin(val)]
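However, it creates a new df and kills the original index, so the line if pd.isnull(df.cnpj[i]) == True: can no longer run.
Is there a good way to run a merge/join/concat inside a for operation without creating new _x and _y columns? Or is there a way to blend the _x and _y columns afterwards, get rid of them, and condense them into a single column? I just want one column with everything in it.
Sample data and reproducible code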
df=pd.DataFrame({'cnpj':[12,32,54,65],'co_name':['Johns Market','T Bone Gril','Superstore','XYZ Tech']})
#first API request:
k=pd.DataFrame({'cnpj':[12],'average_revenues':[687],'years':['2019,2018,2017']})
df=df.merge(k,on="cnpj", how='left')
#second API request:
k=pd.DataFrame({'cnpj':[32],'average_revenues':[456],'years':['2019,2017']})
df=df.merge(k,on="cnpj", how='left')
#third API request:
k=pd.DataFrame({'cnpj':[53],'average_revenues':[None],'years':[None]})
df=df.merge(k,on="cnpj", how='left')
#fourth API request:
k=pd.DataFrame({'cnpj':[65],'average_revenues':[4142],'years':['2019,2018,2015,2013,2012']})
df=df.merge(k,on="cnpj", how='left')
print(df)
Result:
cnpj co_name average_revenues_x years_x average_revenues_y \
0 12 Johns Market 687.0 2019,2018,2017 NaN
1 32 T Bone Gril NaN NaN 456.0
2 54 Superstore NaN NaN NaN
3 65 XYZ Tech NaN NaN NaN
years_y average_revenues_x years_x average_revenues_y \
0 NaN None None NaN
1 2019,2017 None None NaN
2 NaN None None NaN
3 NaN None None 4142.0
years_y
0 NaN
1 NaN
2 NaN
3 2019,2018,2015,2013,2012
Desired result:
cnpj co_name average_revenues years
0 12 Johns Market 687.0 2019,2018,2017
1 32 T Bone Gril 456.0 2019,2017
2 54 Superstore None None
3 65 XYZ Tech 4142.0 2019,2018,2015,2013,2012
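Since you are joining on a single column and mapping values, we can take advantage of the cnpj column and set it as the index. Then we can use combine_first, update, or map to add your values to your dataframe.
Assume k looks like this. If it does not, just update your function to return a dictionary that you can use with map.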
cnpj average_revenues years
0 12 687 2019,2018,2017
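As a rough illustration of the map option, here is a minimal sketch. It assumes the API function is changed to return a plain dictionary keyed by cnpj; the revenues dictionary below is a hypothetical stand-in for that return value, built from the sample data in the question.

```python
import pandas as pd

df = pd.DataFrame({
    "cnpj": [12, 32, 54, 65],
    "co_name": ["Johns Market", "T Bone Gril", "Superstore", "XYZ Tech"],
})

# Hypothetical shape: the API results collected as {cnpj: average_revenues}.
revenues = {12: 687, 32: 456, 65: 4142}

# Series.map looks up each cnpj in the dictionary; keys that are missing
# from the dictionary (54 here) become NaN instead of raising an error.
df["average_revenues"] = df["cnpj"].map(revenues)
print(df)
```

Because map writes into one named column, no _x or _y suffixes can appear. The trade-off is one dictionary (and one map call) per column you want to add.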
Let's put the combine_first approach in a neat function.
def update_api_call(dataframe,api_call):
    if dataframe.index.name == 'cnpj':
        pass
    else:
        dataframe = dataframe.set_index('cnpj')
    return dataframe.combine_first(
        api_call.set_index('cnpj')
    )
Assume your k variables are numbered 1 to 4 in our test (k1, k2, k3, k4).
df1 = update_api_call(df,k1)
print(df1)
average_revenues co_name years
cnpj
12 687.0 Johns Market 2019,2018,2017
32 NaN T Bone Gril NaN
54 NaN Superstore NaN
65 NaN XYZ Tech NaN
df2 = update_api_call(df1,k2)
print(df2)
average_revenues co_name years
cnpj
12 687.0 Johns Market 2019,2018,2017
32 456.0 T Bone Gril 2019,2017
54 NaN Superstore NaN
65 NaN XYZ Tech NaN
df3 = update_api_call(df2,k3)
df4 = update_api_call(df3,k4)
print(df4)
average_revenues co_name years
cnpj
12 687.0 Johns Market 2019,2018,2017
32 456.0 T Bone Gril 2019,2017
53 NaN NaN NaN
54 NaN Superstore NaN
65 4142.0 XYZ Tech 2019,2018,2015,2013,2012
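Note that combine_first keeps the union of both indexes (hence the extra row for cnpj 53) and returns the columns in sorted order. The following is a sketch, not a definitive implementation, of how the function could be driven from a loop and then trimmed back toward the layout asked for in the question. The responses list is a stand-in for the values returned by get_financials_hnwi, and the names responses, original_keys, and result are illustrative.

```python
import pandas as pd

def update_api_call(dataframe, api_call):
    """Fill gaps in `dataframe` with values from `api_call`, aligned on cnpj."""
    if dataframe.index.name != "cnpj":
        dataframe = dataframe.set_index("cnpj")
    return dataframe.combine_first(api_call.set_index("cnpj"))

df = pd.DataFrame({
    "cnpj": [12, 32, 54, 65],
    "co_name": ["Johns Market", "T Bone Gril", "Superstore", "XYZ Tech"],
})

# Stand-ins for the four API responses from the question (k1 to k4).
responses = [
    pd.DataFrame({"cnpj": [12], "average_revenues": [687], "years": ["2019,2018,2017"]}),
    pd.DataFrame({"cnpj": [32], "average_revenues": [456], "years": ["2019,2017"]}),
    pd.DataFrame({"cnpj": [53], "average_revenues": [None], "years": [None]}),
    pd.DataFrame({"cnpj": [65], "average_revenues": [4142], "years": ["2019,2018,2015,2013,2012"]}),
]

# Remember which keys the original frame had, in their original order.
original_keys = df["cnpj"].tolist()

result = df
for k in responses:
    result = update_api_call(result, k)

# Keep only the original keys (drops cnpj 53), turn the index back into a
# column, and restore the column order that combine_first sorted.
result = result.loc[original_keys].reset_index()
result = result[["cnpj", "co_name", "average_revenues", "years"]]
print(result)
```

If the API can be called for all keys up front, collecting the responses with pd.concat and doing a single merge afterwards is another way to avoid the suffixes, since they only appear when the same column names are merged more than once.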