Python Dataframe: To get a column value from 2nd dataframe based on a column in the 1st dataframe is in between two columns in the 2nd dataframe
How do I check for an integer value in a column in 1 dataframe to exist in a range split between 2 columns in 2nd dataframe?
To explain the problem better:
I have 2 dataframes:
DF1 (main):
CodeRange Sector Start End
0 0100-0999 Agriculture, Forestry and Fishing 0100 0999
1 1000-1499 Mining 1000 1499
2 1500-1799 Construction 1500 1799
3 1800-1999 not used 1800 1999
4 2000-3999 Manufacturing 2000 3999
5 4000-4999 Transportation, Communications, Electric, Gas ... 4000 4999
6 5000-5199 Wholesale Trade 5000 5199
7 5200-5999 Retail Trade 5200 5999
8 6000-6799 Finance, Insurance and Real Estate 6000 6799
9 7000-8999 Services 7000 8999
10 9100-9729 Public Administration 9100 9729
11 9900-9999 Nonclassifiable 9900 9999
and DF2:
SICCode Sector
0 1230 Agro
1 4974 Utils
2 5120 shops
3 9997 Utils
In DF1, I was able to split the "CodeRange" column values into 2 columns ("Start" and "End") and convert them to int.
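For readers who want to reproduce that step, here is a minimal sketch of one way the split could be done. It assumes "CodeRange" always has the form "start-end" with a single hyphen; the intermediate name `parts` is illustrative, and only two sample rows are used.

```python
import pandas as pd

# Two sample rows in the same shape as DF1 (assumed "start-end" strings).
df1 = pd.DataFrame({
    "CodeRange": ["0100-0999", "1000-1499"],
    "Sector": ["Agriculture, Forestry and Fishing", "Mining"],
})

# Split on the hyphen into two columns, then convert both to int.
# Leading zeros are dropped by the conversion: "0100" becomes 100.
parts = df1["CodeRange"].str.split("-", expand=True).astype(int)
df1["Start"] = parts[0]
df1["End"] = parts[1]
```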
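I basically want to check which range each SICCode in DF2 falls into, and update the "Sector" value in DF2 with the corresponding value under the "Division" column in DF1 (the column shown as "Sector" in the DF1 listing above).
The final DF2 should look like this:
DF2: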
SICCode Sector
0 1230 Agriculture, Forestry and Fishing
1 4974 Transportation, Communication...
2 5120 Wholesale Trade
3 9997 Non-classifiable
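A more compact solution, without loops
The key is to create an index "start_idx" by integer-dividing the numbers by 1000, which gives us something to merge on. After the merge, we check whether SICCode is within the range, and where it is not, we leave the "Divison" value blank.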
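import numpy as np
# Bucket both frames by thousands so the merge only pairs rows in the same bucket
df3 = df1.assign(start_idx=(df1['Start'] // 1000).astype(int)).merge(
    df2.assign(start_idx=(df2['SICCode'] // 1000).astype(int)), on='start_idx', how='left')
# Keep the DF2 label only where SICCode lies inside [Start, End], both ends inclusive
df3['Divison'] = np.where((df3['SICCode'] >= df3['Start']) &
                          (df3['SICCode'] <= df3['End']), df3['Sector_y'], "")
# 'x_y' is present only if both frames carry an 'x' column, as the output below suggests
df3 = df3.drop(columns=['start_idx', 'x_y', 'SICCode', 'Sector_y'])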
x_x CodeRange Sector_x Start End Divison
0 0 0100-0999 Agriculture, Forestry and Fishing 100 999
1 1 1000-1499 Mining 1000 1499 Agro
2 2 1500-1799 Construction 1500 1799
3 3 1800-1999 not used 1800 1999
4 4 2000-3999 Manufacturing 2000 3999
5 5 4000-4999 Transportation, Communications, Electric, Gas ... 4000 4999 Utils
6 6 5000-5199 Wholesale Trade 5000 5199 Shops
7 7 5200-5999 Retail Trade 5200 5999
8 8 6000-6799 Finance, Insurance and Real Estate 6000 6799
9 9 7000-8999 Services 7000 8999
10 10 9100-9729 Public Administration 9100 9729
11 11 9900-9999 Nonclassifiable 9900 9999 Utils
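One caveat with the bucket merge above: a code is paired with a range only when both share the same thousands digit, so a range that spans several thousands (2000-3999 or 7000-8999) would miss a code such as 3500. The output is also DF1 labelled with DF2's sectors, while the question asks for DF2 labelled with DF1's. As a hedged alternative, not the answerer's method, the sketch below does the lookup in the requested direction with `pd.IntervalIndex`. It assumes the ranges do not overlap; the names `intervals`, `pos` and `sectors` are illustrative. By the ranges in DF1, code 1230 falls in 1000-1499, so it resolves to Mining.

```python
import pandas as pd

# DF1 ranges as listed in the question (the truncated sector name is kept as shown).
df1 = pd.DataFrame({
    "Sector": ["Agriculture, Forestry and Fishing", "Mining", "Construction",
               "not used", "Manufacturing",
               "Transportation, Communications, Electric, Gas ...",
               "Wholesale Trade", "Retail Trade",
               "Finance, Insurance and Real Estate", "Services",
               "Public Administration", "Nonclassifiable"],
    "Start": [100, 1000, 1500, 1800, 2000, 4000, 5000, 5200, 6000, 7000, 9100, 9900],
    "End":   [999, 1499, 1799, 1999, 3999, 4999, 5199, 5999, 6799, 8999, 9729, 9999],
})
df2 = pd.DataFrame({"SICCode": [1230, 4974, 5120, 9997],
                    "Sector": ["Agro", "Utils", "shops", "Utils"]})

# Closed on both sides so that Start and End themselves count as inside the range.
intervals = pd.IntervalIndex.from_arrays(df1["Start"], df1["End"], closed="both")

# Position of the containing interval for each code, or -1 if none contains it.
pos = intervals.get_indexer(df2["SICCode"].to_numpy())

sectors = df1["Sector"].tolist()
df2["Sector"] = [sectors[p] if p >= 0 else None for p in pos]
print(df2)
```

Codes that fall into a gap between ranges (for example 6900) get `None` rather than a wrong label.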
I think you could definitely optimize my solution by using a mask, but you can do it this way:
data = []
for i in range(len(df2)):
    code = df2["SICCode"].iloc[i]
    for j in range(len(df1)):
        start = df1["Start"].iloc[j]
        end = df1["End"].iloc[j]
        if start <= code <= end:
            data.append(df1["Sector"].iloc[j])
            break  # match found, move on to the next i
    else:
        # no range contains this code; append a placeholder so data stays aligned with df2
        data.append(None)
df2["Sector"] = data
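The mask-based optimization mentioned above could look roughly like the following sketch. It is one possible reading of "using a mask", not code from the answer: it compares every code against every range at once with NumPy broadcasting. The names `codes`, `mask`, `hit` and `first` are illustrative, and the ranges are assumed not to overlap.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "Sector": ["Agriculture, Forestry and Fishing", "Mining", "Construction",
               "not used", "Manufacturing",
               "Transportation, Communications, Electric, Gas ...",
               "Wholesale Trade", "Retail Trade",
               "Finance, Insurance and Real Estate", "Services",
               "Public Administration", "Nonclassifiable"],
    "Start": [100, 1000, 1500, 1800, 2000, 4000, 5000, 5200, 6000, 7000, 9100, 9900],
    "End":   [999, 1499, 1799, 1999, 3999, 4999, 5199, 5999, 6799, 8999, 9729, 9999],
})
df2 = pd.DataFrame({"SICCode": [1230, 4974, 5120, 9997],
                    "Sector": ["Agro", "Utils", "shops", "Utils"]})

# Column vector of codes, shape (n_codes, 1), broadcast against the range bounds.
codes = df2["SICCode"].to_numpy()[:, None]
# mask[i, j] is True when code i lies inside range j (both ends inclusive).
mask = (codes >= df1["Start"].to_numpy()) & (codes <= df1["End"].to_numpy())

hit = mask.any(axis=1)       # does any range contain the code?
first = mask.argmax(axis=1)  # index of the first matching range (0 when there is none)

labels = df1["Sector"].to_numpy(dtype=object)
# Use the matched label where there is a hit, otherwise None.
df2["Sector"] = np.where(hit, labels[first], None)
```

The mask has one entry per (code, range) pair, so memory grows with the product of the two lengths. That is fine for a dozen ranges but worth keeping in mind for large lookup tables.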