Python Dataframe: To get a column value from 2nd dataframe based on a column in the 1st dataframe is in between two columns in the 2nd dataframe
How do I check for an integer value in a column in 1 dataframe to exist in a range split between 2 columns in 2nd dataframe?
To explain the problem better:
I have 2 dataframes:
DF1 (master):
    CodeRange  Sector                                              Start  End
0   0100-0999  Agriculture, Forestry and Fishing                   0100   0999
1   1000-1499  Mining                                              1000   1499
2   1500-1799  Construction                                        1500   1799
3   1800-1999  not used                                            1800   1999
4   2000-3999  Manufacturing                                       2000   3999
5   4000-4999  Transportation, Communications, Electric, Gas ...   4000   4999
6   5000-5199  Wholesale Trade                                     5000   5199
7   5200-5999  Retail Trade                                        5200   5999
8   6000-6799  Finance, Insurance and Real Estate                  6000   6799
9   7000-8999  Services                                            7000   8999
10  9100-9729  Public Administration                               9100   9729
11  9900-9999  Nonclassifiable                                     9900   9999
and DF2:
SICCode Sector
0 1230 Agro
1 4974 Utils
2 5120 shops
3 9997 Utils
In DF1, I was able to split the "CodeRange" column values into 2 columns ("Start" and "End") and convert them to int.
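For readers who want to reproduce that split, here is a minimal sketch. It assumes "CodeRange" always has the form "start-end" with a single hyphen, and it uses only the first three rows of DF1 for brevity; it is one possible way to do it, not necessarily how the asker did it.

```python
import pandas as pd

# First three rows of DF1, before "Start" and "End" exist
df1 = pd.DataFrame({
    "CodeRange": ["0100-0999", "1000-1499", "1500-1799"],
    "Sector": ["Agriculture, Forestry and Fishing", "Mining", "Construction"],
})

# Split "0100-0999" on the hyphen into two string columns, then cast to int.
# The cast drops leading zeros, so "0100" becomes 100.
df1[["Start", "End"]] = df1["CodeRange"].str.split("-", expand=True).astype(int)

print(df1[["CodeRange", "Start", "End"]])
```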
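I basically want to check which range each SICCode in DF2 falls into, and update the "Sector" value in DF2 to the corresponding value under the "Division" column in DF1 (the column printed as "Sector" in DF1 above).
The final DF2 should look like this:
DF2: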
SICCode Sector
0 1230 Agriculture, Forestry and Fishing
1 4974 Transportation, Communication...
2 5120 Wholesale Trade
3 9997 Non-classifiable
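A more compact solution, without loops
The key is to create a helper key, "start_idx", by integer-dividing the numbers by 1000 so that the two frames can be merged on it. After the merge, we check whether each SICCode lies within the range, and where it does not, we set Division to an empty string.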
import numpy as np

# df is DF1, df2 is DF2; bucket both by thousands so they can be merged
df3 = df.assign(start_idx=(df['Start'] // 1000).astype(int)).merge(
    df2.assign(start_idx=(df2['SICCode'] // 1000).astype(int)), on='start_idx', how='left')
# keep DF2's sector only where SICCode lies inside [Start, End], bounds included
df3['Division'] = np.where((df3['SICCode'] >= df3['Start']) &
                           (df3['SICCode'] <= df3['End']), df3['Sector_y'], "")
# 'x_y' apparently comes from an extra 'x' column in the answerer's frames; skipped if absent
df3.drop(columns=['start_idx', 'x_y', 'SICCode', 'Sector_y'], errors='ignore')
    x_x  CodeRange  Sector_x                                            Start  End   Division
0   0    0100-0999  Agriculture, Forestry and Fishing                   100    999
1   1    1000-1499  Mining                                              1000   1499  Agro
2   2    1500-1799  Construction                                        1500   1799
3   3    1800-1999  not used                                            1800   1999
4   4    2000-3999  Manufacturing                                       2000   3999
5   5    4000-4999  Transportation, Communications, Electric, Gas ...   4000   4999  Utils
6   6    5000-5199  Wholesale Trade                                     5000   5199  Shops
7   7    5200-5999  Retail Trade                                        5200   5999
8   8    6000-6799  Finance, Insurance and Real Estate                  6000   6799
9   9    7000-8999  Services                                            7000   8999
10  10   9100-9729  Public Administration                               9100   9729
11  11   9900-9999  Nonclassifiable                                     9900   9999  Utils
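Note that merging on the thousands bucket only pairs a code with ranges whose Start falls in the same thousand, so a range spanning several thousands (for example 2000-3999) would miss a code such as 3500. The result above also attaches DF2's labels to DF1 rather than updating DF2. As a hedged alternative, not part of the original answer, the sketch below looks up each SICCode with pandas' `IntervalIndex`, assuming the ranges do not overlap. The code 6900 and its "unknown" label are hypothetical, added only to show what happens when no range matches.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Sector": ["Agriculture, Forestry and Fishing", "Mining", "Construction",
               "not used", "Manufacturing",
               "Transportation, Communications, Electric, Gas ...",
               "Wholesale Trade", "Retail Trade",
               "Finance, Insurance and Real Estate", "Services",
               "Public Administration", "Nonclassifiable"],
    "Start": [100, 1000, 1500, 1800, 2000, 4000, 5000, 5200, 6000, 7000, 9100, 9900],
    "End": [999, 1499, 1799, 1999, 3999, 4999, 5199, 5999, 6799, 8999, 9729, 9999],
})
# 6900 is a hypothetical code that falls in the gap between 6799 and 7000
df2 = pd.DataFrame({"SICCode": [1230, 4974, 5120, 9997, 6900],
                    "Sector": ["Agro", "Utils", "shops", "Utils", "unknown"]})

# One closed interval [Start, End] per DF1 row
ranges = pd.IntervalIndex.from_arrays(df1["Start"], df1["End"], closed="both")

# Position of the interval containing each code, or -1 if none does
pos = ranges.get_indexer(df2["SICCode"].to_numpy())

sectors = df1["Sector"].tolist()
df2["Sector"] = [sectors[p] if p >= 0 else None for p in pos]

print(df2)
```

With the ranges as printed in DF1, code 1230 lands in 1000-1499, so this lookup labels it "Mining".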
I think you could definitely optimize my solution using masks, but you can achieve it this way:
data = []
for i in range(len(df2)):
    code = df2["SICCode"].iloc[i]
    for j in range(len(df1)):
        start = df1["Start"].iloc[j]
        end = df1["End"].iloc[j]
        if start <= code <= end:
            data.append(df1["Sector"].iloc[j])
            break  # stop scanning ranges and move to the next i
    else:
        data.append(None)  # no range contains this code
df2["Sector"] = data
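The mask-based optimization mentioned above could look roughly like the following sketch. It compares every code against every range at once through NumPy broadcasting. It uses a subset of DF1's rows for brevity, and the variable names (`codes`, `in_range`, `hit`, `first`) are illustrative only.

```python
import numpy as np
import pandas as pd

# Subset of DF1 (enough rows to cover the four codes in DF2)
df1 = pd.DataFrame({
    "Sector": ["Mining", "Transportation, Communications, Electric, Gas ...",
               "Wholesale Trade", "Retail Trade", "Nonclassifiable"],
    "Start": [1000, 4000, 5000, 5200, 9900],
    "End": [1499, 4999, 5199, 5999, 9999],
})
df2 = pd.DataFrame({"SICCode": [1230, 4974, 5120, 9997],
                    "Sector": ["Agro", "Utils", "shops", "Utils"]})

# Column vector of codes against row vectors of bounds gives a
# (len(df2), len(df1)) boolean matrix: in_range[i, j] is True when
# code i lies inside range j.
codes = df2["SICCode"].to_numpy()[:, None]
in_range = (codes >= df1["Start"].to_numpy()) & (codes <= df1["End"].to_numpy())

hit = in_range.any(axis=1)       # does any range contain the code?
first = in_range.argmax(axis=1)  # index of the first matching range

# Take the matching sector, or None where no range matched
sector_names = df1["Sector"].to_numpy(dtype=object)
df2["Sector"] = np.where(hit, sector_names[first], None)

print(df2)
```

The boolean matrix has one cell per (code, range) pair, so this trades memory for speed and suits moderately sized frames.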