
How do I check whether an integer value in a column of one dataframe falls within a range split across two columns of a second dataframe?

To explain the question better:

I have 2 dataframes:

DF1 (main):

    CodeRange                                             Sector Start   End
0   0100-0999                  Agriculture, Forestry and Fishing  0100  0999
1   1000-1499                                             Mining  1000  1499
2   1500-1799                                       Construction  1500  1799
3   1800-1999                                           not used  1800  1999
4   2000-3999                                      Manufacturing  2000  3999
5   4000-4999  Transportation, Communications, Electric, Gas ...  4000  4999
6   5000-5199                                    Wholesale Trade  5000  5199
7   5200-5999                                       Retail Trade  5200  5999
8   6000-6799                 Finance, Insurance and Real Estate  6000  6799
9   7000-8999                                           Services  7000  8999
10  9100-9729                              Public Administration  9100  9729
11  9900-9999                                    Nonclassifiable  9900  9999

and DF2:

    SICCode Sector
0   1230    Agro
1   4974    Utils
2   5120    shops
3   9997    Utils

In DF1, I was able to split the "CodeRange" column values into 2 columns ("Start" and "End") and convert them to int.
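For reference, the split described above could look roughly like the following. This is only a sketch of one common way to do it, not necessarily what was used here; the three-row frame is a cut-down stand-in for DF1.

```python
import pandas as pd

# Cut-down stand-in for DF1, holding only the raw range strings.
df1 = pd.DataFrame({
    "CodeRange": ["0100-0999", "1000-1499", "9900-9999"],
    "Sector": ["Agriculture, Forestry and Fishing", "Mining", "Nonclassifiable"],
})

# Split "0100-0999" on the dash into two string columns, then cast both to int.
# The cast drops leading zeros, so "0100" becomes 100.
df1[["Start", "End"]] = df1["CodeRange"].str.split("-", expand=True).astype(int)
print(df1)
```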

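I basically want to find which range each SICCode in DF2 falls in, and update the "Sector" value in DF2 to the corresponding division name from DF1 (the "Sector" column shown above).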

The final DF2 should look like this:

DF2:

    SICCode Sector
0   1230    Agriculture, Forestry and Fishing
1   4974    Transportation, Communication...
2   5120    Wholesale Trade
3   9997    Non-classifiable

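A more compact solution, without loops.

The key is to create a helper key "start_idx" by integer-dividing the numbers by 1000, which gives us something to merge on. After the merge, we check whether SICCode lies within the range, and where it does not we set the division to blank.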

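import numpy as np

# df is DF1 and df2 is DF2. Bucket both by thousands to get a shared merge key.
df3 = df.assign(start_idx=(df['Start'] // 1000).astype(int)).merge(
    df2.assign(start_idx=(df2['SICCode'] // 1000).astype(int)), on='start_idx', how='left')
# Keep DF2's label only where SICCode lies inside [Start, End], both ends inclusive.
df3['Divison'] = np.where((df3['SICCode'] >= df3['Start']) &
                          (df3['SICCode'] <= df3['End']), df3['Sector_y'], "")
# 'x_y' is apparently the suffixed copy of an extra 'x' column present in both frames here.
df3.drop(columns=['start_idx', 'x_y', 'SICCode', 'Sector_y'])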

    x_x  CodeRange  Sector_x                                           Start  End   Divison
0   0    0100-0999  Agriculture, Forestry and Fishing                  100    999
1   1    1000-1499  Mining                                             1000   1499  Agro
2   2    1500-1799  Construction                                       1500   1799
3   3    1800-1999  not used                                           1800   1999
4   4    2000-3999  Manufacturing                                      2000   3999
5   5    4000-4999  Transportation, Communications, Electric, Gas ...  4000   4999  Utils
6   6    5000-5199  Wholesale Trade                                    5000   5199  Shops
7   7    5200-5999  Retail Trade                                       5200   5999
8   8    6000-6799  Finance, Insurance and Real Estate                 6000   6799
9   9    7000-8999  Services                                           7000   8999
10  10   9100-9729  Public Administration                              9100   9729
11  11   9900-9999  Nonclassifiable                                    9900   9999  Utils
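The result above is shaped like DF1, whereas the question asks for DF2 to be updated. Bucketing by thousands also only pairs a code with ranges that start in the same thousand, so a code such as 3500 would never meet the 2000-3999 row. A possible variant that avoids both issues is sketched below. It swaps the start_idx merge for pandas.merge_asof, which is a different technique from the one in this answer, and it assumes the ranges do not overlap.

```python
import pandas as pd

# DF1 and DF2 as given in the question (CodeRange omitted for brevity).
df1 = pd.DataFrame({
    "Sector": [
        "Agriculture, Forestry and Fishing", "Mining", "Construction", "not used",
        "Manufacturing", "Transportation, Communications, Electric, Gas ...",
        "Wholesale Trade", "Retail Trade", "Finance, Insurance and Real Estate",
        "Services", "Public Administration", "Nonclassifiable",
    ],
    "Start": [100, 1000, 1500, 1800, 2000, 4000, 5000, 5200, 6000, 7000, 9100, 9900],
    "End": [999, 1499, 1799, 1999, 3999, 4999, 5199, 5999, 6799, 8999, 9729, 9999],
})
df2 = pd.DataFrame({
    "SICCode": [1230, 4974, 5120, 9997],
    "Sector": ["Agro", "Utils", "shops", "Utils"],
})

# merge_asof needs both sides sorted on their keys. For each SICCode it picks
# the last DF1 row whose Start is <= SICCode (direction defaults to "backward").
# Note that the result comes back in sorted order with a fresh index.
merged = pd.merge_asof(
    df2[["SICCode"]].sort_values("SICCode"),
    df1[["Start", "End", "Sector"]].sort_values("Start"),
    left_on="SICCode",
    right_on="Start",
)

# A code past the matched End sits in a gap between ranges: blank it out (NaN).
merged["Sector"] = merged["Sector"].where(merged["SICCode"] <= merged["End"])

result = merged[["SICCode", "Sector"]]
print(result)
```

With the ranges exactly as listed in DF1, 1230 lands in 1000-1499 (Mining), which matches the row where "Agro" appears in the output above.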

You could definitely optimize my solution using masks, I think, but you can achieve it this way:

data = []
for i in range(len(df2)):
    code = df2["SICCode"].iloc[i]
    for j in range(len(df1)):
        start = df1["Start"].iloc[j]
        end = df1["End"].iloc[j]
        if start <= code <= end:
            data.append(df1["Sector"].iloc[j])
            break  # range found, move on to the next i
    else:
        # No range contains this code; append a placeholder so that
        # data stays the same length as df2.
        data.append(None)

df2["Sector"] = data
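The mask-based optimization mentioned above might look something like this. It is a minimal sketch using NumPy broadcasting, assuming the ranges do not overlap; the DF1 here is a subset of the question's rows, just enough to cover DF2's codes.

```python
import numpy as np
import pandas as pd

# Subset of DF1 from the question (only some of the rows).
df1 = pd.DataFrame({
    "Sector": ["Mining", "Transportation, Communications, Electric, Gas ...",
               "Wholesale Trade", "Retail Trade", "Nonclassifiable"],
    "Start": [1000, 4000, 5000, 5200, 9900],
    "End": [1499, 4999, 5199, 5999, 9999],
})
df2 = pd.DataFrame({
    "SICCode": [1230, 4974, 5120, 9997],
    "Sector": ["Agro", "Utils", "shops", "Utils"],
})

codes = df2["SICCode"].to_numpy()[:, None]  # shape (len(df2), 1)
starts = df1["Start"].to_numpy()            # shape (len(df1),)
ends = df1["End"].to_numpy()

# mask[i, j] is True when code i lies inside range j (both ends inclusive).
mask = (codes >= starts) & (codes <= ends)
found = mask.any(axis=1)   # False where no range contains the code
pos = mask.argmax(axis=1)  # position of the first matching range (0 if none)

# Look up the sector per code; codes without a matching range get None.
sectors = df1["Sector"].to_numpy()
df2["Sector"] = np.where(found, sectors[pos], None)
print(df2)
```

The mask has len(df2) × len(df1) entries, so this trades memory for speed and suits a small lookup table like DF1.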
