How can I check whether an integer value in a column of one DataFrame falls within a range defined by two columns of a second DataFrame?

Question

To explain the problem better:

I have 2 DataFrames:

DF1 (main):

    CodeRange                                             Sector Start   End
0   0100-0999                  Agriculture, Forestry and Fishing  0100  0999
1   1000-1499                                             Mining  1000  1499
2   1500-1799                                       Construction  1500  1799
3   1800-1999                                           not used  1800  1999
4   2000-3999                                      Manufacturing  2000  3999
5   4000-4999  Transportation, Communications, Electric, Gas ...  4000  4999
6   5000-5199                                    Wholesale Trade  5000  5199
7   5200-5999                                       Retail Trade  5200  5999
8   6000-6799                 Finance, Insurance and Real Estate  6000  6799
9   7000-8999                                           Services  7000  8999
10  9100-9729                              Public Administration  9100  9729
11  9900-9999                                    Nonclassifiable  9900  9999

and DF2:

    SICCode Sector
0   1230    Agro
1   4974    Utils
2   5120    shops
3   9997    Utils

In DF1, I was able to split the "CodeRange" column values into 2 columns ("Start" and "End") and convert them to int.
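The split described above might look like the following minimal sketch. It assumes "CodeRange" is a string column of the form "0100-0999"; the frame here is a shortened stand-in for DF1, not the asker's actual code.

```python
import pandas as pd

# Shortened stand-in for DF1 (assumption: CodeRange holds "start-end" strings).
df1 = pd.DataFrame({
    "CodeRange": ["0100-0999", "1000-1499", "1500-1799"],
    "Sector": ["Agriculture, Forestry and Fishing", "Mining", "Construction"],
})

# Split on the hyphen into two columns, then cast both to int.
# The cast drops leading zeros, so "0100" becomes 100.
df1[["Start", "End"]] = df1["CodeRange"].str.split("-", expand=True).astype(int)

print(df1[["CodeRange", "Start", "End"]])
```

Having both bounds as integers is what makes the range comparisons in the answers below possible.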

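Essentially, I want to find which range each SICCode in DF2 falls into, and update the "Sector" value in DF2 to the corresponding value from DF1 (the "Division" column, which appears as "Sector" in the table above).

The final DF2 should look like this:

DF2: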

    SICCode Sector
0   1230    Agriculture, Forestry and Fishing
1   4974    Transportation, Communication...
2   5120    Wholesale Trade
3   9997    Non-classifiable

Answer 1

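A more compact solution, without loops.

The key is to create a helper key, "start_idx", by integer-dividing the numbers by 1000, which lets us merge the two frames. We then check whether SICCode lies within the range, and where it does not, we set the "Divison" column to blank.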

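import numpy as np

# df is DF1 (the output below suggests it also carried an index-like column x); bucket both frames by thousands.
df3= df.assign(start_idx=(df['Start']//1000).astype(int)).merge(
    df2.assign(start_idx=(df2['SICCode']//1000).astype(int)), on='start_idx', how='left')
# Keep DF2's label only where SICCode lies in [Start, End] (both ends inclusive); otherwise blank.
df3['Divison']=np.where( (df3['SICCode']>= df3['Start']) &
                       (  df3['SICCode']<=df3['End']  ), df3['Sector_y'], "")
df3.drop(columns=['start_idx','x_y','SICCode','Sector_y'])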

    x_x     CodeRange   Sector_x                        Start   End     Divison
0   0   0100-0999   Agriculture, Forestry and Fishing   100     999     
1   1   1000-1499   Mining                              1000    1499    Agro
2   2   1500-1799   Construction                        1500    1799    
3   3   1800-1999   not used                            1800    1999    
4   4   2000-3999   Manufacturing                       2000    3999    
5   5   4000-4999   Transportation, Communications, Electric, Gas ...   4000    4999    Utils
6   6   5000-5199   Wholesale Trade                      5000   5199    Shops
7   7   5200-5999   Retail Trade                        5200    5999    
8   8   6000-6799   Finance, Insurance and Real Estate  6000    6799    
9   9   7000-8999   Services                            7000    8999    
10  10  9100-9729   Public Administration               9100    9729    
11  11  9900-9999   Nonclassifiable                     9900    9999    Utils
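Two caveats about the thousands-bucket merge, as far as I can tell: it attaches DF2's labels to DF1 rather than updating DF2, and a code such as 3500 would get start_idx 3 while its range 2000-3999 has start_idx 2, so it would never be matched. One alternative, offered here as a sketch rather than as the answerer's method, is to build a `pd.IntervalIndex` from "Start" and "End" and look each code up in it. The frames below are shortened stand-ins ("Transportation" is an abbreviated label), and the extra codes 3500 and 1600 are made up for illustration. Note that 1230 lands in Mining according to the ranges as listed in DF1.

```python
import numpy as np
import pandas as pd

# Shortened stand-ins for DF1 and DF2 (illustrative data only).
df1 = pd.DataFrame({
    "Sector": ["Agriculture, Forestry and Fishing", "Mining", "Manufacturing",
               "Transportation", "Wholesale Trade", "Nonclassifiable"],
    "Start": [100, 1000, 2000, 4000, 5000, 9900],
    "End": [999, 1499, 3999, 4999, 5199, 9999],
})
df2 = pd.DataFrame({"SICCode": [1230, 4974, 5120, 9997, 3500, 1600]})

# Closed on both sides so that Start and End themselves count as inside.
# get_indexer requires the intervals not to overlap.
intervals = pd.IntervalIndex.from_arrays(df1["Start"], df1["End"], closed="both")

# Position of the matching DF1 row for each code, or -1 when no range contains it.
pos = intervals.get_indexer(df2["SICCode"].to_numpy())

sectors = df1["Sector"].to_numpy()
# sectors[pos] would wrap around for -1, so mask unmatched codes explicitly.
df2["Sector"] = np.where(pos >= 0, sectors[pos], None)

print(df2)
```

Because the lookup works on the actual interval bounds, ranges that span several thousands are handled the same way as narrow ones.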

Answer 2

I think you could definitely optimize my solution using masks, but you can achieve it this way:

data = []
for i in range(len(df2)):
    code = df2["SICCode"].iloc[i]
    for j in range(len(df1)):
        start = df1["Start"].iloc[j]
        end = df1["End"].iloc[j]
        if start <= code <= end:
            data.append(df1["Sector"].iloc[j])
            break  # stop scanning ranges and move to the next i
    else:
        # no range matched: append a placeholder so data stays aligned with df2
        data.append(None)

df2["Sector"] = data
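The mask-based optimization mentioned above could look roughly like this. It is a sketch under assumptions, not the answerer's code: it compares every code against every range at once with NumPy broadcasting, uses shortened stand-in frames, and includes a made-up code 9800 that falls in no range.

```python
import numpy as np
import pandas as pd

# Shortened stand-ins for DF1 and DF2 (illustrative data only).
df1 = pd.DataFrame({
    "Sector": ["Mining", "Transportation", "Wholesale Trade", "Nonclassifiable"],
    "Start": [1000, 4000, 5000, 9900],
    "End": [1499, 4999, 5199, 9999],
})
df2 = pd.DataFrame({"SICCode": [1230, 4974, 5120, 9997, 9800]})

# Column vector of codes, shape (n_codes, 1), broadcast against the range bounds.
codes = df2["SICCode"].to_numpy()[:, None]
mask = (codes >= df1["Start"].to_numpy()) & (codes <= df1["End"].to_numpy())

# mask[i, j] is True when code i lies inside range j.
has_match = mask.any(axis=1)
first_match = mask.argmax(axis=1)  # index of the first True per row (0 if none)

sectors = df1["Sector"].to_numpy()
# Rows without any match get None instead of the misleading argmax default.
df2["Sector"] = np.where(has_match, sectors[first_match], None)

print(df2)
```

The mask needs memory proportional to len(df2) × len(df1), which is negligible for a dozen ranges but worth keeping in mind for large lookup tables.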

Solution 1
1 2022-06-22 17:17:12

Solution 2
0 2022-06-22 16:37:42
