Vlookup using Python when the data falls within ranges

Question

I have two Excel files. I want to perform a VLOOKUP and find the difference in costs, using Python or even Excel.

My files look like this:

source_data.xlsx contains the distance ranges and their prices. For example, a distance from 1 to 100 should be charged 4800, and a distance from 101 to 120 should be charged 5100.

DISTANCE     COST
1-100        4800
101-120      5100
121-140      5500
141-160      5900
161-180      6200
181-200      6600
210-220      6900
221-240      7200

Analysis.xlsx

loading_station  distance_travel  total_cost  status
PUGU             40               4000        PAID
PUGU             80               3200        PAID
MOROGORO         50               5000        PAID
MOROGORO         220              30400       PAID
DODOMA           150              5100        PAID
KIGOMA           90               2345        PAID
DODOMA           230              6000        PAID
DODOMA           180              16500       PAID
KIGOMA           32               3000        PAID
DODOMA           45               6000        PAID
DODOMA           65               5000        PAID
KIGOMA           77               1000        PAID
KIGOMA           90               4000        PAID

The actual cost for each distance is given in source_data.xlsx. I want to check whether each cost in Analysis.xlsx corresponds to the actual value, so that I can detect underpayments and overpayments.

The desired output should look like this, with two columns added: source_cost, which is looked up from source_data.xlsx as a VLOOKUP would do, and Difference, which is total_cost minus source_cost.

loading_station  distance_travel  total_cost  status  source_cost  Difference
PUGU             40               4000        PAID    4800         -800
PUGU             80               3200        PAID    4800         -1600
MOROGORO         50               5000        PAID    4800         200
MOROGORO         220              30400       PAID    6900         23500
DODOMA           150              5100        PAID    5900         -800
KIGOMA           90               2345        PAID    4800         -2455
DODOMA           230              6000        PAID    7200         -1200
DODOMA           180              16500       PAID    6200         10300
KIGOMA           32               3000        PAID    4800         -1800
DODOMA           45               6000        PAID    4800         1200
DODOMA           65               5000        PAID    4800         200
KIGOMA           77               1000        PAID    4800         -3800
KIGOMA           90               4000        PAID    4800         -800

My code so far:

# import pandas
import pandas as pd

# read excel data
source_data = pd.read_excel('source_data.xlsx')
analysis_file = pd.read_excel('analysis.xlsx')
source_data.head(5)
analysis_file.head(5)

Answer 1

You can use merge_asof:

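# keep only the upper bound of each "low-high" range as the lookup key
source_data["DISTANCE"] = source_data["DISTANCE"].str.split("-").str[1].astype("int64")
# merge_asof needs the left frame sorted on its key; the saved index restores the original row order
# direction="forward" picks the first upper bound that is >= distance_travel
res = (pd.merge_asof(analysis_file.reset_index().sort_values("distance_travel"),
                     source_data,
                     left_on="distance_travel",
                     right_on="DISTANCE",
                     direction="forward")
       .set_index("index")
       .sort_index())
res["Difference"] = res["total_cost"] - res["COST"]

print(res)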

      loading_station  distance_travel  total_cost status  DISTANCE  COST  Difference
index
0                PUGU               40        4000   PAID       100  4800        -800
1                PUGU               80        3200   PAID       100  4800       -1600
2            MOROGORO               50        5000   PAID       100  4800         200
3            MOROGORO              220       30400   PAID       220  6900       23500
4              DODOMA              150        5100   PAID       160  5900        -800
5              KIGOMA               90        2345   PAID       100  4800       -2455
6              DODOMA              230        6000   PAID       240  7200       -1200
7              DODOMA              180       16500   PAID       180  6200       10300
8              KIGOMA               32        3000   PAID       100  4800       -1800
9              DODOMA               45        6000   PAID       100  4800        1200
10             DODOMA               65        5000   PAID       100  4800         200
11             KIGOMA               77        1000   PAID       100  4800       -3800
12             KIGOMA               90        4000   PAID       100  4800        -800

Note that this does not handle a distance of 0: because only the upper bound of each range is used, a 0 would be matched to the 1-100 band. You need to handle that case separately.
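The edge case mentioned above can be handled by keeping both bounds of each range and checking them after the merge. The following is only a sketch under assumptions: the frames are built inline instead of being read from Excel, the rows with distances 0, 205 and 300 are made-up test values, and the `low`, `high` and `in_band` names are introduced here for illustration.

```python
import pandas as pd

# Inline stand-ins for the two Excel files (cost table copied from the question;
# the distances 0, 205 and 300 are hypothetical rows added to show the edge cases).
source_data = pd.DataFrame({
    "DISTANCE": ["1-100", "101-120", "121-140", "141-160",
                 "161-180", "181-200", "210-220", "221-240"],
    "COST": [4800, 5100, 5500, 5900, 6200, 6600, 6900, 7200],
})
analysis_file = pd.DataFrame({
    "distance_travel": [40, 220, 0, 205, 300],
    "total_cost": [4000, 30400, 1000, 7000, 9000],
})

# Split each "low-high" label into numeric lower and upper bounds.
bounds = source_data["DISTANCE"].str.split("-", expand=True).astype("int64")
source_data["low"], source_data["high"] = bounds[0], bounds[1]

# Same merge_asof lookup as above, keyed on the upper bound.
res = (pd.merge_asof(analysis_file.reset_index().sort_values("distance_travel"),
                     source_data.sort_values("high"),
                     left_on="distance_travel",
                     right_on="high",
                     direction="forward")
       .set_index("index")
       .sort_index())

# Keep the matched cost only when the distance really lies inside the matched band;
# everything else (0, gaps between bands, beyond the last band) becomes NaN.
in_band = res["distance_travel"].between(res["low"], res["high"])
res["source_cost"] = res["COST"].where(in_band)
res["Difference"] = res["total_cost"] - res["source_cost"]

print(res[["distance_travel", "total_cost", "source_cost", "Difference"]])
```

Rows that end up with NaN in source_cost are the ones that need a manual decision, for example a distance that falls in the 201-209 gap of the cost table.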

Answer 2

Since this is a binning problem, I suggest using cut() to assign each distance to its range and then look up the corresponding cost.

import pandas as pd

# read the Excel data (file names as in the question's code)
df_source = pd.read_excel('source_data.xlsx')
df_analysis = pd.read_excel('analysis.xlsx')

# create bins: split each "low-high" label into integer lower and upper bounds
bounds = df_source['DISTANCE'].str.split('-', expand=True).astype(int)
bins = pd.IntervalIndex.from_arrays(bounds[0], bounds[1], closed='both')

print(bins)
###
IntervalIndex([[1, 100], [101, 120], [121, 140], [141, 160], [161, 180], [181, 200], [210, 220], [221, 240]], dtype='interval[int64, both]')

As shown, the result is an IntervalIndex with dtype='interval[int64, both]', meaning both endpoints of each range are included.
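To see what closed='both' means in practice, here is a small sketch using only the first two ranges from the cost table. The `demo_bins` name is introduced here for illustration.

```python
import pandas as pd

# Only the first two ranges from the cost table, both endpoints included.
demo_bins = pd.IntervalIndex.from_arrays([1, 101], [100, 120], closed="both")

# 100 is the upper edge of the first range and belongs to it, not to the second.
print(100 in demo_bins[0])
print(100 in demo_bins[1])

# contains() checks one value against every interval at once.
print(demo_bins.contains(101))
```

With the default closed='right', the lower edge of each range (1, 101, ...) would be excluded, which is why 'both' is used for integer ranges like these.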

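# find corresponding values: cut() assigns each distance to its bin, map() translates the bin to its cost
# (astype(int) assumes every distance falls inside one of the bins)
df_analysis['source_cost'] = pd.cut(df_analysis['distance_travel'], bins=bins).map(dict(zip(bins, df_source['COST']))).astype(int)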

# calculation
df_analysis['Difference'] = df_analysis['total_cost'] - df_analysis['source_cost']

print(df_analysis)
###

loading_station  distance_travel  total_cost  status  source_cost  Difference
PUGU             40               4000        PAID    4800         -800
PUGU             80               3200        PAID    4800         -1600
MOROGORO         50               5000        PAID    4800         200
MOROGORO         220              30400       PAID    6900         23500
DODOMA           150              5100        PAID    5900         -800
KIGOMA           90               2345        PAID    4800         -2455
DODOMA           230              6000        PAID    7200         -1200
DODOMA           180              16500       PAID    6200         10300
KIGOMA           32               3000        PAID    4800         -1800
DODOMA           45               6000        PAID    4800         1200
DODOMA           65               5000        PAID    4800         200
KIGOMA           77               1000        PAID    4800         -3800
KIGOMA           90               4000        PAID    4800         -800
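
If some distances might fall outside every range, one possible variant is to look up the interval positions with IntervalIndex.get_indexer, which returns -1 when nothing matches, and leave those rows as NaN. This is a sketch under assumptions rather than part of the answer above: the frames are built inline, the distances 205 and 500 are made-up test values, and `pos` and `costs` are names introduced here.

```python
import numpy as np
import pandas as pd

# Inline stand-ins for the Excel files (cost table copied from the question;
# the distances 205 and 500 are hypothetical rows that match no range).
df_source = pd.DataFrame({
    "DISTANCE": ["1-100", "101-120", "121-140", "141-160",
                 "161-180", "181-200", "210-220", "221-240"],
    "COST": [4800, 5100, 5500, 5900, 6200, 6600, 6900, 7200],
})
df_analysis = pd.DataFrame({
    "distance_travel": [40, 150, 230, 205, 500],
    "total_cost": [4000, 5100, 6000, 7000, 9000],
})

# Build the closed intervals from the "low-high" labels.
bounds = df_source["DISTANCE"].str.split("-", expand=True).astype("int64")
bins = pd.IntervalIndex.from_arrays(bounds[0], bounds[1], closed="both")

# Position of the interval containing each distance; -1 means no range matched.
pos = bins.get_indexer(df_analysis["distance_travel"])

# Take the cost at each position and blank out the unmatched rows.
costs = df_source["COST"].to_numpy()
df_analysis["source_cost"] = np.where(pos >= 0, costs[pos], np.nan)
df_analysis["Difference"] = df_analysis["total_cost"] - df_analysis["source_cost"]

print(df_analysis)
```

The source_cost column becomes float here because it has to hold NaN for the unmatched rows.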
