[英]Vlookup using python when data given in range
I have two excel files, I want to perform vlookup and find difference of costs using python or even excel.我有两个 excel 文件,我想使用 python 甚至 excel 执行 vlookup 并找到成本差异。
My files look like this我的文件看起来像这样
source_data.xlsx contains contains distance covered and their price, example distance range from 1 to 100 should be charged 4800 and distance range from 101 to 120 should be charged 5100. source_data.xlsx包含所覆盖的距离及其价格,例如从 1 到 100 的距离范围应收费 4800,从 101 到 120 的距离范围应收费 5100。
DISTANCE COST
1-100 4800
101-120 5100
121-140 5500
141-160 5900
161-180 6200
181-200 6600
210-220 6900
221-240 7200
Analysis.xlsx分析.xlsx
loading_station distance_travel total_cost status
PUGU 40 4000 PAID
PUGU 80 3200 PAID
MOROGORO 50 5000 PAID
MOROGORO 220 30400 PAID
DODOMA 150 5100 PAID
KIGOMA 90 2345 PAID
DODOMA 230 6000 PAID
DODOMA 180 16500 PAID
KIGOMA 32 3000 PAID
DODOMA 45 6000 PAID
DODOMA 65 5000 PAID
KIGOMA 77 1000 PAID
KIGOMA 90 4000 PAID
Actual Cost for distance is given in source_data.xlsx
, I want to check cost in Analysis.xlsx
if it correspond to Actual value, I want to detect underpayment and overpayment.距离的实际成本在
source_data.xlsx
中给出,我想检查Analysis.xlsx
中的成本是否对应于实际值,我想检测支付不足和多付。
Desired Output should be like this, with two column added, source_cost
which is taken from source_xlsx
by using vlookup
and difference which is difference between total_cost
and source_cost
所需的 Output 应该是这样的,添加了两列,
source_cost
是使用vlookup
从source_xlsx
的,而差异是total_cost
和source_cost
之间的差异
loading_station distance_travel total_cost status source_cost Difference
PUGU 40 4000 PAID 4800 -800
PUGU 80 3200 PAID 4800 -1600
MOROGORO 50 5000 PAID 4800 200
MOROGORO 220 30400 PAID 6900 23500
DODOMA 150 5100 PAID 5900 -800
KIGOMA 90 2345 PAID 4800 -2455
DODOMA 230 6000 PAID 7200 -1200
DODOMA 180 16500 PAID 6200 10300
KIGOMA 32 3000 PAID 4800 -1800
DODOMA 45 6000 PAID 4800 1200
DODOMA 65 5000 PAID 4800 200
KIGOMA 77 1000 PAID 4800 -3800
KIGOMA 90 4000 PAID 4800 -800
My code so far到目前为止我的代码
# import pandas
import pandas as pd
# read excel data
source_data = pd.read_excel('source_data.xlsx')
analysis_file = pd.read_excel('analysis.xlsx')
source_data.head(5)
analysis_file.head(5)
You can use merge_asof
:您可以使用
merge_asof
:
source_data["DISTANCE"] = source_data["DISTANCE"].str.split("-").str[1].astype("int64")
res = (pd.merge_asof(analysis_file.reset_index().sort_values("distance_travel"),
source_data,
left_on="distance_travel",
right_on="DISTANCE",
direction="forward")
.set_index("index")
.sort_index())
res["Difference"] = res["total_cost"] - res["COST"]
print (res)
loading_station distance_travel total_cost status DISTANCE COST Difference
index
0 PUGU 40 4000 PAID 100 4800 -800
1 PUGU 80 3200 PAID 100 4800 -1600
2 MOROGORO 50 5000 PAID 100 4800 200
3 MOROGORO 220 30400 PAID 220 6900 23500
4 DODOMA 150 5100 PAID 160 5900 -800
5 KIGOMA 90 2345 PAID 100 4800 -2455
6 DODOMA 230 6000 PAID 240 7200 -1200
7 DODOMA 180 16500 PAID 180 6200 10300
8 KIGOMA 32 3000 PAID 100 4800 -1800
9 DODOMA 45 6000 PAID 100 4800 1200
10 DODOMA 65 5000 PAID 100 4800 200
11 KIGOMA 77 1000 PAID 100 4800 -3800
12 KIGOMA 90 4000 PAID 100 4800 -800
Note that this does not take care of 0 distance traveled.请注意,这不考虑 0 行驶距离。 You need to handle that separately.
您需要单独处理。
Since it is a categorical bins problem, I suggest utilizing cut()
and find the corresponding value.由于这是一个分类箱问题,我建议使用
cut()
并找到相应的值。
import pandas as pd
# create bins
bh = df_source['DISTANCE'].apply(lambda x: x.split('-')).apply(pd.Series).astype(int).values[:,0]
bt = df_source['DISTANCE'].apply(lambda x: x.split('-')).apply(pd.Series).astype(int).values[:,1]
bins = pd.IntervalIndex.from_arrays(bh, bt, closed='both')
print(bins)
###
IntervalIndex([[1, 100], [101, 120], [121, 140], [141, 160], [161, 180], [181, 200], [210, 220], [221, 240]], dtype='interval[int64, both]')
As it shown, IntervalIndex
, dtype='interval[int64, both]'
如图所示,
IntervalIndex
, dtype='interval[int64, both]'
# find corresponding values
df_analysis['source_cost'] = pd.cut(df_analysis['distance_travel'], bins=bins).map(dict(zip(bins, df_source['COST']))).astype(int)
# calculation
df_analysis['Difference'] = df_analysis['total_cost'] - df_analysis['source_cost']
print(df_analysis)
###
loading_station![]() |
distance_travel ![]() |
total_cost![]() |
status![]() |
source_cost ![]() |
Difference![]() |
---|---|---|---|---|---|
PUGU![]() |
40 ![]() |
4000 ![]() |
PAID![]() |
4800 ![]() |
-800 ![]() |
PUGU![]() |
80 ![]() |
3200 ![]() |
PAID![]() |
4800 ![]() |
-1600 ![]() |
MOROGORO![]() |
50 ![]() |
5000 ![]() |
PAID![]() |
4800 ![]() |
200 ![]() |
MOROGORO![]() |
220 ![]() |
30400 ![]() |
PAID![]() |
6900 ![]() |
23500 ![]() |
DODOMA![]() |
150 ![]() |
5100 ![]() |
PAID![]() |
5900 ![]() |
-800 ![]() |
KIGOMA![]() |
90 ![]() |
2345 ![]() |
PAID![]() |
4800 ![]() |
-2455 ![]() |
DODOMA![]() |
230 ![]() |
6000 ![]() |
PAID![]() |
7200 ![]() |
-1200 ![]() |
DODOMA![]() |
180 ![]() |
16500 ![]() |
PAID![]() |
6200 ![]() |
10300 ![]() |
KIGOMA![]() |
32 ![]() |
3000 ![]() |
PAID![]() |
4800 ![]() |
-1800 ![]() |
DODOMA![]() |
45 ![]() |
6000 ![]() |
PAID![]() |
4800 ![]() |
1200 ![]() |
DODOMA![]() |
65 ![]() |
5000 ![]() |
PAID![]() |
4800 ![]() |
200 ![]() |
KIGOMA![]() |
77 ![]() |
1000 ![]() |
PAID![]() |
4800 ![]() |
-3800 ![]() |
KIGOMA![]() |
90 ![]() |
4000 ![]() |
PAID![]() |
4800 ![]() |
-800 ![]() |
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.