
Vlookup using Python when the data is given in ranges

I have two Excel files. I want to perform a VLOOKUP and find the difference in costs using Python, or even Excel.

My files look like this:

source_data.xlsx contains the distance covered and the corresponding price. For example, a distance from 1 to 100 should be charged 4800, and a distance from 101 to 120 should be charged 5100.

DISTANCE     COST
1-100        4800
101-120      5100
121-140      5500
141-160      5900
161-180      6200
181-200      6600
210-220      6900
221-240      7200

Analysis.xlsx:

loading_station  distance_travel     total_cost    status
PUGU                  40                4000       PAID
PUGU                  80                3200       PAID
MOROGORO              50                5000       PAID
MOROGORO              220               30400      PAID
DODOMA                150               5100       PAID
KIGOMA                90                2345       PAID
DODOMA                230               6000       PAID
DODOMA                180               16500      PAID
KIGOMA                32                3000       PAID
DODOMA                45                6000       PAID
DODOMA                65                5000       PAID
KIGOMA                77                1000       PAID
KIGOMA                90                4000       PAID

The actual cost for each distance is given in source_data.xlsx. I want to check whether each cost in Analysis.xlsx corresponds to the actual value, so that I can detect underpayment and overpayment.

The desired output should look like this, with two columns added: source_cost, which is taken from source_data.xlsx using a VLOOKUP, and Difference, which is the difference between total_cost and source_cost.

loading_station  distance_travel  total_cost  status  source_cost  Difference
PUGU                  40             4000      PAID      4800         -800
PUGU                  80             3200      PAID      4800        -1600
MOROGORO              50             5000      PAID      4800          200
MOROGORO              220            30400     PAID      6900        23500
DODOMA                150            5100      PAID      5900         -800
KIGOMA                90             2345      PAID      4800        -2455
DODOMA                230            6000      PAID      7200        -1200
DODOMA                180            16500     PAID      6200        10300
KIGOMA                32             3000      PAID      4800        -1800
DODOMA                45             6000      PAID      4800         1200
DODOMA                65             5000      PAID      4800          200
KIGOMA                77             1000      PAID      4800        -3800
KIGOMA                90             4000      PAID      4800         -800

My code so far:

# import pandas
import pandas as pd

# read excel data
source_data = pd.read_excel('source_data.xlsx')
analysis_file = pd.read_excel('analysis.xlsx')
source_data.head(5)
analysis_file.head(5)

You can use merge_asof:

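# keep only the upper bound of each range, e.g. "101-120" -> 120
source_data["DISTANCE"] = source_data["DISTANCE"].str.split("-").str[1].astype("int64")
# match each distance to the nearest upper bound that is >= it;
# merge_asof needs the left frame sorted by its key, so the original
# row order is saved in "index" and restored afterwards
res = (pd.merge_asof(analysis_file.reset_index().sort_values("distance_travel"),
                     source_data,
                     left_on="distance_travel",
                     right_on="DISTANCE",
                     direction="forward")
       .set_index("index")
       .sort_index())
res["Difference"] = res["total_cost"] - res["COST"]

print(res)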

      loading_station  distance_travel  total_cost status  DISTANCE  COST  Difference
index
0                PUGU               40        4000   PAID       100  4800        -800
1                PUGU               80        3200   PAID       100  4800       -1600
2            MOROGORO               50        5000   PAID       100  4800         200
3            MOROGORO              220       30400   PAID       220  6900       23500
4              DODOMA              150        5100   PAID       160  5900        -800
5              KIGOMA               90        2345   PAID       100  4800       -2455
6              DODOMA              230        6000   PAID       240  7200       -1200
7              DODOMA              180       16500   PAID       180  6200       10300
8              KIGOMA               32        3000   PAID       100  4800       -1800
9              DODOMA               45        6000   PAID       100  4800        1200
10             DODOMA               65        5000   PAID       100  4800         200
11             KIGOMA               77        1000   PAID       100  4800       -3800
12             KIGOMA               90        4000   PAID       100  4800        -800

Note that this does not take care of a traveled distance of 0: with direction="forward", it is matched to the first band (upper bound 100) like any other value below 100. You need to handle that case separately.
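One possible way to handle distances that fall outside every band (0, or the 201-209 gap in the sample table) is to keep the lower bound as well and blank out matches that lie below it. The following is a minimal sketch, not part of the original answer; the column names LOW and HIGH and the small inline frames are assumptions made for illustration.

```python
import pandas as pd

# Inline stand-ins for the two Excel files (an assumed subset of the sample data,
# plus two made-up distances, 0 and 205, that belong to no band).
source_data = pd.DataFrame({
    "DISTANCE": ["1-100", "101-120", "181-200", "210-220"],
    "COST": [4800, 5100, 6600, 6900],
})
analysis_file = pd.DataFrame({
    "distance_travel": [40, 0, 205, 220],
    "total_cost": [4000, 1000, 7000, 30400],
})

# Keep both ends of each band; LOW and HIGH are hypothetical column names.
bounds = source_data["DISTANCE"].str.split("-", expand=True).astype("int64")
source_data["LOW"] = bounds[0]
source_data["HIGH"] = bounds[1]

res = (pd.merge_asof(analysis_file.reset_index().sort_values("distance_travel"),
                     source_data.sort_values("HIGH"),
                     left_on="distance_travel",
                     right_on="HIGH",
                     direction="forward")
       .set_index("index")
       .sort_index())

# A forward match only guarantees distance <= HIGH, so drop the cost
# wherever the distance is below the band's lower bound.
res["COST"] = res["COST"].where(res["distance_travel"] >= res["LOW"])
res["Difference"] = res["total_cost"] - res["COST"]

print(res[["distance_travel", "total_cost", "COST", "Difference"]])
```

Rows whose distance belongs to no band end up with NaN in COST and Difference, which makes them easy to filter with isna() and review by hand.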

Since this is a binning problem, I suggest using cut() to find the corresponding value.

import pandas as pd

# df_source and df_analysis are the DataFrames read from source_data.xlsx and Analysis.xlsx
# create bins: split each "low-high" label into integer lower and upper bounds
bounds = df_source['DISTANCE'].str.split('-', expand=True).astype(int)
bh = bounds[0].to_numpy()
bt = bounds[1].to_numpy()
bins = pd.IntervalIndex.from_arrays(bh, bt, closed='both')

print(bins)
###
IntervalIndex([[1, 100], [101, 120], [121, 140], [141, 160], [161, 180], [181, 200], [210, 220], [221, 240]], dtype='interval[int64, both]')

As shown, the result is an IntervalIndex with dtype='interval[int64, both]', so both endpoints of each range are included.
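To see what the closed intervals mean in practice, the short sketch below looks up a few distances directly on an IntervalIndex. It is only an illustration under assumptions: the four bands are a subset of the sample table, and the probe distances (including 205, which falls in the 201-209 gap) are made up.

```python
import pandas as pd

# Closed intervals built from four of the sample bands.
bins = pd.IntervalIndex.from_arrays([1, 101, 181, 210],
                                    [100, 120, 200, 220],
                                    closed="both")

# get_indexer returns the position of the interval containing each value,
# or -1 when no interval contains it.
positions = bins.get_indexer([1, 100, 101, 205, 220])
print(positions.tolist())
```

Both endpoints match their own band, while a value that lies between two bands gets -1; pd.cut reports such values as NaN.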


# find corresponding values
df_analysis['source_cost'] = pd.cut(df_analysis['distance_travel'], bins=bins).map(dict(zip(bins, df_source['COST']))).astype(int)

# calculation
df_analysis['Difference'] = df_analysis['total_cost'] - df_analysis['source_cost']

print(df_analysis)
###
loading_station  distance_travel  total_cost status  source_cost  Difference
           PUGU               40        4000   PAID         4800        -800
           PUGU               80        3200   PAID         4800       -1600
       MOROGORO               50        5000   PAID         4800         200
       MOROGORO              220       30400   PAID         6900       23500
         DODOMA              150        5100   PAID         5900        -800
         KIGOMA               90        2345   PAID         4800       -2455
         DODOMA              230        6000   PAID         7200       -1200
         DODOMA              180       16500   PAID         6200       10300
         KIGOMA               32        3000   PAID         4800       -1800
         DODOMA               45        6000   PAID         4800        1200
         DODOMA               65        5000   PAID         4800         200
         KIGOMA               77        1000   PAID         4800       -3800
         KIGOMA               90        4000   PAID         4800        -800
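Since the question asks to detect underpayment and overpayment, the sign of Difference can be turned into a label. This is a rough sketch rather than part of either answer: the column name payment_check is hypothetical, the first two rows come from the sample data, and the third row is an assumed exact payment added to show the default branch.

```python
import numpy as np
import pandas as pd

# Small frame in the shape of the desired output.
df_analysis = pd.DataFrame({
    "loading_station": ["PUGU", "MOROGORO", "DODOMA"],
    "total_cost": [4000, 30400, 5900],
    "source_cost": [4800, 6900, 5900],
})
df_analysis["Difference"] = df_analysis["total_cost"] - df_analysis["source_cost"]

# Negative difference: paid less than the source cost; positive: paid more.
df_analysis["payment_check"] = np.select(
    [df_analysis["Difference"] < 0, df_analysis["Difference"] > 0],
    ["underpaid", "overpaid"],
    default="exact",
)

print(df_analysis)
```

The labelled column can then be used to filter, for example df_analysis[df_analysis["payment_check"] == "underpaid"].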
