
How to look up a value in a two-column range from another DataFrame? - Python Pandas DataFrame

I have a question about pandas DataFrames. I have two tables: the first is a mapping table and the second holds transactional data.

In the mapping table, two columns, From and To, define a range.

Below are the two dataframes:

1). The df1 is the mapping table with a range of account numbers to map to a specific tax type.

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Category':['FBT Tax','CIT','GST','Stamp Duty','Sales Tax'],
          'GL From':['10000000','20000000','30000000','40000000','50000000'],
          'GL To':['10009999','20009999','30009999','40009999','50009999']})


     Category   GL From     GL To
0     FBT Tax  10000000  10009999
1         CIT  20000000  20009999
2         GST  30000000  30009999
3  Stamp Duty  40000000  40009999
4   Sales Tax  50000000  50009999

2). The df2 is the transactional table (it has more columns, which I skipped for this demo), with the account number that I want to look up in the ranges in df1.

df2 = pd.DataFrame({'Date':['1/10/19','2/10/19','3/10/19','10/11/19','12/12/19','30/08/19','01/07/19'],
          'GL Account':['20000456','30000199','20004689','40008900','50000876','10000325','70000199'],
          'Product LOB':['Computer','Mobile Phone','TV','Fridge','Dishwasher','Tablet','Table']})

       Date GL Account   Product LOB
0   1/10/19   20000456      Computer
1   2/10/19   30000199  Mobile Phone
2   3/10/19   20004689            TV
3  10/11/19   40008900        Fridge
4  12/12/19   50000876    Dishwasher
5  30/08/19   10000325        Tablet
6  01/07/19   70000199        Table

In both df1 and df2, the account numbers are stored as strings, so I wrote a simple function to convert them to integers.

def to_integer(col):
    return pd.to_numeric(col,downcast='integer')

I have tried both np.dot and .loc to map the Category column, but I encountered this error: ValueError: Can only compare identically-labeled Series objects

result = np.dot((to_integer(df2['GL Account']) >= to_integer(df1['GL From'])) &
                 (to_integer(df2['GL Account']) <= to_integer(df1['GL To'])),df1['Category'])


result = df1.loc[(to_integer(df2['GL Account']) >= to_integer(df1['GL From'])) &
                 (to_integer(df2['GL Account']) <= to_integer(df1['GL To'])),"Category"]

What I want to achieve is like below:

       Date GL Account   Product LOB   Category
0   1/10/19   20000456      Computer   CIT
1   2/10/19   30000199  Mobile Phone   GST
2   3/10/19   20004689            TV   CIT
3  10/11/19   40008900        Fridge   Stamp Duty
4  12/12/19   50000876    Dishwasher   Sales Tax
5  30/08/19   10000325        Tablet   FBT Tax
6  01/07/19   70000199        Table    NaN

Is there any way to map between two dataframes based on a From-To range?

If your data follows the pattern provided (every range starts at a multiple of 10,000,000), you can create a column that holds the lower bound of each account and then merge on it:

df1['GL From'] = df1['GL From'].astype(int) #make it integer

### create lower bound
df2['lbound'] = df2['GL Account'].astype(int)//10000000*10000000

### merge
df2.merge(df1, left_on='lbound', right_on='GL From')\
               .drop(['lbound','GL From','GL To'], axis=1)

Output

    Date        GL Account  Product LOB     Category
0   1/10/19     20000456    Computer        CIT
1   3/10/19     20004689    TV              CIT
2   2/10/19     30000199    Mobile Phone    GST
3   10/11/19    40008900    Fridge          Stamp Duty
4   12/12/19    50000876    Dishwasher      Sales Tax
5   30/08/19    10000325    Tablet          FBT Tax
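
The merge above is an inner join, so the account that matches no range (70000199) disappears from the result. Below is a minimal sketch of the same lower-bound idea with a left join, which keeps that row with a missing Category. It assumes the sample data from the question; the names `ranges`, `tx`, `merged` and `result_left` are only illustrative, and the extra upper-bound check is an addition that is not part of the answer above.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Category': ['FBT Tax', 'CIT', 'GST', 'Stamp Duty', 'Sales Tax'],
                    'GL From': ['10000000', '20000000', '30000000', '40000000', '50000000'],
                    'GL To': ['10009999', '20009999', '30009999', '40009999', '50009999']})
df2 = pd.DataFrame({'Date': ['1/10/19', '2/10/19', '3/10/19', '10/11/19', '12/12/19', '30/08/19', '01/07/19'],
                    'GL Account': ['20000456', '30000199', '20004689', '40008900', '50000876', '10000325', '70000199'],
                    'Product LOB': ['Computer', 'Mobile Phone', 'TV', 'Fridge', 'Dishwasher', 'Tablet', 'Table']})

# Work on integer copies so the originals stay untouched (illustrative names).
ranges = df1.astype({'GL From': int, 'GL To': int})
tx = df2.astype({'GL Account': int})

# Lower bound of each account, assuming every range starts at a multiple of 10,000,000.
tx['lbound'] = tx['GL Account'] // 10_000_000 * 10_000_000

# A left join keeps transactions whose lower bound matches no range.
merged = tx.merge(ranges, left_on='lbound', right_on='GL From', how='left')

# Added check: blank out the category when the account lies above the upper bound.
merged.loc[merged['GL Account'] > merged['GL To'], 'Category'] = np.nan

result_left = merged.drop(columns=['lbound', 'GL From', 'GL To'])
print(result_left)
```

With the sample data the last row (account 70000199) keeps its Date and Product LOB and gets a missing Category, which is the layout asked for in the question.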

Added

If the data does not follow a specific pattern, you can use np.intersect1d with np.where to intersect the lower-bound and upper-bound matches, which gives the index of the matching range.

For instance:

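### the comparison needs integers on both sides
df1[['GL From', 'GL To']] = df1[['GL From', 'GL To']].astype(int)
df2['GL Account'] = df2['GL Account'].astype(int)

### func to get the index where account is greater or equal to `FROM` and lower or equal to `TO`
### (returns NaN when the account falls in no range)
@np.vectorize
def match_ix(acc_no):
    ix = np.intersect1d(np.where(acc_no>=df1['GL From'].values),np.where(acc_no<=df1['GL To'].values))
    return float(ix[0]) if ix.size else np.nan

## Apply to dataframe
df2['right_ix'] = match_ix(df2['GL Account'])


## Merge using the index. Use 'how=left' for the left join to preserve unmatched
df2.merge(df1, left_on='right_ix', right_index=True, how='left')\
                .drop(['right_ix','GL From','GL To'], axis=1)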

Output

    Date        GL Account  Product LOB     Category
0   1/10/19     20000456    Computer        CIT
1   3/10/19     20004689    TV              CIT
2   2/10/19     30000199    Mobile Phone    GST
3   10/11/19    40008900    Fridge          Stamp Duty
4   12/12/19    50000876    Dishwasher      Sales Tax
5   30/08/19    10000325    Tablet          FBT Tax
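
The lower-bound/upper-bound intersection can also be computed for all accounts at once with NumPy broadcasting instead of calling a function per account. The following is a minimal sketch under the assumption that the ranges do not overlap. The array names are illustrative, and the technique is a substitute for the np.intersect1d approach, not the same code. It builds a boolean matrix of shape (accounts, ranges), so memory grows with the product of the two lengths.

```python
import numpy as np
import pandas as pd

# Sample data from the question, already converted to integers (illustrative names).
gl_from = np.array([10000000, 20000000, 30000000, 40000000, 50000000])
gl_to = np.array([10009999, 20009999, 30009999, 40009999, 50009999])
categories = np.array(['FBT Tax', 'CIT', 'GST', 'Stamp Duty', 'Sales Tax'], dtype=object)
accounts = np.array([20000456, 30000199, 20004689, 40008900, 50000876, 10000325, 70000199])

# in_range[i, j] is True when account i lies inside range j (bounds inclusive).
in_range = (accounts[:, None] >= gl_from) & (accounts[:, None] <= gl_to)

has_match = in_range.any(axis=1)       # False for accounts outside every range
first_match = in_range.argmax(axis=1)  # index of the first matching range (0 if none)

# Look up the category and blank out the rows that had no match at all.
matched = pd.Series(categories[first_match]).where(has_match)
print(matched.tolist())
```

Comparing a column vector of accounts against row vectors of bounds is what avoids the "identically-labeled Series" error from the question: the arrays are aligned by shape, not by index label.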

In terms of performance, this should be quicker than a full (cartesian) join, and it avoids the MemoryError such a join can raise on large inputs:

### Using 100* the sample provided
tempdf2 = pd.concat([df2]*100)
tempdf1 = pd.concat([df1]*100)

#23 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas >= 0.25.0

We can do a cartesian merge by first assigning an artificial column called key to both dataframes and joining on it. Then we can use query to keep only the rows where the account lies within the range. Notice that we use backticks (`) to refer to columns with spaces in their names; this requires pandas >= 0.25.0:

df2.assign(key=1).merge(df1.assign(key=1), on='key')\
                 .drop(columns='key')\
                 .query('`GL Account`.between(`GL From`, `GL To`)')\
                 .drop(columns=['GL From', 'GL To'])\
                 .reset_index(drop=True)

If you use a left join, replace the .query part with the following to keep the rows that did not match in the join:

.query('`GL Account`.between(`GL From`, `GL To`) | `GL From`.isna()')
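
Note that when the merge is cartesian, every transaction is paired with every range, so `GL From` is never missing and an account that falls in no range (70000199 in the sample) is removed by the range filter. Below is a minimal sketch of one way to keep such rows: filter the cartesian product, then left-merge the hits back onto the transactions. It assumes pandas >= 1.2 for `how='cross'` (used here in place of the artificial key column) and uses illustrative names (`ranges`, `tx`, `pairs`, `hits`, `result`).

```python
import pandas as pd

df1 = pd.DataFrame({'Category': ['FBT Tax', 'CIT', 'GST', 'Stamp Duty', 'Sales Tax'],
                    'GL From': ['10000000', '20000000', '30000000', '40000000', '50000000'],
                    'GL To': ['10009999', '20009999', '30009999', '40009999', '50009999']})
df2 = pd.DataFrame({'Date': ['1/10/19', '2/10/19', '3/10/19', '10/11/19', '12/12/19', '30/08/19', '01/07/19'],
                    'GL Account': ['20000456', '30000199', '20004689', '40008900', '50000876', '10000325', '70000199'],
                    'Product LOB': ['Computer', 'Mobile Phone', 'TV', 'Fridge', 'Dishwasher', 'Tablet', 'Table']})

# Integer copies of both tables (illustrative names).
ranges = df1.astype({'GL From': int, 'GL To': int})
tx = df2.astype({'GL Account': int})

# Cartesian product: every transaction paired with every range (pandas >= 1.2).
pairs = tx.merge(ranges, how='cross')

# Keep only the pairs where the account lies inside the range (bounds inclusive).
hits = pairs[pairs['GL Account'].between(pairs['GL From'], pairs['GL To'])]

# Left-merge the hits back so that transactions without any hit are preserved.
lookup = hits[['GL Account', 'Category']].drop_duplicates()
result = tx.merge(lookup, on='GL Account', how='left')
print(result)
```

The cartesian product still has len(df2) * len(df1) rows, so this only suits tables small enough for that product to fit in memory.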


Or

Pandas < 0.25.0

Simple boolean indexing

mrg = df2.assign(key=1).merge(df1.assign(key=1), on='key')\
                       .drop(columns='key')

mrg[mrg['GL Account'].between(mrg['GL From'], mrg['GL To'])]\
                     .drop(columns=['GL From', 'GL To'])\
                     .reset_index(drop=True)

Output

       Date  GL Account   Product LOB    Category
0   1/10/19    20000456      Computer         CIT
1   2/10/19    30000199  Mobile Phone         GST
2   3/10/19    20004689            TV         CIT
3  10/11/19    40008900        Fridge  Stamp Duty
4  12/12/19    50000876    Dishwasher   Sales Tax
5  30/08/19    10000325        Tablet     FBT Tax
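
For larger tables, where a cartesian merge becomes too big, a range lookup can also be sketched with pd.IntervalIndex. This is a different technique from the answers above, shown here under the assumption that the ranges in df1 do not overlap. The names `ranges`, `tx`, `intervals` and `pos` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'Category': ['FBT Tax', 'CIT', 'GST', 'Stamp Duty', 'Sales Tax'],
                    'GL From': ['10000000', '20000000', '30000000', '40000000', '50000000'],
                    'GL To': ['10009999', '20009999', '30009999', '40009999', '50009999']})
df2 = pd.DataFrame({'Date': ['1/10/19', '2/10/19', '3/10/19', '10/11/19', '12/12/19', '30/08/19', '01/07/19'],
                    'GL Account': ['20000456', '30000199', '20004689', '40008900', '50000876', '10000325', '70000199'],
                    'Product LOB': ['Computer', 'Mobile Phone', 'TV', 'Fridge', 'Dishwasher', 'Tablet', 'Table']})

# Integer copies of both tables (illustrative names).
ranges = df1.astype({'GL From': int, 'GL To': int})
tx = df2.astype({'GL Account': int})

# One closed interval [GL From, GL To] per mapping row.
intervals = pd.IntervalIndex.from_arrays(ranges['GL From'].to_numpy(),
                                         ranges['GL To'].to_numpy(),
                                         closed='both')

# Position of the interval containing each account, or -1 when there is none.
pos = intervals.get_indexer(tx['GL Account'].to_numpy())

# Look up the category by position, then blank out the unmatched rows.
cats = ranges['Category'].to_numpy()
tx['Category'] = pd.Series(cats[pos], index=tx.index).where(pos >= 0)
print(tx)
```

Because get_indexer returns -1 for accounts outside every interval, the unmatched transaction stays in the table with a missing Category, and no intermediate product of the two tables is built.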

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 