I have a question about pandas DataFrames. There are two tables: the first is a mapping table, and the second holds transactional data.
In the mapping table, two columns, 'GL From' and 'GL To', define a range.
Below are the two dataframes:
1) df1 is the mapping table, with ranges of account numbers that map to specific tax types.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'Category': ['FBT Tax', 'CIT', 'GST', 'Stamp Duty', 'Sales Tax'],
                    'GL From': ['10000000', '20000000', '30000000', '40000000', '50000000'],
                    'GL To': ['10009999', '20009999', '30009999', '40009999', '50009999']})
Category GL From GL To
0 FBT Tax 10000000 10009999
1 CIT 20000000 20009999
2 GST 30000000 30009999
3 Stamp Duty 40000000 40009999
4 Sales Tax 50000000 50009999
2) df2 is the transactional table (there are more columns, which I skipped for this demo), with the account number that I want to look up in the ranges in df1.
df2 = pd.DataFrame({'Date': ['1/10/19', '2/10/19', '3/10/19', '10/11/19', '12/12/19', '30/08/19', '01/07/19'],
                    'GL Account': ['20000456', '30000199', '20004689', '40008900', '50000876', '10000325', '70000199'],
                    'Product LOB': ['Computer', 'Mobile Phone', 'TV', 'Fridge', 'Dishwasher', 'Tablet', 'Table']})
Date GL Account Product LOB
0 1/10/19 20000456 Computer
1 2/10/19 30000199 Mobile Phone
2 3/10/19 20004689 TV
3 10/11/19 40008900 Fridge
4 12/12/19 50000876 Dishwasher
5 30/08/19 10000325 Tablet
6 01/07/19 70000199 Table
In both df1 and df2, the account numbers are stored as strings, so I created a simple function to convert them to integers.
def to_integer(col):
    return pd.to_numeric(col, downcast='integer')
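For reference, a quick illustration of what the helper returns (my own sketch, not part of the question): `downcast='integer'` picks the smallest integer dtype that can hold the values.

```python
import pandas as pd

def to_integer(col):
    return pd.to_numeric(col, downcast='integer')

s = pd.Series(['10000000', '20000000'])
out = to_integer(s)
print(out.dtype)  # a downcast integer dtype (int32 for these values)
```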
I have tried both np.dot and .loc to map the Category column, but I encountered this error: ValueError: Can only compare identically-labeled Series objects (the elementwise comparison fails because df2['GL Account'] and df1['GL From'] are Series of different lengths with different indexes).
result = np.dot((to_integer(df2['GL Account']) >= to_integer(df1['GL From'])) &
                (to_integer(df2['GL Account']) <= to_integer(df1['GL To'])), df1['Category'])
result = df1.loc[(to_integer(df2['GL Account']) >= to_integer(df1['GL From'])) &
                 (to_integer(df2['GL Account']) <= to_integer(df1['GL To'])), "Category"]
What I want to achieve is like below:
Date GL Account Product LOB Category
0 1/10/19 20000456 Computer CIT
1 2/10/19 30000199 Mobile Phone GST
2 3/10/19 20004689 TV CIT
3 10/11/19 40008900 Fridge Stamp Duty
4 12/12/19 50000876 Dishwasher Sales Tax
5 30/08/19 10000325 Tablet FBT Tax
6 01/07/19 70000199 Table NaN
Is there any way to map between two DataFrames based on a From-To range?
In case your data follows the pattern provided (each range spans one full 10000000 block), you can create a column holding the lower bound of each account and then merge on it:
df1['GL From'] = df1['GL From'].astype(int)  # make it integer
### create the lower bound
df2['lbound'] = df2['GL Account'].astype(int)//10000000*10000000
### merge
df2.merge(df1, left_on='lbound', right_on='GL From')\
   .drop(['lbound', 'GL From', 'GL To'], axis=1)
Output
Date GL Account Product LOB Category
0 1/10/19 20000456 Computer CIT
1 3/10/19 20004689 TV CIT
2 2/10/19 30000199 Mobile Phone GST
3 10/11/19 40008900 Fridge Stamp Duty
4 12/12/19 50000876 Dishwasher Sales Tax
5 30/08/19 10000325 Tablet FBT Tax
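As an aside (my own sketch, not part of the original answer): when df1 is sorted by 'GL From', `pd.merge_asof` can do this range lookup without relying on the block pattern. It matches each account to the nearest lower 'GL From'; a final check invalidates matches that overshoot 'GL To':

```python
import pandas as pd

df1 = pd.DataFrame({'Category': ['FBT Tax', 'CIT', 'GST', 'Stamp Duty', 'Sales Tax'],
                    'GL From': [10000000, 20000000, 30000000, 40000000, 50000000],
                    'GL To':   [10009999, 20009999, 30009999, 40009999, 50009999]})
df2 = pd.DataFrame({'GL Account': [20000456, 30000199, 70000199]})

# merge_asof needs both keys sorted ascending; the default direction='backward'
# picks the closest 'GL From' that is <= 'GL Account'
res = pd.merge_asof(df2.sort_values('GL Account'), df1,
                    left_on='GL Account', right_on='GL From')
# drop the category again where the account falls above the matched upper bound
res.loc[res['GL Account'] > res['GL To'], 'Category'] = None
```

Here 70000199 first matches the Sales Tax row but is reset to None because it exceeds 50009999.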
Added
In case the data does not follow a specific pattern, you can use np.intersect1d with np.where to find where the lower-bound and upper-bound conditions both hold, and therefore the index of the matched range. For instance:
### func to get the index where the account is greater than or equal to `GL From`
### and lower than or equal to `GL To`; returns NaN when no range matches
def match_ix(acc_no):
    ix = np.intersect1d(np.where(acc_no >= df1['GL From'].values),
                        np.where(acc_no <= df1['GL To'].values))
    return ix[0] if ix.size else np.nan

## Apply to dataframe (string comparison works here because all account
## numbers have the same number of digits)
df2['right_ix'] = df2['GL Account'].map(match_ix)

## Merge using the index. Use how='left' for the left join to preserve unmatched rows
df2.merge(df1, left_on='right_ix', right_on=df1.index, how='left')\
   .drop(['right_ix', 'GL From', 'GL To'], axis=1)
Output
Date GL Account Product LOB Category
0 1/10/19 20000456 Computer CIT
1 2/10/19 30000199 Mobile Phone GST
2 3/10/19 20004689 TV CIT
3 10/11/19 40008900 Fridge Stamp Duty
4 12/12/19 50000876 Dishwasher Sales Tax
5 30/08/19 10000325 Tablet FBT Tax
6 01/07/19 70000199 Table NaN
In terms of performance, you will get something quicker, without the MemoryError you might hit on full cartesian joins:
### Using 100* the sample provided
tempdf2 = pd.concat([df2]*100)
tempdf1 = pd.concat([df1]*100)
#23 ms ± 170 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
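A fully vectorised variant of the same idea (my own sketch, under the assumption that the ranges are sorted ascending and non-overlapping): `np.searchsorted` finds the candidate row for every account in one shot, followed by a bounds check:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'Category': ['FBT Tax', 'CIT', 'GST', 'Stamp Duty', 'Sales Tax'],
                    'GL From': [10000000, 20000000, 30000000, 40000000, 50000000],
                    'GL To':   [10009999, 20009999, 30009999, 40009999, 50009999]})
df2 = pd.DataFrame({'GL Account': [20000456, 30000199, 20004689, 70000199]})

acc = df2['GL Account'].to_numpy()
# index of the last 'GL From' <= account (requires 'GL From' sorted ascending)
ix = np.searchsorted(df1['GL From'].to_numpy(), acc, side='right') - 1
# reject accounts below the first range or above the matched range's 'GL To'
valid = (ix >= 0) & (acc <= df1['GL To'].to_numpy()[ix])
df2['Category'] = np.where(valid, df1['Category'].to_numpy()[ix], None)
```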
pandas >= 0.25.0
We can do a cartesian merge by first assigning an artificial column called key to both frames and joining on it. Then we can use query to filter everything between the correct ranges. Notice that we use backticks (`) to reference the columns with spaces in their names; this requires pandas >= 0.25.0:
df2.assign(key=1).merge(df1.assign(key=1), on='key')\
.drop(columns='key')\
.query('`GL Account`.between(`GL From`, `GL To`)')\
.drop(columns=['GL From', 'GL To'])\
.reset_index(drop=True)
If you use a left join, replace the .query part with the following to keep the rows which didn't match in the join:
.query('`GL Account`.between(`GL From`, `GL To`) | `GL From`.isna()')
Or, without the pandas >= 0.25.0 requirement, simple boolean indexing:
mrg = df2.assign(key=1).merge(df1.assign(key=1), on='key')\
.drop(columns='key')
mrg[mrg['GL Account'].between(mrg['GL From'], mrg['GL To'])]\
.drop(columns=['GL From', 'GL To'])\
.reset_index(drop=True)
Output
Date GL Account Product LOB Category
0 1/10/19 20000456 Computer CIT
1 2/10/19 30000199 Mobile Phone GST
2 3/10/19 20004689 TV CIT
3 10/11/19 40008900 Fridge Stamp Duty
4 12/12/19 50000876 Dishwasher Sales Tax
5 30/08/19 10000325 Tablet FBT Tax
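For completeness, a hedged alternative (my own sketch) that avoids the cartesian blow-up entirely: build a `pd.IntervalIndex` over the (non-overlapping) ranges and look every account up in one call:

```python
import pandas as pd

df1 = pd.DataFrame({'Category': ['FBT Tax', 'CIT', 'GST', 'Stamp Duty', 'Sales Tax'],
                    'GL From': [10000000, 20000000, 30000000, 40000000, 50000000],
                    'GL To':   [10009999, 20009999, 30009999, 40009999, 50009999]})
df2 = pd.DataFrame({'GL Account': [20000456, 10000325, 70000199]})

# one interval per mapping row, both ends inclusive
iv = pd.IntervalIndex.from_arrays(df1['GL From'], df1['GL To'], closed='both')
pos = iv.get_indexer(df2['GL Account'])   # -1 where no interval contains the account
df2['Category'] = df1['Category'].reindex(pos).to_numpy()
```

Unmatched accounts (position -1) come back as NaN, matching the left-join behaviour above.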