
Pandas merge on part of two columns

I have two dataframes with a common column called 'upc' as such:

df1:

 upc 
 23456793749
 78907809834
 35894796324
 67382808404
 93743008374

df2:

 upc
 4567937
 9078098
 8947963
 3828084
 7430083

Notice that each df2 'upc' value is the middle 7 digits of the corresponding df1 'upc' value (the first two and last two digits are dropped). Both df1 and df2 have other columns not shown above. I want to do an inner merge on 'upc', but matching only on those middle 7 digits. How can I achieve this?

Use str.extract to find which df2 value each df1 value contains, then use the extracted value as the key for merging with df2:

df1['keyfordf2']=df1.upc.astype(str).str.extract(r'({})'.format('|'.join(df2.upc.astype(str).tolist())),expand=False)


df1.merge(df2.astype(str),left_on='keyfordf2',right_on='upc')
Out[273]: 
         upc_x keyfordf2    upc_y
0  23456793749   4567937  4567937
1  78907809834   9078098  9078098
2  35894796324   8947963  8947963
3  67382808404   3828084  3828084
4  93743008374   7430083  7430083
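
A slightly more defensive variant of the same idea, as a minimal sketch. It assumes the sample data from the question; the column name `key` and the suffixes are only illustrative. Note that a regex match finds the df2 code anywhere inside the df1 value, not necessarily in the middle, so this is looser than slicing by position.

```python
import re
import pandas as pd

df1 = pd.DataFrame({'upc': [23456793749, 78907809834, 35894796324,
                            67382808404, 93743008374]})
df2 = pd.DataFrame({'upc': [4567937, 9078098, 8947963, 3828084, 7430083]})

# Build one alternation pattern from the df2 codes. re.escape changes nothing
# for pure digits, but keeps the pattern safe if codes ever contain regex
# metacharacters.
codes = df2['upc'].astype(str)
pattern = '(' + '|'.join(map(re.escape, codes)) + ')'

# With a single capture group, expand=False returns a Series;
# rows with no match get NaN and drop out of the inner merge.
key = df1['upc'].astype(str).str.extract(pattern, expand=False)

# assign() adds the (hypothetical) 'key' column on copies, so df1 and df2
# themselves are left untouched.
merged = (
    df1.assign(key=key)
       .merge(df2.assign(key=codes), on='key', suffixes=('_df1', '_df2'))
)
print(merged)
```

Because the pattern grows with the number of rows in df2, this is probably only practical for small lookup tables.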

1) Create both dataframes and convert them to string type.

2) Call pd.merge on the two frames, passing the middle 7 characters of df1's 'upc1' series (sliced with .str[2:-2]) as left_on.

import pandas as pd

df1 = pd.DataFrame(data=[ 
 23456793749,
 78907809834,
 35894796324,
 67382808404,
 93743008374,], columns = ['upc1'])
df1 = df1.astype(str)

df2 = pd.DataFrame(data=[ 
 4567937,
 9078098,
 8947963,
 3828084,
 7430083,], columns = ['upc2'])
df2 = df2.astype(str)

pd.merge(df1, df2, left_on=df1['upc1'].str[2:-2], right_on='upc2', how='inner')

Out[5]: 
          upc1     upc2
0  23456793749  4567937
1  78907809834  9078098
2  35894796324  8947963
3  67382808404  3828084
4  93743008374  7430083
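
The slice `[2:-2]` only returns the middle 7 characters when every code is exactly 11 characters long. A minimal sketch that checks this before merging, assuming the same sample data (the length check is an addition, not part of the answer above):

```python
import pandas as pd

df1 = pd.DataFrame({'upc1': ['23456793749', '78907809834', '35894796324',
                             '67382808404', '93743008374']})
df2 = pd.DataFrame({'upc2': ['4567937', '9078098', '8947963',
                             '3828084', '7430083']})

# Fail early if any code is not 11 characters, since the positional
# slice below would then pick the wrong digits.
lengths = df1['upc1'].str.len()
assert lengths.eq(11).all(), "unexpected upc length"

# Drop the first two and last two characters to get the middle 7.
inner = df1['upc1'].str[2:-2]

# left_on accepts an array-like of the same length as df1.
result = pd.merge(df1, df2, left_on=inner, right_on='upc2', how='inner')
print(result)
```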

You could make a new column in df1 and merge on that.

import pandas as pd
df1 = pd.DataFrame({'upc': [23456793749, 78907809834, 35894796324, 67382808404, 93743008374]})
df2 = pd.DataFrame({'upc': [4567937, 9078098, 8947963, 3828084, 7430083]})

df1['upc_old'] = df1['upc']  # in case you still need the original (longer) upc column
df1['upc'] = df1['upc'].astype(str).str[2:-2].astype(int)

merged_df = pd.merge(df1, df2, on='upc')
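
If you would rather not overwrite df1's 'upc' column, the same merge can be sketched with a temporary key column. This is only one possible variant; the name `upc_key` and the suffixes are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'upc': [23456793749, 78907809834, 35894796324,
                            67382808404, 93743008374]})
df2 = pd.DataFrame({'upc': [4567937, 9078098, 8947963, 3828084, 7430083]})

# assign() returns new frames, so df1 and df2 keep their original columns.
left = df1.assign(upc_key=df1['upc'].astype(str).str[2:-2].astype(int))
right = df2.assign(upc_key=df2['upc'])

# Both frames still have a 'upc' column, so suffixes tell them apart.
merged_df = left.merge(right, on='upc_key', suffixes=('_long', '_short'))
print(merged_df)
```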


 