
Python pandas: how to merge/join two tables based on a substring?

Let's say I have two dataframes, and the column names for both are:

table 1 columns:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]

I want to merge the two tables when either 'ShipNumber' or 'TrackNumber' from table 2 can be found in 'Comment' from table 1.

Also, I'll explain why

merged = pd.merge(df1,df2,how='left',left_on='Comment',right_on='ShipNumber')

does not work in this case.

The "Comment" column is a block of free text that can contain anything, so I cannot do an exact match like tab2.ShipNumber == tab1.Comment. Instead, tab2.ShipNumber or tab2.TrackNumber may appear as a substring somewhere inside tab1.Comment.

The desired output table should have all the unique columns from two tables:

output table column names:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight, AmountReceived]

I hope my question makes sense... Any help is really really appreciated!

note

The ultimate goal is to merge two sets with (shipnumber==shipnumber |tracknumber == tracknumber | shipnumber in comments | tracknumber in comments) , but I've created two subsets for the first two conditions, and now I'm working on the 3rd and 4th conditions.
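For reference, the 3rd and 4th conditions can be sketched as a cross join followed by a substring filter. This is only a minimal sketch under assumptions: the sample values are made up, the `_t2` suffix is an arbitrary choice, and `how='cross'` needs pandas 1.2 or later. A cross join builds every pair of rows, so it is probably only practical when the tables are not too large.

```python
import pandas as pd

# Made-up sample data; only the column names come from the question.
df1 = pd.DataFrame({
    'ShipNumber': ['S100', 'S200', 'S300'],
    'TrackNumber': ['T100', 'T200', 'T300'],
    'Comment': ['re-sent as S900, see T901', 'nothing relevant', 'tracking T555 replaced'],
    'ShipDate': ['2020-01-01', '2020-01-02', '2020-01-03'],
    'Quantity': [1, 2, 3],
    'Weight': [1.5, 2.5, 3.5],
})
df2 = pd.DataFrame({
    'ShipNumber': ['S900', 'S777'],
    'TrackNumber': ['T900', 'T555'],
    'AmountReceived': [10, 20],
})

# Pair every df1 row with every df2 row; table 2's overlapping
# columns get the (arbitrary) suffix '_t2'.
pairs = df1.merge(df2, how='cross', suffixes=('', '_t2'))

# Keep pairs where table 2's ShipNumber or TrackNumber occurs in Comment.
hit = pairs.apply(
    lambda r: str(r['ShipNumber_t2']) in str(r['Comment'])
    or str(r['TrackNumber_t2']) in str(r['Comment']),
    axis=1,
)
keys = list(df1.columns)
matches = pairs.loc[hit, keys + ['AmountReceived']]

# Left-join back so that df1 rows without any match are kept (NaN amount).
result = df1.merge(matches, on=keys, how='left')
print(result)
```

Joining back on all of table 1's columns assumes its rows are unique; with duplicate rows, a dedicated row id column would be a safer key.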

Here is an example based on some made up data. Ignore the complete nonsense I've put in the dataframes, I was just typing in random stuff to get a sample df to play with.

import pandas as pd
import re

x = pd.DataFrame({'Location': ['Chicago','Houston','Los Angeles','Boston','NYC','blah'],
                  'Comments': ['chicago is winter','la is summer','boston is winter','dallas is spring','NYC is spring','seattle foo'],
                  'Dir':      ['N','S','E','W','S','E']})

y = pd.DataFrame({'Location': ['Miami','Dallas'],
                  'Season':   ['Spring','Fall']})


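def findval(row):
    # Lower-case all three values so the substring test is case-insensitive.
    # Missing values (NaN) become the string 'nan'.
    comment, location, season = (str(value).lower() for value in row)
    return location in comment or season in comment

# Stack the two dataframes; columns missing on one side are filled with NaN.
merged = pd.concat([x, y])

# Flag rows whose Location or Season appears inside Comments.
merged['Helper'] = merged[['Comments', 'Location', 'Season']].apply(findval, axis=1)
print(merged)
filtered = merged[merged['Helper']]
print(filtered)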

Rather than joining, you can concatenate the dataframes and then create a helper column that records whether the string in one column is found in another. Once you have that helper column, just filter for the rows where it is True.
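A different technique from the helper column, sketched here only as an assumption about what might fit the question's tables: build one regular expression from all of table 2's identifiers, pull the first identifier found in each comment with `Series.str.extract`, and then do an ordinary merge on it. The sample values and the `MatchedId` column name are made up for illustration.

```python
import re
import pandas as pd

# Made-up sample data; 'MatchedId' below is a hypothetical helper column.
comments = pd.DataFrame({
    'Comment': ['re-sent as S900', 'no ids here', 'see T555 for status'],
})
lookup = pd.DataFrame({
    'ShipNumber': ['S900', 'S777'],
    'TrackNumber': ['T900', 'T555'],
    'AmountReceived': [10, 20],
})

# One row per identifier, whether it is a ship number or a track number.
ids = lookup.melt(
    id_vars='AmountReceived',
    value_vars=['ShipNumber', 'TrackNumber'],
    value_name='MatchedId',
)[['MatchedId', 'AmountReceived']]

# Alternation of all identifiers, escaped so they are matched literally.
# Longer identifiers come first so they win over shorter prefixes.
alternatives = sorted(ids['MatchedId'].astype(str), key=len, reverse=True)
pattern = '(' + '|'.join(re.escape(a) for a in alternatives) + ')'

# Extract the first identifier found in each comment (NaN if none).
comments['MatchedId'] = comments['Comment'].str.extract(pattern, expand=False)

# An exact-match merge now works because the substring has been isolated.
merged_by_id = comments.merge(ids, on='MatchedId', how='left')
print(merged_by_id)
```

Note that this keeps only the first identifier per comment; comments that mention several identifiers would need `str.findall` or `str.extractall` instead.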

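Why not do something like this:

import pandas as pd

# df1 and df2 are the two dataframes from the question.
def merge_function(row):
    # Look for the first df2 row whose ShipNumber or TrackNumber
    # occurs as a substring of this row's Comment.
    comment = str(row['Comment'])
    for _, df2_row in df2.iterrows():
        if (str(df2_row['ShipNumber']) in comment
                or str(df2_row['TrackNumber']) in comment):
            row['AmountReceived'] = df2_row['AmountReceived']
            break
    return row

df1['AmountReceived'] = 0  # fill with zeros; overwritten where a match is found
new_df = df1.apply(merge_function, axis=1)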

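You could use a library such as Whoosh to build an index over the comment field, and then run a text search for each shipment number you want to find.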
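The indexing idea can be illustrated without Whoosh itself. The following is a plain-Python stand-in (a tiny inverted index), not Whoosh's API, and the sample comments are made up. It assumes identifiers appear in comments as whole alphanumeric tokens, so it will not find an identifier embedded inside a longer token.

```python
import re
from collections import defaultdict

# Made-up comments standing in for table 1's Comment column.
comments = [
    're-sent as S900, see T901',
    'nothing relevant',
    'tracking T555 replaced',
    'S900 arrived late',
]

def tokenize(text):
    # Split into lower-case alphanumeric tokens.
    return re.findall(r'[a-z0-9]+', text.lower())

# Inverted index: token -> set of row positions whose comment contains it.
index = defaultdict(set)
for row, text in enumerate(comments):
    for token in tokenize(text):
        index[token].add(row)

def search(identifier):
    # Return the sorted row positions whose comment contains the identifier
    # as a whole token (case-insensitive).
    return sorted(index.get(identifier.lower(), set()))

print(search('S900'))
print(search('T555'))
```

Each lookup is a dictionary access rather than a scan over every comment, which is the main reason to index when there are many shipment numbers to search for.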
