Python pandas: how to merge/join two tables based on a substring?

Question

Let's say I have two dataframes, and the column names for both are:

table 1 columns:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight]
table 2 columns:
[ShipNumber, TrackNumber, AmountReceived]

I want to merge the two tables when either 'ShipNumber' or 'TrackNumber' from table 2 can be found in 'Comment' from table 1.

Also, I'll explain why

merged = pd.merge(df1,df2,how='left',left_on='Comment',right_on='ShipNumber')

does not work in this case.

The "Comment" column is a block of free text that can contain anything, so an exact match like tab2.ShipNumber == tab1.Comment won't work: tab2.ShipNumber or tab2.TrackNumber may appear only as a substring of tab1.Comment.

The desired output table should have all the unique columns from two tables:

output table column names:
[ShipNumber, TrackNumber, Comment, ShipDate, Quantity, Weight, AmountReceived]

I hope my question makes sense... Any help is really really appreciated!

Note

The ultimate goal is to merge the two sets on (shipnumber == shipnumber | tracknumber == tracknumber | shipnumber in comments | tracknumber in comments). I've already created two subsets for the first two conditions, and now I'm working on the 3rd and 4th.

Answer 1

Here is an example based on some made-up data. Ignore the complete nonsense I've put in the dataframes; I was just typing in random stuff to get a sample df to play with.

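import pandas as pd

x = pd.DataFrame({'Location': ['Chicago','Houston','Los Angeles','Boston','NYC','blah'],
                  'Comments': ['chicago is winter','la is summer','boston is winter','dallas is spring','NYC is spring','seattle foo'],
                  'Dir':      ['N','S','E','W','S','E']})

y = pd.DataFrame({'Location': ['Miami','Dallas'],
                  'Season':   ['Spring','Fall']})


def findval(row):
    # Lower-case everything so the substring test is case-insensitive.
    # Missing values (NaN after the concat) become the string 'nan'.
    comment, location, season = (str(value).lower() for value in row)
    return location in comment or season in comment

# Stack the two frames; columns missing from one frame are filled with NaN.
# ignore_index=True avoids duplicate index labels in the result.
merged = pd.concat([x, y], ignore_index=True)

# Helper is True when Location or Season occurs inside Comments on the same row.
merged['Helper'] = merged[['Comments','Location','Season']].apply(findval, axis=1)
print(merged)
filtered = merged[merged['Helper']]
print(filtered)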

Rather than joining, you can concatenate the dataframes and then create a helper column that records whether the string in one column is found in another. Once you have that helper column, just keep the rows where it is True.
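For the tables in the question, a related idea can be sketched as follows. This is a different technique from the concat code above: it pairs every row of df1 with every row of df2 and then keeps the pairs where a number from df2 occurs in Comment. The sample values, the _key helper column, the _2 suffix and the found_in_comment function are all made up for illustration.

```python
import pandas as pd

# Hypothetical sample data using the question's column names.
df1 = pd.DataFrame({
    "ShipNumber": ["S1", "S2", "S3"],
    "TrackNumber": ["T1", "T2", "T3"],
    "Comment": ["resend of S9, see notes", "replaces tracking T8", "nothing relevant"],
    "Quantity": [1, 2, 3],
})
df2 = pd.DataFrame({
    "ShipNumber": ["S9", "S7"],
    "TrackNumber": ["T9", "T8"],
    "AmountReceived": [10.0, 20.0],
})

# Pair every df1 row with every df2 row through a constant helper key.
# df2's overlapping columns get the "_2" suffix.
pairs = (
    df1.assign(_key=1)
    .merge(df2.assign(_key=1), on="_key", suffixes=("", "_2"))
    .drop(columns="_key")
)

def found_in_comment(row):
    # True when df2's ShipNumber or TrackNumber is a substring of the Comment.
    comment = str(row["Comment"])
    return str(row["ShipNumber_2"]) in comment or str(row["TrackNumber_2"]) in comment

matched = pairs[pairs.apply(found_in_comment, axis=1)]

# Keep df1's columns plus AmountReceived, as in the desired output.
result = matched.drop(columns=["ShipNumber_2", "TrackNumber_2"]).reset_index(drop=True)
print(result)
```

The intermediate frame has len(df1) * len(df2) rows, so this only suits tables of modest size. A plain substring test also treats S1 as found inside S10, so real identifiers may need word-boundary matching.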

Answer 2

Why not do something like this:

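# Assumes df1 and df2 are the two dataframes from the question.
def merge_function(row):
    # Find the first df2 row whose ShipNumber or TrackNumber appears
    # as a substring of this row's Comment.
    comment = str(row['Comment'])
    for _, df2_row in df2.iterrows():
        if (str(df2_row['ShipNumber']) in comment
                or str(df2_row['TrackNumber']) in comment):
            row['AmountReceived'] = df2_row['AmountReceived']
            break
    return row

df1['AmountReceived'] = 0  # default when nothing matches
new_df = df1.apply(merge_function, axis=1)  # axis=1 passes one row at a time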

Answer 3

You could use a library like Whoosh to index the comment field, and then run a text search for each shipment number you want to look up.
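If a full-text index is more than the job needs, a lighter alternative can be sketched with pandas alone. This is not Whoosh and does not use its API. It builds one regular expression from all the numbers in df2, extracts the first hit from each comment, and then does an ordinary merge on that extracted value. The sample data and the lookup, pattern and MatchedId names are assumptions made for illustration.

```python
import re
import pandas as pd

# Hypothetical sample data using the question's column names.
df1 = pd.DataFrame({
    "ShipNumber": ["S1", "S2", "S3"],
    "Comment": ["resend of S9, see notes", "replaces tracking T8", "nothing relevant"],
})
df2 = pd.DataFrame({
    "ShipNumber": ["S9", "S7"],
    "TrackNumber": ["T9", "T8"],
    "AmountReceived": [10.0, 20.0],
})

# Long format: one row per identifier (ship or track number) with its amount.
lookup = df2.melt(
    id_vars="AmountReceived",
    value_vars=["ShipNumber", "TrackNumber"],
    value_name="MatchedId",
)[["MatchedId", "AmountReceived"]]

# One alternation of all identifiers, longest first so that a longer
# identifier wins over a shorter one that is its prefix.
ids = sorted(lookup["MatchedId"].astype(str), key=len, reverse=True)
pattern = "(" + "|".join(re.escape(i) for i in ids) + ")"

# First identifier found in each comment (NaN when none is found).
df1["MatchedId"] = df1["Comment"].str.extract(pattern, expand=False)

# An ordinary left merge on the extracted identifier.
merged = df1.merge(lookup, how="left", on="MatchedId")
print(merged)
```

Each comment is scanned once, so this avoids comparing every row of df1 against every row of df2. It keeps only the first identifier found in a comment, and if the same identifier appears more than once in lookup the left merge will repeat that df1 row.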

3 answers

solution1
0 2017-08-25 21:10:28

solution2
0 2017-08-25 21:29:18

solution3
0 2017-08-25 22:24:56
