How to merge on multiple columns and then if there is not a match, merge on different columns in pandas?

That was not easy to put into one sentence. Basically, I have two datasets I would like to combine on two data points: the name and the date. I've provided a short example here of how the data is structured: https://ethercalc.net/a4k8lejblmhe

Year    Name     Alternative Name    Favorite Pet
1998    William  Bill                Cat
1995    James    Jim                 Dog
1956    Robert   Bob                 Hamster

Year    Name     Sales
1998    William  2000
1995    Jim      3005
1956    Bob      6000

EXPECTED:

Year    Name    Sales   Favorite Pet
1998    William 2000    Cat
1995    Jim     3005    Dog
1956    Bob     6000    Hamster

However, one of the datasets has both a name and an alternative name. These are fairly large datasets, so I would like to cover all my bases by merging on the date together with both the name and the alternative name. I know how to combine on just the year and name:

nameCombined = names1.merge(names2, on=["Year", "Name"], how='left')

That being said, what is the best way to use some kind of conditional that says if there's no match between the year and the regular name, check the year and the alternative name before assigning null values for the merge?

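Left merge on ["Year", "Name"], then left merge on ["Year", "Alternative Name"] separately, then combine the two results, drop the rows that found no match, and remove duplicates.

This assumes that the original order doesn't matter. If it does, tell me and I'll show you how to keep it.

import pandas as pd

# Pass 1: match names2 on Year + Name.
nameCombined = names1[["Year", "Name", "Favorite Pet"]].merge(names2, on=["Year", "Name"], how='left')

# Pass 2: match names2 on Year + Alternative Name. Rename the column to Name first so the keys line up
# and both passes end up with the same columns (Year, Name, Favorite Pet, Sales).
AlternativeNameCombined = names1[["Year", "Alternative Name", "Favorite Pet"]].rename(columns={"Alternative Name": "Name"}).merge(names2, on=["Year", "Name"], how='left')

# Stack both passes (DataFrame.append was removed in pandas 2.0, so use pd.concat), keep only rows that
# found a match in names2, and drop rows matched twice. Rows unmatched in both passes are dropped here.
allCombined = pd.concat([nameCombined, AlternativeNameCombined]).dropna(subset=["Sales"]).drop_duplicates(subset=["Year", "Name"], keep="first").reset_index(drop=True)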
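
If the original row order of names1 matters, or rows with no match should stay in the result with a null Sales value as the question asks, one possible sketch is to do both left merges against the full names1 and fill the gaps of the first with the second. The frames below are rebuilt from the sample data in the question; `alt_lookup` and `Sales_alt` are names made up for this illustration, and the sketch assumes each (Year, Name) pair appears at most once in names2 so that both merges keep exactly one row per names1 row.

```python
import pandas as pd

# Sample data from the question.
names1 = pd.DataFrame({
    "Year": [1998, 1995, 1956],
    "Name": ["William", "James", "Robert"],
    "Alternative Name": ["Bill", "Jim", "Bob"],
    "Favorite Pet": ["Cat", "Dog", "Hamster"],
})
names2 = pd.DataFrame({
    "Year": [1998, 1995, 1956],
    "Name": ["William", "Jim", "Bob"],
    "Sales": [2000, 3005, 6000],
})

# Pass 1: left merge on Year + Name; unmatched rows get NaN in Sales.
# validate="many_to_one" raises if names2 has duplicate keys (the assumption above).
by_name = names1.merge(names2, on=["Year", "Name"], how="left",
                       validate="many_to_one")

# Pass 2: look Sales up again, this time keyed on Year + Alternative Name.
# alt_lookup / Sales_alt are hypothetical helper names.
alt_lookup = names2.rename(columns={"Name": "Alternative Name",
                                    "Sales": "Sales_alt"})
by_alt = names1.merge(alt_lookup, on=["Year", "Alternative Name"], how="left",
                      validate="many_to_one")

# Prefer the primary match and fall back to the alternative-name match.
# Both frames have one row per names1 row, in names1 order, so they align.
result = by_name.copy()
result["Sales"] = by_name["Sales"].fillna(by_alt["Sales_alt"])

print(result)
```

In this version the Name column keeps the spelling from names1 (James, Robert) rather than the one from names2, and a row that matches on neither key keeps NaN in Sales.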

Here is an example using two inner joins plus concat:

import pandas as pd

df1 = pd.DataFrame({
    'Year': (1998, 1995, 1956,),
    'Name': ('William', 'James', 'Robert'),
    'Alternative Name': ('Bill', 'Jim', 'Bob'),
    'Favorite Pet': ('Cat', 'Dog', 'Hamster'),
})

df2 = pd.DataFrame({
    'Year': (1998, 1995, 1956,),
    'Name': ('William', 'Jim', 'Bob'),
    'Sales': (2000, 3005, 6000),
})

# by Name
df = df1.drop(columns=['Alternative Name']).merge(df2, on=['Year', 'Name'])
# by Alternative Name
df1 = df1.drop(columns=['Name']).rename(columns={'Alternative Name': 'Name'})
# union
df = pd.concat([
    df,
    df2.merge(df1, on=['Year', 'Name'])
], sort=False)

print(df)
#    Year     Name Favorite Pet  Sales
# 0  1998  William          Cat   2000
# 0  1995      Jim          Dog   3005
# 1  1956      Bob      Hamster   6000
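
The "conditional" described in the question (try the regular name first, and only fall back to the alternative name when that fails) could also be sketched with the `indicator=True` option of `merge`, which adds a `_merge` column saying whether each row found a partner. This is a rough sketch rather than a definitive implementation: the variable names `first`, `matched`, `unmatched` and `second` are made up here, and it assumes each (Year, Name) pair occurs at most once in the sales table.

```python
import pandas as pd

# Sample data from the question.
df1 = pd.DataFrame({
    "Year": [1998, 1995, 1956],
    "Name": ["William", "James", "Robert"],
    "Alternative Name": ["Bill", "Jim", "Bob"],
    "Favorite Pet": ["Cat", "Dog", "Hamster"],
})
df2 = pd.DataFrame({
    "Year": [1998, 1995, 1956],
    "Name": ["William", "Jim", "Bob"],
    "Sales": [2000, 3005, 6000],
})

# Step 1: left merge on Year + Name and record which rows matched.
first = df1.merge(df2, on=["Year", "Name"], how="left",
                  indicator=True, validate="many_to_one")

# Rows that matched on the regular name are done.
matched = first[first["_merge"] == "both"].drop(
    columns=["_merge", "Alternative Name"])

# Step 2: retry only the unmatched rows, keyed on the alternative name.
# The boolean mask is positional, since `first` has one row per df1 row.
unmatched = df1.loc[(first["_merge"] == "left_only").to_numpy()]
second = (
    unmatched.drop(columns=["Name"])
    .rename(columns={"Alternative Name": "Name"})
    .merge(df2, on=["Year", "Name"], how="left")
)

# Step 3: stack the two groups; rows unmatched in both steps keep NaN Sales.
result = pd.concat([matched, second], ignore_index=True)

print(result)
```

Rows matched in the first step come first in the result, so the original order is not preserved. Rows retried in the second step carry the alternative name in the Name column.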
