简体   繁体   中英

Joining two dataframes where each row in one dataframe matches two rows in other dataframe

I have tried to match two data frames on three columns, namely "ticker", "year", and "quarter". While the first dataset contains two observations (ie two rows) for each ticker, year, quarter, the second only contains one row... especially, the first dataframe always captures two different speakers in the same quarter while the second dataframe only captures firm information. Now, I want to match the dataframe so that each of the two rows of the first data frame within the same firm quarter are merged with the one relevant row in the second dataframe.

My data looks as follows: df

ticker  year    quarter exec_lname  jobposition
XX      2009    3       A           CEO
XX      2009    3       B           CFO
XX      2009    4       A           CEO
XX      2009    4       B           CFO
YY      2007    1       C           CEO
YY      2007    1       D           CFO
YY      2007    2       C           CEO
YY      2007    2       D           CFO
ZZ      2008    3       F           CEO
ZZ      2008    3       G           CFO

dfnew

ticker  year    quarter eps calldate
XX      2009    3       x   Mar
XX      2009    4       y   Apr
YY      2007    1       z   Feb
YY      2007    2       a   Jan
ZZ      2008    3       b   Dec

At the end, it should look like this:

ticker  year    quarter exec_lname  jobposition eps calldate
XX      2009    3       A           CEO         x   Mar
XX      2009    3       B           CFO         x   Mar
XX      2009    4       A           CEO         y   Apr
XX      2009    4       B           CFO         y   Apr
YY      2007    1       C           CEO         z   Feb
YY      2007    1       D           CFO         z   Feb
YY      2007    2       C           CEO         a   Jan
YY      2007    2       D           CFO         a   Jan
ZZ      2008    3       F           CEO         b   Dec
ZZ      2008    3       G           CFO         b   Dec

I tried:

dfjoin = pd.merge(dfnew, df,  how='left', left_on=['ticker', "year", "quarter"], right_on = ['ticker', "year", "quarter"])

but it returns the new dataset with all the right rows and columns, yet with the columns eps and calldate completely filled with NaN. Might this be because I want to merge each row twice to df? The problem is not merging on more than one key - the problem might be that in the first dataframe I always have two rows with the same combination of ticker/year/quarter.

I hope someone can help me! Thank you! Julia

Use merge with default parameter how = 'inner'

df.merge(dfnew, on = ['ticker', 'year', 'quarter'])

You get

    ticker  year    quarter exec_lname  jobposition eps calldate
0   XX      2009    3       A           CEO         x   Mar
1   XX      2009    3       B           CFO         x   Mar
2   XX      2009    4       A           CEO         y   Apr
3   XX      2009    4       B           CFO         y   Apr
4   YY      2007    1       C           CEO         z   Feb
5   YY      2007    1       D           CFO         z   Feb
6   YY      2007    2       C           CEO         a   Jan
7   YY      2007    2       D           CFO         a   Jan
8   ZZ      2008    3       F           CEO         b   Dec
9   ZZ      2008    3       G           CFO         b   Dec

Just merge both data frames and specify the variable names in " by " section.

merge(df,dfnew,by=c("ticker", "year", "quarter"), all.x=TRUE)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM