
Inner and Outer merge in Pandas with indicator=True

Let's say that I have two DataFrames, df1 and df2. I can do an inner and an outer join in this way:

inner_df = df1.merge(df2, how="inner", left_on=col_df1, right_on=col_df2)
outer_df = df1.merge(df2, how="outer", left_on=col_df1, right_on=col_df2)

The DataFrame.merge method accepts an indicator parameter: if True, a column called "_merge" is added to the output DataFrame with information on the source of each row. This column takes the value "left_only" for observations whose merge key appears only in the left DataFrame, "right_only" for observations whose merge key appears only in the right DataFrame, and "both" if the observation's merge key is found in both.
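As a minimal sketch of what the indicator column looks like, here is a toy example. The sample data and the column names id_left, id_right, a and b are assumptions made up for illustration; they are not from the original question.

```python
import pandas as pd

# Hypothetical sample data: keys 2 and 3 exist in both frames,
# key 1 only in df1 and key 4 only in df2.
df1 = pd.DataFrame({"id_left": [1, 2, 3], "a": ["x", "y", "z"]})
df2 = pd.DataFrame({"id_right": [2, 3, 4], "b": [20.0, 30.0, 40.0]})

outer_df = df1.merge(
    df2, how="outer", left_on="id_left", right_on="id_right", indicator=True
)

# Count how many rows came from each source.
merge_counts = outer_df["_merge"].value_counts().to_dict()

print(outer_df)
print(merge_counts)
# id_left started as int64 but is float64 here, because the
# right_only row has no left key and is filled with NaN.
print(outer_df["id_left"].dtype)
```

Note that the key columns themselves already change dtype in the outer result, which matters for the equivalence question.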

I am not sure I have understood correctly what this parameter does. Here is my question: are these two pieces of code equivalent?

# First version: plain inner merge
inner_df = df1.merge(df2, how="inner", left_on=col_df1, right_on=col_df2)

# Second version: outer merge, then keep only the rows found in both frames
outer_df = df1.merge(df2, how="outer", left_on=col_df1, right_on=col_df2,
                     indicator=True)
inner_df2 = outer_df[outer_df["_merge"] == "both"].drop(columns=["_merge"])

The two approaches return the same rows, but not exactly the same DataFrames. The differences are:

  1. inner_df2 has an additional _merge column. That is fine, since it is trivial to get rid of it with ...drop(columns='_merge').
  2. The outer merge may have populated columns with NaN values for the unmatched rows. Any column with an integer type that received NaN has been converted to a floating point type. This is normally not a major problem, because once you select only the rows with no NaN values you can convert the columns back to their original type. It is a serious problem in one use case: a NumPy int64 column holding values that need more than 53 bits. A float64 has only a 53-bit significand, so the back and forth conversion loses the least significant bits. That leads to inaccurate values if they represent measurements, or worse if they are identifiers.
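The precision loss described in point 2 can be reproduced with a small sketch. The data below is hypothetical, and for brevity it merges on a shared column named key (with on="key") rather than the left_on/right_on pair used in the question; the column ident stands in for an identifier stored as int64.

```python
import pandas as pd

# An odd value above 2**53 cannot be represented exactly as a float64.
big = 2**53 + 3

# Hypothetical data: key 2 matches, key 1 is left-only, key 3 is right-only.
df1 = pd.DataFrame({"key": [1, 2], "ident": [10, big]})
df2 = pd.DataFrame({"key": [2, 3], "other": ["p", "q"]})

# First approach: direct inner merge, ident stays int64.
inner_df = df1.merge(df2, how="inner", on="key")

# Second approach: outer merge, then filter on the indicator column.
# The right-only row puts NaN into ident, so the column becomes float64.
outer_df = df1.merge(df2, how="outer", on="key", indicator=True)
inner_df2 = outer_df[outer_df["_merge"] == "both"].drop(columns=["_merge"])

# Converting back to int64 works, but the value was already rounded.
restored = inner_df2["ident"].astype("int64")

print(inner_df["ident"].dtype, inner_df2["ident"].dtype)
print(int(inner_df["ident"].iloc[0]), int(restored.iloc[0]))
```

The two printed integers differ, which is the situation where the filtered outer merge is not a safe substitute for the inner merge.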

Long story short: whether the two are equivalent actually depends on the real use case.
