Inner and Outer merge in Pandas with indicator=True

Question

Let's say that I have two dataframes, df1 and df2. I can do an inner and an outer join like this:

inner_df = df1.merge(df2, how="inner", left_on=col_df1, right_on=col_df2)
outer_df = df1.merge(df2, how="outer", left_on=col_df1, right_on=col_df2)

The DataFrame.merge method accepts an indicator parameter: if True, a column called "_merge" is added to the output DataFrame with information on the source of each row. This column takes the value "left_only" for observations whose merge key appears only in the left DataFrame, "right_only" for observations whose merge key appears only in the right DataFrame, and "both" if the observation's merge key is found in both.
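As a small illustration of what that column looks like, here is a minimal sketch. The sample frames and the column names key1, key2, a and b are made up for the example and are not part of the original question.

```python
import pandas as pd

# Hypothetical sample data: key 1 exists only on the left, key 4 only on the right.
df1 = pd.DataFrame({"key1": [1, 2, 3], "a": ["x", "y", "z"]})
df2 = pd.DataFrame({"key2": [2, 3, 4], "b": [20, 30, 40]})

outer_df = df1.merge(df2, how="outer", left_on="key1", right_on="key2",
                     indicator=True)

# "_merge" is a categorical column; count how many rows fall in each category.
counts = outer_df["_merge"].value_counts().to_dict()

print(outer_df)
print(counts)
```

With this data, two keys (2 and 3) are found in both frames, one key only on the left and one only on the right.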

I am not sure I have understood correctly what this parameter does. Here is my question: are these two pieces of code equivalent?

inner_df = df1.merge(df2, how="inner", left_on=col_df1, right_on=col_df2)

outer_df = df1.merge(df2, how="outer", left_on=col_df1, right_on=col_df2,
                     indicator=True)
inner_df2 = outer_df[outer_df['_merge'] == 'both'].drop(columns=["_merge"])

Answer 1

The two approaches return the same rows, but not exactly the same dataframes. The differences are:

inner_df2 (the result filtered from the outer merge) has an additional _merge column, but it is trivial to get rid of it with ...drop(columns='_merge').
The outer merge may have populated some columns with NaN values. If such a column had an integer type, it has been converted to a floating point type. This is normally not a major problem, because once you select only the rows with no NaN values you can convert the column back to its original type. It is a serious problem in one use case: if you have a numpy int64 column with values that need more than 53 bits. In that case, the round trip through float64 loses the least significant bits, because the value is rounded to the nearest representable float. That leads to inaccurate values if they represent measurements, or worse if they are identifiers.
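The integer-to-float issue can be reproduced with a small sketch. The data below is hypothetical (the columns key1, key2, id and val are invented for the example), and it assumes the default numpy-backed int64 dtype rather than pandas' nullable integer types.

```python
import pandas as pd

# 2**53 + 1 fits in an int64 but cannot be represented exactly as a float64.
big = 2**53 + 1

# Hypothetical data: "id" is an int64 column on the left frame.
df1 = pd.DataFrame({"key1": [1, 2], "id": [big, 7]})
df2 = pd.DataFrame({"key2": [1, 3], "val": ["p", "q"]})

# Direct inner merge: no missing values, so "id" stays int64.
inner_df = df1.merge(df2, how="inner", left_on="key1", right_on="key2")

# Outer merge then filter: the unmatched right row puts NaN in "id",
# which forces the whole column to float64 before the filter is applied.
outer_df = df1.merge(df2, how="outer", left_on="key1", right_on="key2",
                     indicator=True)
inner_df2 = outer_df[outer_df["_merge"] == "both"].drop(columns=["_merge"])

print(inner_df["id"].dtype)
print(inner_df2["id"].dtype)

# Converting back to int64 does not recover the original value.
restored = inner_df2["id"].astype("int64")
print(int(restored.iloc[0]) == big)
```

Here the filtered frame contains the same single matching row as the direct inner merge, but its id value no longer equals the original after the conversion back to int64.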

Long story short: whether the two are equivalent actually depends on the real use case.

1 answers

solution1
1 ACCEPTED 2020-03-13 11:18:03
