Merging two pandas dataframes with two-term index returns non-unique keys

EDIT

I wrote this post thinking that the issue was in merge() or join(), but it was actually in the results returned by groupby(). If you found this post, there is a chance that you are getting the same error for the same reason, so I left the title unchanged.

Original post

I have two pandas dataframes that contain three columns each. The types are:

A: category
B: uint32
C: uint32

I group them by the first two columns and apply a function, like this:

df1 = df1.groupby(["A", "B"]).apply(my_function, meta={"Res_1":"uint32", "Res_2":"float32", "Res_3":"float32"})
df2 = df2.groupby(["A", "B"]).apply(my_function, meta={"Res_1":"uint32", "Res_2":"float32", "Res_3":"float32"})

The resulting two dataframes have three columns, and an index composed of two terms (originally, the A and B columns). They look like this:

                          Res_1       Res_2       Res_3
A        B                                   
chrA01   1                    0    0.000000    0.000000
         5001                 0    0.000000    0.000000
         35001             2656    0.967225   21.346008
         55001              261    1.000000   27.003832
chrC01   1                  131    0.411950    8.610687
...                         ...         ...         ...
         10001                0    0.000000    0.000000
chrA01   30001             1511    1.000000   25.416943
         90001             1407    1.000000   25.073915
chrC01   30001                0    0.000000    0.000000
         90001                0    0.000000    0.000000

I then want to merge them into one dataframe, using the union of the df1 and df2 indexes, so I pass how="outer" together with on=["A", "B"].

df = pd.merge(df1, df2, how="outer", on=["A", "B"], validate="one_to_one")

However, because I pass validate="one_to_one", I get this error:

pandas.errors.MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge

I know that the keys should be unique, because I have assessed the generation of the two dataframes and their content.

Maybe I am calling merge() incorrectly? I suspect the way I specify the on=... option. Is there a way to merge on the index itself, even when the index has two levels?
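For reference, here is a minimal sketch of merging on a two-level index by passing left_index=True and right_index=True instead of on=.... The frames, values, and suffixes below are made up for illustration and are not the data from the question.

```python
import pandas as pd

# Hypothetical frames with a two-level index named A and B, as in the question.
idx1 = pd.MultiIndex.from_tuples([("chrA01", 1), ("chrA01", 5001)], names=["A", "B"])
idx2 = pd.MultiIndex.from_tuples([("chrA01", 1), ("chrC01", 1)], names=["A", "B"])
left = pd.DataFrame({"Res_1": [0, 10]}, index=idx1)
right = pd.DataFrame({"Res_1": [5, 7]}, index=idx2)

# Merge on the index of both frames; how="outer" keeps the union of the keys.
# validate="one_to_one" raises MergeError if either side has duplicate keys.
merged = pd.merge(
    left,
    right,
    how="outer",
    left_index=True,
    right_index=True,
    suffixes=("_df1", "_df2"),
    validate="one_to_one",
)
print(merged)
```

Since the index levels are named, on=["A", "B"] should also resolve to them, so the on=... form in the question is probably not the cause of the error by itself.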

Following the suggestions to check whether the indexes are unique, I found the issue. When performing groupby() on both A and B, the function called with apply() returned one row with the right results and one row full of NaN values. The reason is yet to be determined.

Because of the odd output ordering, these two rows were not adjacent in the dataframes, so I did not notice the NaN rows when writing this post.

After generating the dataframes, I now run df.dropna(how="all") on each of them, and the duplicated index entries are gone. I feel this is not a clean solution, as those NaN rows should not be there in the first place, but for now it works as a patch.
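A quick way to surface such hidden rows is to ask the index whether it is unique and then list every row whose key occurs more than once. This is only a sketch on made-up data that mimics the situation described above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the key (chrA01, 1) appears twice, once with real
# results and once as an all-NaN row that is not adjacent to the first.
idx = pd.MultiIndex.from_tuples(
    [("chrA01", 1), ("chrC01", 1), ("chrA01", 1)], names=["A", "B"]
)
df = pd.DataFrame(
    {"Res_1": [0.0, 131.0, np.nan], "Res_2": [0.0, 0.41195, np.nan]},
    index=idx,
)

# False as soon as any (A, B) pair is repeated.
is_unique = df.index.is_unique

# keep=False marks every occurrence of a repeated key, not just the later ones;
# sorting places the repeated keys next to each other for inspection.
dupes = df[df.index.duplicated(keep=False)].sort_index()
print(is_unique)
print(dupes)
```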
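The patch can be sketched as a small helper that drops the all-NaN rows and then confirms that the index really is unique before merging. The helper name drop_all_nan_rows and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_all_nan_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows whose columns are all NaN, then check the index is unique."""
    # dropna looks at column values only; the index itself is not inspected.
    cleaned = df.dropna(how="all")
    if not cleaned.index.is_unique:
        raise ValueError("index still has duplicate keys after dropping all-NaN rows")
    return cleaned


# Hypothetical frame with one spurious all-NaN row for the key (chrA01, 1).
idx = pd.MultiIndex.from_tuples(
    [("chrA01", 1), ("chrA01", 1), ("chrC01", 1)], names=["A", "B"]
)
raw = pd.DataFrame(
    {"Res_1": [0.0, np.nan, 131.0], "Res_2": [0.0, np.nan, 0.41195]},
    index=idx,
)

cleaned = drop_all_nan_rows(raw)
print(cleaned)
```

Note that dropna(how="all") would also remove a legitimate group whose results are all NaN, so it remains a workaround until the cause inside the applied function is understood.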
