EDIT
I wrote this post thinking that the issue was in merge() or join(); however, the issue was in the results obtained from groupby(). If you found this post, there is a chance that you're getting the same error for the same reason. Hence, I left the title unchanged.
Original post
I have two pandas dataframes that contain three columns each. The types are:
A: category
B: uint32
C: uint32
I group them by the first two columns and apply a function, like this:
df1 = df1.groupby(["A", "B"]).apply(my_function, meta={"Res_1":"uint32", "Res_2":"float32", "Res_3":"float32"})
df2 = df2.groupby(["A", "B"]).apply(my_function, meta={"Res_1":"uint32", "Res_2":"float32", "Res_3":"float32"})
The resulting two dataframes have three columns, and an index composed of two terms (originally, the A and B columns). They look like this:
Res_1 Res_2 Res_3
A B
chrA01 1 0 0.000000 0.000000
5001 0 0.000000 0.000000
35001 2656 0.967225 21.346008
55001 261 1.000000 27.003832
chrC01 1 131 0.411950 8.610687
... ... ... ...
10001 0 0.000000 0.000000
chrA01 30001 1511 1.000000 25.416943
90001 1407 1.000000 25.073915
chrC01 30001 0 0.000000 0.000000
90001 0 0.000000 0.000000
I then want to merge them into one dataframe, using a union of the df1 and df2 indexes, so I use the how="outer" option together with on=["A", "B"].
df = pd.merge(df1, df2, how="outer", on=["A", "B"], validate="one_to_one")
However, since I am passing validate="one_to_one", I get this error:
pandas.errors.MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge
I know that the keys should be unique, because I have assessed the generation of the two dataframes and their content.
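One way to verify that assumption directly is to ask pandas which (A, B) keys are repeated. Below is a minimal sketch with made-up data (the values and keys are illustrative, not from my actual dataframes), using Index.duplicated() to surface the rows that break validate="one_to_one":

```python
import pandas as pd

# Hypothetical frame with a two-level (A, B) index containing a repeated key
df1 = pd.DataFrame(
    {"Res_1": [0, 2656, 0], "Res_2": [0.0, 0.967225, 0.0]},
    index=pd.MultiIndex.from_tuples(
        [("chrA01", 1), ("chrA01", 35001), ("chrA01", 1)], names=["A", "B"]
    ),
)

# keep=False marks every occurrence of a duplicated key, not just the later ones
dupes = df1[df1.index.duplicated(keep=False)]
print(dupes)                 # the two rows sharing the key ("chrA01", 1)
print(df1.index.is_unique)   # False
```

If df.index.is_unique prints True for both dataframes, the one-to-one validation should pass.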
Maybe I am calling merge() wrongly? My suspicion is the way I specify the on=... option. Is there a way I can specify on=index even if it is an index with two levels?
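For reference, pandas can merge on the index itself, including a two-level MultiIndex, via left_index=True and right_index=True instead of on=.... A minimal sketch with made-up data:

```python
import pandas as pd

idx1 = pd.MultiIndex.from_tuples([("chrA01", 1), ("chrA01", 5001)], names=["A", "B"])
idx2 = pd.MultiIndex.from_tuples([("chrA01", 1), ("chrC01", 1)], names=["A", "B"])
df1 = pd.DataFrame({"Res_1": [0, 7]}, index=idx1)
df2 = pd.DataFrame({"Res_1": [131, 3]}, index=idx2)

# Outer merge on the two-level index; suffixes disambiguate the shared column name
df = pd.merge(df1, df2, how="outer",
              left_index=True, right_index=True,
              validate="one_to_one", suffixes=("_1", "_2"))
print(df)  # three rows: the union of the two indexes
```

This only sidesteps the on= question, though; validate="one_to_one" still raises MergeError if either index contains duplicate keys.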
After the suggestions to look into the indexes and check them for unique keys, I found the issue. When performing groupby() on both A and B, the function called with apply() returned one row with the right results and one row full of NaN values. The reason is yet to be determined.
Due to a weird output sorting, these two rows were not adjacent in the dataframes, hence I did not notice the all-NaN rows when writing this post.
After generating the dataframes, I now run df.dropna(how="all") on each, and the duplicated indexes are gone. I feel like this is not a clean solution, as those NaN rows should not be there in the first place, but for now this patch works.
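In code, the patch looks like this. The sketch below uses a hypothetical frame reproducing the symptom: a duplicated (A, B) key where one of the two rows is entirely NaN:

```python
import numpy as np
import pandas as pd

# Duplicated key ("chrA01", 1): one real row, one all-NaN row from the bad apply()
idx = pd.MultiIndex.from_tuples(
    [("chrA01", 1), ("chrA01", 1)], names=["A", "B"]
)
df = pd.DataFrame({"Res_1": [0.0, np.nan], "Res_2": [0.5, np.nan]}, index=idx)

# how="all" drops only rows where every column is NaN,
# so valid rows with a partial NaN are kept
df = df.dropna(how="all")
print(df.index.is_unique)  # True
```

Note that this removes the duplicates only because the spurious rows happen to be all-NaN; it is a workaround, not a fix for whatever makes apply() emit them.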