Merging two pandas dataframes with two-term index returns non-unique keys
EDIT
I wrote this post thinking that the issue was in merge() or join(); however, the issue was in the results obtained from groupby(). If you found this post, there is a chance that you are getting the same error for the same reason. Hence, I left the title unchanged.
Original post
I have two pandas dataframes that contain three columns each. The types are:
A: category
B: uint32
C: uint32
I group them by the first two columns and apply a function, like this:
df1 = df1.groupby(["A", "B"]).apply(my_function, meta={"Res_1":"uint32", "Res_2":"float32", "Res_3":"float32"})
df2 = df2.groupby(["A", "B"]).apply(my_function, meta={"Res_1":"uint32", "Res_2":"float32", "Res_3":"float32"})
The two resulting dataframes have three columns each and a two-level index (built from the original A and B columns). They look like this:
                Res_1     Res_2      Res_3
A      B
chrA01 1            0  0.000000   0.000000
       5001         0  0.000000   0.000000
       35001     2656  0.967225  21.346008
       55001      261  1.000000  27.003832
chrC01 1          131  0.411950   8.610687
...               ...       ...        ...
       10001        0  0.000000   0.000000
chrA01 30001     1511  1.000000  25.416943
       90001     1407  1.000000  25.073915
chrC01 30001        0  0.000000   0.000000
       90001        0  0.000000   0.000000
I then want to merge them into one dataframe, using the union of the df1 and df2 indexes, so I pass how="outer" together with on=["A", "B"]:
df = pd.merge(df1, df2, how="outer", on=["A", "B"], validate="one_to_one")
However, because I pass validate="one_to_one", I get this error:
pandas.errors.MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge
I know that the keys should be unique, because I have checked how the two dataframes are generated and what they contain.
Maybe I am doing the merge() wrongly? I suspect the way I specify the on=... option. Is there a way I can specify on=index even if it is an index with two terms?
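On the on=index question, here is a minimal sketch, assuming plain pandas and small made-up frames (the values and the suffixes are illustrative, not taken from the data above). pd.merge() accepts left_index=True and right_index=True, which joins on the whole index of each frame, including a two-level one. In recent pandas versions, on=["A", "B"] can also refer to index level names, so the original call was probably not the problem.

```python
import pandas as pd

# Hypothetical toy data with a two-level (A, B) index, mimicking the groupby output.
idx1 = pd.MultiIndex.from_tuples([("chrA01", 1), ("chrA01", 5001)], names=["A", "B"])
idx2 = pd.MultiIndex.from_tuples([("chrA01", 1), ("chrC01", 1)], names=["A", "B"])
df1 = pd.DataFrame({"Res_1": [0, 10]}, index=idx1)
df2 = pd.DataFrame({"Res_1": [5, 131]}, index=idx2)

# Join on the full index of both frames; "outer" keeps the union of the keys.
merged = pd.merge(
    df1,
    df2,
    how="outer",
    left_index=True,
    right_index=True,
    suffixes=("_df1", "_df2"),  # illustrative suffixes for the clashing column names
    validate="one_to_one",      # raises MergeError if a key repeats on either side
)
print(merged)
```

Keys present in only one frame get NaN in the other frame's columns.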
After the suggestions to look into the indexes and their uniqueness, I found the issue. When performing groupby() on both A and B, the function called with apply() returned, for the same key, one row with the right results and one row full of NaN values. The reason is yet to be determined.
Because of the odd output ordering, these two rows were not adjacent in the dataframes. Hence I did not notice the extra all-NaN rows when writing this post.
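A check along these lines would have exposed the problem earlier. This is a sketch on a hypothetical frame (made-up values), in which one (A, B) key appears twice and the two occurrences are not adjacent:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: the key ("chrA01", 1) appears twice, once as an all-NaN row,
# and the two occurrences are separated by other rows.
idx = pd.MultiIndex.from_tuples(
    [("chrA01", 1), ("chrA01", 5001), ("chrC01", 1), ("chrA01", 1)],
    names=["A", "B"],
)
df = pd.DataFrame(
    {
        "Res_1": [0.0, 2656.0, 131.0, np.nan],
        "Res_2": [0.0, 0.967225, 0.411950, np.nan],
    },
    index=idx,
)

# False as soon as any (A, B) pair repeats.
print(df.index.is_unique)

# keep=False flags every occurrence of a repeated key, not only the later ones;
# sorting the index then puts the repeated rows next to each other.
dupes = df[df.index.duplicated(keep=False)].sort_index()
print(dupes)
```

Sorting matters here: without sort_index() the repeated rows stay wherever they were and are easy to miss in a truncated printout.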
After generating the dataframes, I now run df.dropna(how="all") on each of them, and the duplicated index entries are gone. I feel like this is not a clean solution, as those NaN rows should not be there in the first place, but for now this patch works.
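The workaround can be sketched as follows, assuming plain pandas and toy data. The helper name drop_empty_rows and all values are hypothetical. The first merge fails because of the all-NaN duplicate; after the cleanup the same merge passes the one_to_one validation.

```python
import numpy as np
import pandas as pd


def drop_empty_rows(df):
    """Drop rows in which every column is NaN (hypothetical helper)."""
    return df.dropna(how="all")


# Hypothetical left frame: ("chrA01", 1) occurs twice, once as an all-NaN row.
left = pd.DataFrame(
    {"Res_1": [0.0, 131.0, np.nan]},
    index=pd.MultiIndex.from_tuples(
        [("chrA01", 1), ("chrC01", 1), ("chrA01", 1)], names=["A", "B"]
    ),
)
right = pd.DataFrame(
    {"Res_1": [7.0]},
    index=pd.MultiIndex.from_tuples([("chrA01", 1)], names=["A", "B"]),
)

# Before the cleanup, validation rejects the repeated key on the left side.
try:
    pd.merge(left, right, how="outer", left_index=True, right_index=True,
             validate="one_to_one")
    failed = False
except pd.errors.MergeError as err:
    failed = True
    print("before cleanup:", err)

# After dropping the all-NaN rows the keys are unique and the merge goes through.
merged = pd.merge(
    drop_empty_rows(left),
    right,
    how="outer",
    left_index=True,
    right_index=True,
    validate="one_to_one",
)
print(merged)
```

Note that dropna(how="all") also removes any legitimate row whose results are all NaN, so it hides the symptom rather than explaining why groupby() produced the extra rows.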