Merging two Pandas dataframes with a two-term index returns non-unique keys

Question

EDIT

I wrote this post thinking that the issue was in merge() or join(), but it was actually in the results obtained from groupby(). If you found this post, there is a chance that you are getting the same error for the same reason, so I have left the title unchanged.

Original post

I have two pandas dataframes that contain three columns each. The column types are:

A: category
B: uint32
C: uint32

I group them by the first two columns and apply a function, like this:

df1 = df1.groupby(["A", "B"]).apply(my_function, meta={"Res_1":"uint32", "Res_2":"float32", "Res_3":"float32"})
df2 = df2.groupby(["A", "B"]).apply(my_function, meta={"Res_1":"uint32", "Res_2":"float32", "Res_3":"float32"})

The two resulting dataframes have three columns and an index with two levels (originally the A and B columns). They look like this:

                          Res_1       Res_2       Res_3
A        B                                   
chrA01   1                    0    0.000000    0.000000
         5001                 0    0.000000    0.000000
         35001             2656    0.967225   21.346008
         55001              261    1.000000   27.003832
chrC01   1                  131    0.411950    8.610687
...                         ...         ...         ...
         10001                0    0.000000    0.000000
chrA01   30001             1511    1.000000   25.416943
         90001             1407    1.000000   25.073915
chrC01   30001                0    0.000000    0.000000
         90001                0    0.000000    0.000000

I then want to merge them into one dataframe, using the union of the df1 and df2 indexes, so I use how="outer" together with on=["A", "B"].

df = pd.merge(df1, df2, how="outer", on=["A", "B"], validate="one_to_one")

However, because I pass validate="one_to_one", I get this error:

pandas.errors.MergeError: Merge keys are not unique in either left or right dataset; not a one-to-one merge

I believe the keys should be unique, because I have checked how the two dataframes are generated and inspected their content.

Maybe I am calling merge() incorrectly? I suspect the problem is in the way I specify the on=... option. Is there a way to merge on the index even when it is an index with two terms?

Answer 1

Following the suggestions to look into the indexes and their uniqueness, I found the issue. When performing groupby() on both A and B, the function called with apply() returned one row with the right results and one row full of NaN values. The reason is yet to be determined.
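The check described above can be sketched roughly as follows. This is a minimal illustration on made-up data, not the original dataframes: the values and the duplicated key ("chrA01", 1) are assumptions chosen to mimic one valid row plus one all-NaN row.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the key ("chrA01", 1) appears twice,
# once with real values and once as an all-NaN row.
index = pd.MultiIndex.from_tuples(
    [("chrA01", 1), ("chrA01", 5001), ("chrA01", 1)], names=["A", "B"]
)
df1 = pd.DataFrame(
    {
        "Res_1": [0.0, 261.0, np.nan],
        "Res_2": [0.0, 1.0, np.nan],
        "Res_3": [0.0, 27.003832, np.nan],
    },
    index=index,
)

# Quick yes/no check on the whole index
print(df1.index.is_unique)

# Show every row whose index entry occurs more than once;
# sorting by index puts the duplicates next to each other.
dup_rows = df1[df1.index.duplicated(keep=False)].sort_index()
print(dup_rows)
```

Sorting the duplicated rows by index is what makes the NaN rows visible, since in the unsorted output they can be far apart from their valid counterparts.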

Because of the odd ordering of the output, these two rows were not adjacent in the dataframes, so I did not notice the extra NaN rows when writing this post.

After generating the dataframes, I now run df.dropna(how="all") on each of them, and the duplicated index entries are gone. This does not feel like a clean solution, since those NaN rows should not be there in the first place, but for now it works as a patch.
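As a rough sketch of this workaround, assuming made-up data: the helper make_frame below is hypothetical and only imitates a groupby result in which every key has one valid row and one all-NaN row. The merge uses left_index=True and right_index=True, which is one way to merge on a two-level index; the suffixes are chosen only for readability.

```python
import numpy as np
import pandas as pd

def make_frame(valid_rows):
    """Hypothetical stand-in for the groupby output:
    each key gets one valid row plus one all-NaN row."""
    keys = list(valid_rows)
    index = pd.MultiIndex.from_tuples(keys + keys, names=["A", "B"])
    values = [valid_rows[k] for k in keys] + [[np.nan] * 3 for _ in keys]
    return pd.DataFrame(values, index=index, columns=["Res_1", "Res_2", "Res_3"])

df1 = make_frame({("chrA01", 1): [0, 0.0, 0.0],
                  ("chrA01", 35001): [2656, 0.967225, 21.346008]})
df2 = make_frame({("chrA01", 1): [131, 0.41195, 8.610687],
                  ("chrC01", 30001): [0, 0.0, 0.0]})

# Drop rows in which every column is NaN; this removes the duplicate index entries
df1 = df1.dropna(how="all")
df2 = df2.dropna(how="all")

# Outer merge on the (A, B) index; validate now passes because keys are unique
merged = pd.merge(
    df1, df2,
    how="outer",
    left_index=True, right_index=True,
    suffixes=("_df1", "_df2"),
    validate="one_to_one",
)
print(merged)
```

Note that dropna(how="all") also removes any row that is legitimately all NaN, so it is worth confirming that such rows cannot occur in the real data.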

Answer 1: 0, accepted, 2020-04-01 13:18:43
