How to generate a list of items not shared between two dataframes

Question

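Basically, I have a single list of unique items (Items), each categorized by color. I do some processing and generate a dataframe of selected combinations of these unique items (Combinations). My goal is to list the items from the original list that do not appear in the Combinations dataframe. Ideally I want to check all four color columns, but for my initial test I selected only the 'Red' column.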

import pandas as pd

Items = pd.DataFrame({'Id': ["6917529336306454104","6917529268375577150","6917529175831101427","6917529351156928903","6917529249201580539","6917529246740186376","6917529286870790429","6917529212665335174","6917529206310658443","6917529207434353786","6917529309798817021","6917529352287607192","6917529268327711171","6917529316674574229"
],'Type': ['Red','Blue','Green','Cyan','Red','Blue','Blue','Blue','Blue','Green','Green','Green','Cyan','Cyan']})

Items = Items.set_index('Id', drop=True)

#Do stuff

Combinations = pd.DataFrame({
    'Red':  ["6917529336306454104","6917529336306454104","6917529336306454104","6917529336306454104"],
    'Blue': ["6917529268375577150","6917529286870790429","6917529206310658443","6917529206310658443"],
    'Green': ["6917529175831101427","6917529207434353786","6917529309798817021","6917529309798817021"],
    'Cyan': ["6917529351156928903","6917529268327711171","6917529351156928903","6917529268327711171"],
    'Other': [12,15,18,32]
})

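My first attempt was the line below, but it raises the runtime error "KeyError: 'Id'". A forum post suggested that drop=True in set_index might fix it, but that does not seem to work in my case.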

UnusedItems = ~Items[Items['Id'].isin(list(Combinations['Red']))]

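I tried to work around it with the line below. When executed, it produces an empty dataframe. By inspection, item 6917529249201580539 should be returned when considering the 'Red' column. Considering all the combination columns, items 6917529249201580539, 6917529246740186376, 6917529212665335174 and 6917529316674574229 should be returned as unused.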

UnusedItems = ~Items[Items.iloc[:,0].isin(list(Combinations['Red']))]

I would appreciate any ideas or guidance. Thanks.

Answer 1

One option is to take the first 4 columns of Combinations with iloc and use stack to reshape them into long format:

(Combinations.iloc[:, :4].stack()
 .droplevel(0).rename_axis(index='Type').reset_index(name='Id'))

     Type                   Id
0     Red  6917529336306454104
1    Blue  6917529268375577150
2   Green  6917529175831101427
3    Cyan  6917529351156928903
4     Red  6917529336306454104
5    Blue  6917529286870790429
6   Green  6917529207434353786
7    Cyan  6917529268327711171
8     Red  6917529336306454104
9    Blue  6917529206310658443
10  Green  6917529309798817021
11   Cyan  6917529351156928903
12    Red  6917529336306454104
13   Blue  6917529206310658443
14  Green  6917529309798817021
15   Cyan  6917529268327711171
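
The same long-format reshape can also be sketched by selecting the color columns by name instead of by position. This is only an illustration: the color_cols list is an assumption about which columns hold Ids, and melt returns the rows column by column rather than row by row as stack does, so the row order differs from the output above.

```python
import pandas as pd

Combinations = pd.DataFrame({
    'Red':   ["6917529336306454104"] * 4,
    'Blue':  ["6917529268375577150", "6917529286870790429",
              "6917529206310658443", "6917529206310658443"],
    'Green': ["6917529175831101427", "6917529207434353786",
              "6917529309798817021", "6917529309798817021"],
    'Cyan':  ["6917529351156928903", "6917529268327711171",
              "6917529351156928903", "6917529268327711171"],
    'Other': [12, 15, 18, 32],
})

# Assumption: these four columns are the only ones that hold item Ids.
color_cols = ['Red', 'Blue', 'Green', 'Cyan']

# One row per (Type, Id) pair; the 'Other' column is left out entirely.
long_form = Combinations.melt(value_vars=color_cols,
                              var_name='Type', value_name='Id')
print(long_form)
```

Selecting by name avoids depending on the column order that iloc[:, :4] relies on.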

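Then perform an anti-join with Items: reset_index to get the 'Id' column back from the index, merge with an indicator column, query to filter out the values present in both frames, then drop the indicator column: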

UnusedItems = Items.reset_index().merge(
    Combinations.iloc[:, :4].stack()
        .droplevel(0).rename_axis(index='Type').reset_index(name='Id'),
    how='outer',
    indicator='I').query('I != "both"').drop(columns='I')  # positional axis argument was removed in pandas 2.0

UnusedItems:

                     Id   Type
8   6917529249201580539    Red
9   6917529246740186376   Blue
11  6917529212665335174   Blue
17  6917529352287607192  Green
20  6917529316674574229   Cyan
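
For comparison, here is a minimal sketch of the boolean-mask approach that the question was attempting. After set_index('Id'), 'Id' is the index rather than a column, which is why Items['Id'] raises KeyError; Items.iloc[:, 0] is the 'Type' column, so its isin test never matches an Id. The ~ also has to negate the mask, not the filtered frame. The color_cols list is an assumption about which columns hold Ids.

```python
import pandas as pd

Items = pd.DataFrame({
    'Id': ["6917529336306454104", "6917529268375577150", "6917529175831101427",
           "6917529351156928903", "6917529249201580539", "6917529246740186376",
           "6917529286870790429", "6917529212665335174", "6917529206310658443",
           "6917529207434353786", "6917529309798817021", "6917529352287607192",
           "6917529268327711171", "6917529316674574229"],
    'Type': ['Red', 'Blue', 'Green', 'Cyan', 'Red', 'Blue', 'Blue', 'Blue',
             'Blue', 'Green', 'Green', 'Green', 'Cyan', 'Cyan'],
}).set_index('Id')

Combinations = pd.DataFrame({
    'Red':   ["6917529336306454104"] * 4,
    'Blue':  ["6917529268375577150", "6917529286870790429",
              "6917529206310658443", "6917529206310658443"],
    'Green': ["6917529175831101427", "6917529207434353786",
              "6917529309798817021", "6917529309798817021"],
    'Cyan':  ["6917529351156928903", "6917529268327711171",
              "6917529351156928903", "6917529268327711171"],
    'Other': [12, 15, 18, 32],
})

# Assumption: these four columns are the only ones that hold item Ids.
color_cols = ['Red', 'Blue', 'Green', 'Cyan']

# Flatten every Id that appears in any color column into one array.
used_ids = Combinations[color_cols].to_numpy().ravel()

# Test the index (where 'Id' now lives) and negate the mask itself.
UnusedItems = Items[~Items.index.isin(used_ids)]
print(UnusedItems)
```

This keeps 'Id' as the index and preserves the original row order of Items.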

Answer 2

Use .melt() on Combinations, then convert both to sets and subtract:

set(Items.index) - set(Combinations.melt().value)
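
The set difference yields only the Ids, in no particular order. A possible follow-up, sketched under the assumption that only the four color columns should be compared, restricts the melt to those columns and looks the rows up again in Items:

```python
import pandas as pd

Items = pd.DataFrame({
    'Id': ["6917529336306454104", "6917529268375577150", "6917529175831101427",
           "6917529351156928903", "6917529249201580539", "6917529246740186376",
           "6917529286870790429", "6917529212665335174", "6917529206310658443",
           "6917529207434353786", "6917529309798817021", "6917529352287607192",
           "6917529268327711171", "6917529316674574229"],
    'Type': ['Red', 'Blue', 'Green', 'Cyan', 'Red', 'Blue', 'Blue', 'Blue',
             'Blue', 'Green', 'Green', 'Green', 'Cyan', 'Cyan'],
}).set_index('Id')

Combinations = pd.DataFrame({
    'Red':   ["6917529336306454104"] * 4,
    'Blue':  ["6917529268375577150", "6917529286870790429",
              "6917529206310658443", "6917529206310658443"],
    'Green': ["6917529175831101427", "6917529207434353786",
              "6917529309798817021", "6917529309798817021"],
    'Cyan':  ["6917529351156928903", "6917529268327711171",
              "6917529351156928903", "6917529268327711171"],
    'Other': [12, 15, 18, 32],
})

# Assumption: 'Other' holds unrelated numbers, so it is excluded from the comparison.
color_cols = ['Red', 'Blue', 'Green', 'Cyan']

# Ids in Items that never appear in any color column of Combinations.
unused_ids = set(Items.index) - set(Combinations[color_cols].melt()['value'])

# Sets are unordered, so sort before using the Ids as row labels.
UnusedItems = Items.loc[sorted(unused_ids)]
print(UnusedItems)
```

Sorting is only there to make the output deterministic; any ordering of the labels works with Items.loc.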

2 solutions

Solution 1
1 2021-06-06 22:44:02

Solution 2
1 Accepted 2021-06-07 05:09:35
