Eliminating column duplicates in a Pandas DataFrame

Question

I have a data frame for which I am trying to find all possible combinations of itself and a fraction of itself. The following data frame is a much scaled-down version of the one I am running. The first data frame (fruit1) is a fraction of the second data frame (fruit2).

FruitSubDF     FruitFullDF
apple           apple
cherry          cherry
banana          banana
                peach
                 plum

By running the following code

 from itertools import product
 df1 = pd.DataFrame(list(product(fruitDF.iloc[0:3,0], fruitDF.iloc[0:5,0])), columns=['fruit1', 'fruit2'])

the output is

    Fruit1 Fruit2
0    apple  banana
1    apple  apple
2    apple  cherry
3    apple  peach
4    apple  plum
5   cherry banana
6   cherry apple
7   cherry cherry
.
.
18   banana banana
19   banana peach
20   banana plum

My problem is that I want to remove rows containing the same two fruits regardless of which fruit is in which column, as below. So I am treating (apple, cherry) and (cherry, apple) as the same, but I am unsure of an efficient way, other than iterrows, to weed out the unwanted data, since most pandas functions I find remove rows based on the order.

    Fruit1 Fruit2
 0   apple banana
 1   apple cherry
 2   apple apple
 3   apple peach
 4   apple plum
 5  cherry banana
 6  cherry cherry
 .
 .
 15  banana plum

Answer 1

First, I created a piece of code to replicate your DataFrame. I took my code from here: Stack Overflow

import pandas as pd


Fruit1=['apple', 'cherry', 'banana']
Fruit2=['banana', 'apple', 'cherry']



index = pd.MultiIndex.from_product([Fruit1, Fruit2], names = ["Fruit1", "Fruit2"])

df = pd.DataFrame(index = index).reset_index()

Then, you can use lexicographical order to filter the DataFrame.

df[df['Fruit1']<=df['Fruit2']]

This gives the result you wanted to obtain.

EDIT: you edited your post, but this still seems to do the job.
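One caveat with the `<=` filter: it drops every row where Fruit1 sorts after Fruit2, even when the mirrored row was never generated. With the question's data this does no harm, because peach and plum both sort after apple, banana and cherry. As an alternative that does not depend on alphabetical order, here is a minimal sketch that sorts the two values within each row and drops repeated keys. The names `sub`, `full`, `key` and `deduped` are illustrative and not from the original post, and the fruit lists are assumed from the question's table.

```python
from itertools import product

import numpy as np
import pandas as pd

# Assumed inputs, taken from the table in the question
sub = ["apple", "cherry", "banana"]
full = ["apple", "cherry", "banana", "peach", "plum"]

# All ordered pairs (sub x full), as in the question
df = pd.DataFrame(list(product(sub, full)), columns=["Fruit1", "Fruit2"])

# Sort the two values within each row so that (cherry, apple) and
# (apple, cherry) end up with the same key
key = pd.DataFrame(
    np.sort(df[["Fruit1", "Fruit2"]].to_numpy(), axis=1),
    index=df.index,
)

# Keep only the first row seen for each unordered pair
deduped = df[~key.duplicated()].reset_index(drop=True)
print(deduped)
```

This keeps the first occurrence of each unordered pair in its original column order, so a row is only removed when its mirror image actually appears earlier in the frame.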

Answer 2

You can use itertools to achieve it -

import itertools
import pandas as pd
fruits  = ['banana', 'cherry',  'apple']
pd.DataFrame((itertools.permutations(fruits, 2)), columns=['fruit1', 'fruit2'])
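Note that `itertools.permutations` yields both (apple, cherry) and (cherry, apple) and omits same-fruit pairs such as (apple, apple). If the goal is one row per unordered pair from a single list, `itertools.combinations_with_replacement` may be closer to what the question asks for. The sketch below assumes a single list of fruits, as in this answer, so it does not cover the sub-frame versus full-frame case from the question. The name `pairs` is illustrative.

```python
import itertools

import pandas as pd

fruits = ["banana", "cherry", "apple"]

# Each unordered pair appears once; same-fruit pairs are included
pairs = pd.DataFrame(
    itertools.combinations_with_replacement(fruits, 2),
    columns=["fruit1", "fruit2"],
)
print(pairs)
```

To exclude same-fruit pairs as well, `itertools.combinations(fruits, 2)` can be used in the same way.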

2 solutions

Solution 1
1, accepted, 2020-08-05 18:23:10

Solution 2
0, 2020-08-05 17:54:56
