Removing observations that have the same column values for a non-unique id

I have a dataframe that holds "tag information" on different companies for both iPad and Tablet platforms. Each "experiment" has an id, which can occur multiple times depending on how many tags the experiment has. Experiments can be on iPad or Tablet (type), but I want to remove all of the duplicate experiments (the same experiment appearing on both iPad and Tablet). An experiment is a duplicate if it is from the same company and has the exact same tags. For example, in the following dataframe Netflix is a duplicate because it has the same tags (Includes dropdown, Includes product list) for both iPad and Tablet, so either the Tablet version or the iPad version should be removed.

Input:

id  company   type       tag
1   Netflix   iPad       Includes dropdown
1   Netflix   iPad       Includes product list
2   Netflix   Tablet     Includes dropdown
2   Netflix   Tablet     Includes product list
3   Apple     iPad       Includes images
4   Apple     Tablet     Includes images

Output:

id  company   type       tag
2   Netflix   Tablet     Includes dropdown
2   Netflix   Tablet     Includes product list
3   Apple     iPad       Includes images
4   Apple     Tablet     Includes images

I'm looking for a solution in Python with pandas. How can I do this?

I've tried this:

df.drop_duplicates(subset=['tag'], keep='last')

But I don't think this solution works, because there might be another experiment from a different company that contains the same tags. It would be deleted even though it is not considered a duplicate.

Basically, I want to drop ids that have the same tags for the same company.

I think you just need to add the company name to your subset parameter. Let's build a dataframe like yours:

import pandas as pd

# Use `ids` and `types` so the Python built-ins `id` and `type` are not shadowed.
ids = [1, 1, 2, 2, 3, 4]
company = ['Netflix']*4 + ['Apple'] + ['New']
types = ['iPad', 'iPad', 'Tablet', 'Tablet', 'iPad', 'Tablet']
tag = ['Includes dropdown', 'Includes product list']*2 + ['Includes images']*2
data = {'id': ids, 'company': company, 'type': types, 'tag': tag}
df = pd.DataFrame(data)

Printing df gives the following dataframe: [image]

Ids 3 and 4 have the same tag but different company names, which is the situation you mentioned. If we just use the code you tried:

df.drop_duplicates(subset=['tag'], keep='last')

We get this: [image]
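For reference, this step can also be run end to end. The following is a minimal, self-contained sketch that rebuilds the dataframe from this answer and checks which ids survive when only tag is used as the subset. The names by_tag and remaining_ids are introduced here for illustration only.

```python
import pandas as pd

# Same data as in the answer: ids 3 and 4 share a tag but belong to
# different companies ('Apple' and 'New').
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 4],
    "company": ["Netflix"] * 4 + ["Apple", "New"],
    "type": ["iPad", "iPad", "Tablet", "Tablet", "iPad", "Tablet"],
    "tag": ["Includes dropdown", "Includes product list"] * 2
           + ["Includes images"] * 2,
})

# Deduplicate on the tag column alone, keeping the last occurrence of each tag.
by_tag = df.drop_duplicates(subset=["tag"], keep="last")
print(by_tag)

# Ids 2 and 4 remain; id 3 (Apple) is dropped although it is a different company.
remaining_ids = sorted(by_tag["id"].unique().tolist())
print(remaining_ids)
```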

In the result above, id 3 was deleted, which is what you want to avoid. However, if we add company to subset:

df.drop_duplicates(subset=['company', 'tag'], keep='last')

We get what you want: [image] Only the duplicate Netflix experiment (id 1) is removed, and ids 3 and 4 are both kept.
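One caveat worth noting: drop_duplicates compares individual rows, so subset=['company', 'tag'] removes repeated (company, tag) pairs rather than whole experiments. On the example data the two coincide, but if two experiments from the same company share only some of their tags, rows would be removed from an experiment that is not a duplicate under the question's definition (same company and exactly the same tags). Below is a rough sketch of an experiment-level alternative, assuming each id belongs to a single company. The helper drop_duplicate_experiments and experiment id 5 are hypothetical and added here only to illustrate the difference.

```python
import pandas as pd


def drop_duplicate_experiments(df: pd.DataFrame, keep: str = "last") -> pd.DataFrame:
    """Drop whole experiments (ids) that repeat a (company, exact tag set)."""
    grouped = df.groupby("id")
    # One row per experiment: its company and the full set of its tags.
    # frozenset is hashable, so drop_duplicates can compare it.
    signature = pd.DataFrame({
        "company": grouped["company"].first(),
        "tags": grouped["tag"].apply(frozenset),
    })
    # groupby sorts by id, so keep="last" keeps the largest id per signature.
    kept_ids = signature.drop_duplicates(
        subset=["company", "tags"], keep=keep
    ).index
    return df[df["id"].isin(kept_ids)]


# Data from the answer plus a hypothetical experiment 5 (Netflix) that
# shares only one tag with experiments 1 and 2.
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 3, 4, 5, 5],
    "company": ["Netflix"] * 4 + ["Apple", "New"] + ["Netflix"] * 2,
    "type": ["iPad", "iPad", "Tablet", "Tablet", "iPad", "Tablet", "iPad", "iPad"],
    "tag": ["Includes dropdown", "Includes product list"] * 2
           + ["Includes images"] * 2
           + ["Includes dropdown", "Includes images"],
})

# Row-level: experiment 2 loses its 'Includes dropdown' row to experiment 5.
row_level = df.drop_duplicates(subset=["company", "tag"], keep="last")
# Experiment-level: only experiment 1 (an exact copy of 2) is removed.
experiment_level = drop_duplicate_experiments(df)

print(row_level)
print(experiment_level)
```

If every pair of experiments within a company either shares all tags or none, the one-line drop_duplicates call from the answer is sufficient and simpler.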


