Removing observations that have the same column values for a non-unique id
I have a dataframe that has "tag information" on different companies for both iPad and Tablet platforms. Each "experiment" has an id, which can occur multiple times depending on how many tags the experiment has. Experiments can be on iPad or Tablet (type), but I want to remove all of the duplicate experiments (the same experiment that appears on both iPad and Tablet). An experiment is a duplicate if it's from the same company and has the exact same tags.
For example, in the following dataframe Netflix is a duplicate because it has the same tags (Includes dropdown, Includes product list) for both iPad and Tablet, so either the Tablet version or the iPad version should be removed.
Input:
id company type tag
1 Netflix iPad Includes dropdown
1 Netflix iPad Includes product list
2 Netflix Tablet Includes dropdown
2 Netflix Tablet Includes product list
3 Apple iPad Includes images
4 Apple Tablet Includes images
Output:
id company type tag
2 Netflix Tablet Includes dropdown
2 Netflix Tablet Includes product list
3 Apple iPad Includes images
4 Apple Tablet Includes images
I'm looking for a solution in pandas (Python). How can I do this?
I've tried this:
df.drop_duplicates(subset=['tag'], keep='last')
But I don't think this solution works, because there might be another experiment from a different company that contains the same tags. That experiment would be deleted even though it's not considered a duplicate.
Basically, I want to drop ids that have the same tags for the same company.
I think you just need to add the company name to your subset parameter. Let's build a dataframe like yours (here id 4 belongs to a different company, 'New', to reproduce the case you are worried about):
import pandas as pd

# ids/types avoid shadowing the built-ins id() and type()
ids = [1, 1, 2, 2, 3, 4]
company = ['Netflix']*4 + ['Apple'] + ['New']
types = ['iPad', 'iPad', 'Tablet', 'Tablet', 'iPad', 'Tablet']
tag = ['Includes dropdown', 'Includes product list']*2 + ['Includes images']*2
data = {'id': ids, 'company': company, 'type': types, 'tag': tag}
df = pd.DataFrame(data)
Printing df shows the dataframe. Ids 3 and 4 have the same tag but different company names, as you mentioned. If we just use the code you tried:
df.drop_duplicates(subset=['tag'], keep='last')
We will get this:
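A minimal, self-contained sketch to reproduce that result, assuming the same sample data as in this answer. Deduplicating on tag alone keeps only the last row for each tag, so only ids 2 and 4 survive:

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3, 4],
    'company': ['Netflix'] * 4 + ['Apple', 'New'],
    'type': ['iPad', 'iPad', 'Tablet', 'Tablet', 'iPad', 'Tablet'],
    'tag': ['Includes dropdown', 'Includes product list'] * 2
           + ['Includes images'] * 2,
})

# Deduplicate on the tag column only, keeping the last occurrence of each tag.
by_tag = df.drop_duplicates(subset=['tag'], keep='last')
print(by_tag)

# The ids that survive: the Apple row (id 3) is gone because the
# 'New' row (id 4) carries the same tag and comes later.
print(by_tag['id'].tolist())
```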
In this result, id 3 was deleted, which is what you want to avoid. However, if we add company to subset:
df.drop_duplicates(subset=['company', 'tag'], keep='last')
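On the sample data this keeps ids 2, 3 and 4, which the sketch below reproduces. One caveat: drop_duplicates compares individual rows, so it checks each (company, tag) pair on its own rather than an experiment's full set of tags. If one experiment had an extra tag, the rows for the tags it shares with another experiment of the same company would still be dropped. If "exact same tags" has to hold for the whole experiment, one possible alternative is to collapse each id into a set of tags first. This is an assumption on my part and not part of the answer above, and the helper name drop_duplicate_experiments is hypothetical:

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    'id': [1, 1, 2, 2, 3, 4],
    'company': ['Netflix'] * 4 + ['Apple', 'New'],
    'type': ['iPad', 'iPad', 'Tablet', 'Tablet', 'iPad', 'Tablet'],
    'tag': ['Includes dropdown', 'Includes product list'] * 2
           + ['Includes images'] * 2,
})

# Row-level deduplication, as suggested in the answer.
by_company_tag = df.drop_duplicates(subset=['company', 'tag'], keep='last')
print(by_company_tag)


def drop_duplicate_experiments(frame):
    """Hypothetical helper: keep one id per (company, full set of tags)."""
    # One row per experiment: its company and the set of all its tags.
    experiments = (
        frame.groupby('id')
        .agg(company=('company', 'first'), tags=('tag', frozenset))
        .reset_index()
    )
    # Keep the last experiment for each (company, tag set) combination.
    kept = experiments.drop_duplicates(subset=['company', 'tags'], keep='last')
    # Return every original row that belongs to a kept experiment.
    return frame[frame['id'].isin(kept['id'])]


by_experiment = drop_duplicate_experiments(df)
print(by_experiment)
```

On this sample both approaches keep the same rows; they only diverge when two experiments of the same company share some tags but not all of them.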