
Pandas: How to drop column values that are duplicates but keep certain row values

I have a Pandas dataframe that has duplicate names but different values, and I want to remove the duplicate names but keep the rows. A snippet of my dataframe looks like this:

[screenshot of the dataframe]

And my desired output would look like this:

[screenshot of the desired output]

I've tried using the built-in pandas function .drop_duplicates(), but I end up deleting all duplicates and their respective rows. My current code looks like this:

import pandas as pd

# read the CSV in chunks, then combine the chunks into one dataframe
df = pd.read_csv("merged_db.csv", encoding="unicode_escape", chunksize=50000)
df = pd.concat(df, ignore_index=True)
# this drops the whole row for every repeated author, coauthor values included
df2 = df.drop_duplicates(subset=['auth_given_name', 'auth_surname'])

and this is the output I am currently getting:

[screenshot of the current output]

Basically, I want to return all the values of the coauthor but remove all duplicate data of the original author. My question is: what is the best way to achieve the output I want? I tried using the subset parameter, but I don't believe I'm using it correctly. I also found a similar post, but I couldn't really apply it to Python. Thank you for your time!
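For reference, a minimal sketch of the behaviour described above, using made-up names in the two author columns from the snippet: drop_duplicates() removes the entire repeated row, coauthor and all, rather than only clearing the repeated author fields.

import pandas as pd

df = pd.DataFrame({
    "auth_given_name": ["Ada", "Ada"],
    "auth_surname": ["Lovelace", "Lovelace"],
    "coauthor": ["Babbage", "De Morgan"],
})
# keeps only the first row, so the "De Morgan" coauthor entry is lost entirely
print(df.drop_duplicates(subset=["auth_given_name", "auth_surname"]))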

You may consider this code您可以考虑此代码

import numpy as np
import pandas as pd

# chunksize makes read_csv return an iterator of chunks, so combine them first
df = pd.read_csv("merged_db.csv", encoding="unicode_escape", chunksize=50000)
df = pd.concat(df, ignore_index=True)
first_author = df.columns[:24]  # the first 24 columns describe the original author
# blank the author columns on duplicated rows with NaN, keeping the rows themselves
df.loc[df.duplicated(subset=first_author), first_author] = np.nan
print(df)
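To see the idea in isolation, here is a minimal, self-contained sketch of the same masking approach on a toy dataframe; the column names are taken from the question, the data values are made up, and filling the blanked cells with NaN is an assumption about the desired output.

import numpy as np
import pandas as pd

# toy data: the author columns repeat on the second row, the coauthor differs
df = pd.DataFrame({
    "auth_given_name": ["Ada", "Ada", "Alan"],
    "auth_surname": ["Lovelace", "Lovelace", "Turing"],
    "coauthor": ["Babbage", "De Morgan", "Church"],
})
author_cols = ["auth_given_name", "auth_surname"]
# blank the author columns on the repeated row, keeping the row (and its coauthor)
df.loc[df.duplicated(subset=author_cols), author_cols] = np.nan
print(df)

All three rows survive; only the repeated author fields on the middle row become NaN, while its coauthor value "De Morgan" is preserved.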
