

Unable to delete the duplicates in CSV

I have a data set in CSV with a field named Episode, which holds data for future sporting events. For the same date we have both "INDIA VS PAKISTAN" and "PAKISTAN VS INDIA". Is there any option to delete the duplicate?

Thanks in advance.


One idea would be to use the pandas rank method, grouping by the needed columns:

df["RANK"] = df.groupby("Column_1")["Column_2"].rank(method="first", ascending=True)

This should rank the rows within each group, so three rows of dupes should be ranked 1, 2 and 3 respectively. From there, you can take the subset of the dataframe where RANK == 1, and this will give you a dataframe with no dupes.
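As a rough sketch of how the ranking idea might be applied to the question's data: the column names `Date` and `Episode`, the sample rows, and the helper columns `key` and `row` are all assumptions made for illustration, not taken from the asker's file. Because "INDIA VS PAKISTAN" and "PAKISTAN VS INDIA" are different strings, the sketch first builds an order-independent key and then ranks a numeric row position within each (`Date`, `key`) group.

```python
import pandas as pd

# Hypothetical sample data; the real CSV layout is not shown in the question.
df = pd.DataFrame({
    "Date": ["2024-06-09", "2024-06-09", "2024-06-10"],
    "Episode": ["INDIA VS PAKISTAN", "PAKISTAN VS INDIA", "INDIA VS PAKISTAN"],
})

# Order-independent key: sort the two team names so both spellings match.
df["key"] = df["Episode"].apply(lambda s: " VS ".join(sorted(s.split(" VS "))))

# Numeric column to rank on: the original row position.
df["row"] = range(len(df))

# Rank rows within each (Date, key) group; the first occurrence gets rank 1.
df["RANK"] = df.groupby(["Date", "key"])["row"].rank(method="first", ascending=True)

# Keep only the first row of each group and drop the helper columns.
deduped = df[df["RANK"] == 1].drop(columns=["key", "row", "RANK"])
print(deduped)
```

Ranking the row position keeps whichever spelling appears first for a given date; the same pairing on a different date is left alone.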

Create a new match column, then use drop_duplicates:

import pandas as pd

# sample df
df = pd.DataFrame({'a': [1,1,1,1,1],
                   'b': ['Bulldogs at Aztecs', 'Aztecs at Bulldogs', 'Bearcats at Huskies', 'Huskies at Bearcats', 'something else']})

# list comprehension and sort words in string 
df['match'] = [' '.join(sorted(x.split())) for x in df['b'].values]

#    a                    b                match
# 0  1   Bulldogs at Aztecs   Aztecs Bulldogs at
# 1  1   Aztecs at Bulldogs   Aztecs Bulldogs at
# 2  1  Bearcats at Huskies  Bearcats Huskies at
# 3  1  Huskies at Bearcats  Bearcats Huskies at
# 4  1       something else       else something

# drop_duplicates
df.drop_duplicates(['a', 'match'], keep='first').drop(columns='match')

#    a                    b
# 0  1   Bulldogs at Aztecs
# 2  1  Bearcats at Huskies
# 4  1       something else
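The same approach could be adapted to the CSV described in the question. The following is only a sketch under stated assumptions: the `Date` column, the inline sample CSV, and the helper `match_key` are hypothetical, and the teams are assumed to be separated by the word "VS" (in any letter case). Splitting on the separator rather than on every word keeps multi-word team names intact.

```python
import io
import re

import pandas as pd

# Hypothetical CSV contents standing in for the asker's file.
csv_text = """Date,Episode
2024-06-09,INDIA VS PAKISTAN
2024-06-09,PAKISTAN VS INDIA
2024-06-09,NEW ZEALAND VS INDIA
2024-06-10,India vs Pakistan
"""
df = pd.read_csv(io.StringIO(csv_text))


def match_key(episode: str) -> str:
    """Return an order-independent, upper-cased key for a fixture string."""
    teams = re.split(r"\s+VS\s+", episode.strip().upper())
    return " VS ".join(sorted(teams))


# Build the key, drop duplicates per date, then remove the helper column.
df["match"] = df["Episode"].map(match_key)
deduped = df.drop_duplicates(["Date", "match"], keep="first").drop(columns="match")
print(deduped)
```

For a real file, `pd.read_csv` would take the file path instead of the `io.StringIO` wrapper, and `deduped.to_csv(..., index=False)` could write the result back out.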
