
Unable to delete the duplicates in CSV

I have a data set in CSV with a field named Episode, which holds data for future sports events. For the same date it contains both "INDIA VS PAKISTAN" and "PAKISTAN VS INDIA". Is there any way to delete the duplicate?

Thanks in advance


One idea would be to use the pandas rank method, grouping by the needed columns:

df["RANK"] = df.groupby("Column_1")["Column_2"].rank(method="first", ascending=True)

This adds a rank within each group, so three duplicate rows should be ranked 1, 2 and 3 respectively. From there, you can take the subset of the dataframe where RANK equals 1, which gives you a dataframe with no duplicates.
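As a minimal sketch of the rank idea, assuming the CSV has a date column next to Episode (the names "Date" and "row_id" and the sample rows are hypothetical, not taken from the question):

```python
import pandas as pd

# Hypothetical sample data; "Date" and the values are assumptions.
df = pd.DataFrame({
    "Date": ["2024-06-09", "2024-06-09", "2024-06-09", "2024-06-10"],
    "Episode": ["INDIA VS PAKISTAN"] * 4,
})

# Helper column (assumption): a numeric row counter to rank on.
df["row_id"] = range(len(df))

# Rank rows inside each (Date, Episode) group in order of appearance.
df["RANK"] = df.groupby(["Date", "Episode"])["row_id"].rank(method="first", ascending=True)

# Keep only the first row of every group and drop the helper columns.
deduped = df[df["RANK"] == 1].drop(columns=["row_id", "RANK"])
print(deduped)
```

Note that this only collapses rows whose Episode text is identical. "INDIA VS PAKISTAN" and "PAKISTAN VS INDIA" land in different groups, so the reversed pair would still need a normalised key.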

Create a new match column that ignores word order, then use drop_duplicates

import pandas as pd

# sample df
df = pd.DataFrame({'a': [1,1,1,1,1],
                   'b': ['Bulldogs at Aztecs', 'Aztecs at Bulldogs', 'Bearcats at Huskies', 'Huskies at Bearcats', 'something else']})

# list comprehension: sort the words in each string to get an order-independent key
df['match'] = [' '.join(sorted(x.split())) for x in df['b'].values]

#    a                    b                match
# 0  1   Bulldogs at Aztecs   Aztecs Bulldogs at
# 1  1   Aztecs at Bulldogs   Aztecs Bulldogs at
# 2  1  Bearcats at Huskies  Bearcats Huskies at
# 3  1  Huskies at Bearcats  Bearcats Huskies at
# 4  1       something else       else something

# drop_duplicates
df.drop_duplicates(['a', 'match'], keep='first').drop(columns='match')

#    a                    b
# 0  1   Bulldogs at Aztecs
# 2  1  Bearcats at Huskies
# 4  1       something else
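Applied to the data in the question, the same approach might look like the sketch below. The "Date" column name, the sample rows, and the `match_key` helper are assumptions for illustration; only the Episode field and the "VS" format come from the question.

```python
import pandas as pd

# Hypothetical sample data; only the "Episode" field name is from the question.
df = pd.DataFrame({
    "Date": ["2024-06-09", "2024-06-09", "2024-06-10", "2024-06-10"],
    "Episode": [
        "INDIA VS PAKISTAN",
        "PAKISTAN VS INDIA",
        "PAKISTAN VS INDIA",
        "AUSTRALIA VS ENGLAND",
    ],
})

def match_key(episode: str) -> str:
    """Return the team names in sorted order so A VS B equals B VS A."""
    teams = [team.strip() for team in episode.upper().split(" VS ")]
    return " VS ".join(sorted(teams))

# Build the order-independent key, then drop duplicates per date.
df["match"] = df["Episode"].map(match_key)
deduped = df.drop_duplicates(["Date", "match"], keep="first").drop(columns="match")
print(deduped)
```

Splitting on " VS " rather than on whitespace keeps multi-word team names (for example "SRI LANKA") together. Including the date in the subset means the same fixture on a different day is kept.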

