Can't remove duplicates from .csv column with pandas

I'm trying to do something very simple to a .csv containing addresses. I want to use the pandas function drop_duplicates() to remove any rows that contain a duplicate value in a single column (['Addresses']).

Whenever I use drop_duplicates() and then print or save my data frame to a new .csv, the duplicate rows/values are still there.


import pandas

data = pandas.read_csv(r"C:\Users\markbrd\Desktop\PalmAveAddresses.csv",
                       encoding="ISO-8859-1")

data.drop_duplicates(subset=['Addresses'], keep='first')

print(data['Addresses'])

Results:

0             4834Via Estrella
1             5244Via Patricia
2        11721HIDDEN VALLEY RD
3                  30GARDEN CT
4      1999Fremont Blvd. Bldg.
5          8316Fountainhead Ct
6          8312Fountainhead Ct
7               1013Adella Ave
8               1005Adella Ave
9                 1520Tenth St
10                1536Tenth St

                ...           

607              847Florida St
608                 81212th St
609                 81212th St
610                 81212th St
611                 81212th St
612                 81212th St
613                 81212th St
614                 81212th St
615                 81212th St
616                 81212th St
617                 81212th St
618                 81212th St
619                 81212th St

As you can see, there are still several rows that contain duplicates in Addresses (see rows 609-619). Any help would be greatly appreciated!

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False)

Return DataFrame with duplicate rows removed, optionally only considering certain columns.

Parameters:

subset : column label or sequence of labels, optional
    Only consider certain columns for identifying duplicates; by default use all of the columns.

keep : {'first', 'last', False}, default 'first'
    first : Drop duplicates except for the first occurrence.
    last : Drop duplicates except for the last occurrence.
    False : Drop all duplicates.

inplace : boolean, default False
    Whether to drop duplicates in place or to return a copy.

Returns:

deduplicated : DataFrame
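To make the keep options quoted above concrete, here is a minimal sketch on a small made-up frame. The sample addresses and the "Unit" column are assumptions for illustration only; they are not taken from the asker's file.

```python
import pandas as pd

# Hypothetical sample data, not the asker's actual CSV.
df = pd.DataFrame({
    "Addresses": ["812 12th St", "847 Florida St", "812 12th St", "812 12th St"],
    "Unit": ["A", "B", "C", "D"],
})

# Each call returns a new DataFrame; df itself is left unchanged.
keep_first = df.drop_duplicates(subset=["Addresses"], keep="first")
keep_last = df.drop_duplicates(subset=["Addresses"], keep="last")
keep_none = df.drop_duplicates(subset=["Addresses"], keep=False)

print(keep_first["Unit"].tolist())  # ['A', 'B']
print(keep_last["Unit"].tolist())   # ['B', 'D']
print(keep_none["Unit"].tolist())   # ['B']
```

With keep=False, every address that appears more than once is dropped entirely, so only the unique address survives.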

You need to either assign the result or use inplace:

data.drop_duplicates(subset=['Addresses'], keep='first', inplace=True)
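The difference between the two options can be sketched as follows. This uses a small made-up frame rather than the asker's file, so the sample addresses are assumptions.

```python
import pandas as pd

# Hypothetical sample data standing in for the CSV contents.
data = pd.DataFrame(
    {"Addresses": ["812 12th St", "812 12th St", "847 Florida St"]}
)

# Without assignment the deduplicated copy is discarded and data is untouched.
data.drop_duplicates(subset=["Addresses"], keep="first")
rows_before = len(data)  # still 3

# Option 1: assign the returned DataFrame to a name.
deduped = data.drop_duplicates(subset=["Addresses"], keep="first")

# Option 2: modify data in place; with inplace=True the call returns None.
data.drop_duplicates(subset=["Addresses"], keep="first", inplace=True)

print(rows_before, len(deduped), len(data))  # 3 2 2
```

Either way, print or save the deduplicated frame (deduped, or data after the in-place call), for example with to_csv, rather than the original.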
