Removing duplicates from an array in Python depending on the first 4 letters
I have a list of postcodes, e.g.:
DD1 1DB
DD1 5PH
DD10 8JG
DD10 9LJ
What I would like to do is keep the first representative for each distinct first part of the postcode. In this example I need to keep:
DD1 1DB
DD10 8JG
I am using pandas and imported file.csv, which contains a column named POSTCODES, as:
df = pandas.read_csv('file.csv')
pc = df.POSTCODES
Now I am completely stuck. I managed to get it working in Excel (is that the better option?), but I would like to learn Python and decided to see whether I can do it there.
You could use df['POSTCODES'].str[:4] to obtain the first four characters, and use the duplicated method to identify duplicates:
In [89]: df.loc[~df['POSTCODES'].str[:4].duplicated(keep='first')]
Out[89]:
   POSTCODES
0    DD1 1DB
2   DD10 8JG
Since duplicated(keep='first') marks every repeat after the first occurrence as True, the rows we wish to keep are marked False. To select those False rows with df.loc, the ~ operator is used to invert the boolean Series.
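To see why the inversion is needed, it can help to look at the intermediate values. The sketch below rebuilds the four sample postcodes in memory instead of reading file.csv (an assumption made so that it runs on its own); the names prefix, is_repeat and kept are illustrative only.

```python
import pandas as pd

# The four sample postcodes from the question, built in memory
# rather than read from file.csv (assumption, for a self-contained run).
df = pd.DataFrame({"POSTCODES": ["DD1 1DB", "DD1 5PH", "DD10 8JG", "DD10 9LJ"]})

# First four characters of each postcode; note that "DD1 " keeps its space.
prefix = df["POSTCODES"].str[:4]

# True for every occurrence of a prefix after the first one.
is_repeat = prefix.duplicated(keep="first")

# Inverting the mask keeps only the first row of each prefix.
kept = df.loc[~is_repeat]

print(is_repeat.tolist())          # [False, True, False, True]
print(kept["POSTCODES"].tolist())  # ['DD1 1DB', 'DD10 8JG']
```

The original index labels (0 and 2) are preserved in kept; calling reset_index(drop=True) afterwards would renumber them if that matters.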
Alternatively, split each postcode on the space and check the first part for duplicates:
df[~df.POSTCODES.str.split(' ', expand=True)[0].duplicated()]
Or, as piRSquared suggests in the comments:
df[~df.POSTCODES.str.split().str[0].duplicated()]
Output:
   POSTCODES
0    DD1 1DB
2   DD10 8JG
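One reason to prefer splitting on whitespace is that a fixed four-character slice only groups rows correctly when the first part of the postcode is three or four characters long. The following sketch illustrates this with two made-up postcodes whose first part has only two characters (G1 1AA and G1 2BB are hypothetical values, not taken from the question's data).

```python
import pandas as pd

# Two sample postcodes plus two made-up ones with a short first part.
df = pd.DataFrame({"POSTCODES": ["DD1 1DB", "DD1 5PH", "G1 1AA", "G1 2BB"]})

# Fixed-width key: "G1 1" and "G1 2" differ, so both G1 rows survive.
by_slice = df[~df["POSTCODES"].str[:4].duplicated()]

# Whitespace-split key: both rows share "G1", so only the first survives.
by_split = df[~df["POSTCODES"].str.split().str[0].duplicated()]

print(by_slice["POSTCODES"].tolist())  # ['DD1 1DB', 'G1 1AA', 'G1 2BB']
print(by_split["POSTCODES"].tolist())  # ['DD1 1DB', 'G1 1AA']
```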
In[24]: f = '''\
   ...: DD1 1DB
   ...: DD1 5PH
   ...: DD10 8JG
   ...: DD10 9LJ'''.split('\n')
In[25]: d = {}
   ...: for line in f:
   ...:     left, right = line.split()
   ...:     if left not in d:
   ...:         d[left] = right
   ...:
In[26]: d
Out[26]: {'DD1': '1DB', 'DD10': '8JG'}
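The same idea can be wrapped in a small function that returns the full postcodes rather than a dict of halves. This is only a sketch: the function name first_per_district and the choice to skip blank lines are assumptions, and it relies on dicts preserving insertion order (Python 3.7+).

```python
def first_per_district(postcodes):
    """Return the first full postcode seen for each leading part."""
    seen = {}
    for code in postcodes:
        parts = code.split()
        if not parts:
            continue  # skip blank lines (assumed to be unwanted)
        # setdefault stores the value only when the key is new,
        # so later postcodes with the same first part are ignored.
        seen.setdefault(parts[0], code)
    return list(seen.values())


sample = ["DD1 1DB", "DD1 5PH", "DD10 8JG", "DD10 9LJ"]
print(first_per_district(sample))  # ['DD1 1DB', 'DD10 8JG']
```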