Removing Duplicates from an array in Python depending on the first 4 letters

I have a list of postcodes, e.g.:

DD1 1DB
DD1 5PH
DD10 8JG
DD10 9LJ

What I would like to do is keep only the first representative for each distinct first part of the postcode (the part before the space). For the list above, I need to keep:

DD1 1DB
DD10 8JG

I am using pandas and imported file.csv, which contains the column POSTCODES, as:

import pandas

df = pandas.read_csv('file.csv')
pc = df.POSTCODES

Now I am completely stuck. I managed to get it working in Excel (is that the better option?), but I would like to learn Python and decided to see whether I can do it there.

You could use df['POSTCODES'].str[:4] to obtain the first four characters, and use the duplicated method to identify duplicates:

In [89]: df.loc[~df['POSTCODES'].str[:4].duplicated(keep='first')]
Out[89]: 
  POSTCODES
0   DD1 1DB
2  DD10 8JG

Since duplicated(keep='first') marks every repeat of a value as True, the rows we wish to keep (the first occurrence of each value) are marked False. To select those False rows with df.loc, the ~ operator is used to invert the boolean Series.
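
The masking steps described above can be written out one at a time. This is only a sketch: the DataFrame is rebuilt from the four sample postcodes in the question rather than read from file.csv, and the names key, is_repeat and first_rows are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question; POSTCODES is the column name used there.
df = pd.DataFrame({"POSTCODES": ["DD1 1DB", "DD1 5PH", "DD10 8JG", "DD10 9LJ"]})

# Step 1: the grouping key, here the first four characters of each postcode.
key = df["POSTCODES"].str[:4]

# Step 2: True for every row whose key has already appeared in an earlier row.
is_repeat = key.duplicated(keep="first")

# Step 3: invert the mask so that the first occurrence of each key is selected.
first_rows = df.loc[~is_repeat]

print(is_repeat.tolist())
print(first_rows["POSTCODES"].tolist())
```

Printing the intermediate mask makes it easier to see which rows the ~ flips from dropped to kept.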

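Alternatively, split each postcode on the space and look for duplicates in the first part (df is the DataFrame read from file.csv):

df[~df.POSTCODES.str.split(' ', expand=True)[0].duplicated()]

Or, as piRSquared suggests in the comments:

df[~df.POSTCODES.str.split().str[0].duplicated()]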

Output:

  POSTCODES
0   DD1 1DB
2  DD10 8JG

Without pandas, a dict can record the first entry seen for each first part:

In [24]: f = '''\
    ...: DD1 1DB
    ...: DD1 5PH
    ...: DD10 8JG
    ...: DD10 9LJ'''.split('\n')
In [25]: d = {}
    ...: for line in f:
    ...:     left, right = line.split()
    ...:     if left not in d:
    ...:         d[left] = right
    ...: 
In [26]: d
Out[26]: {'DD1': '1DB', 'DD10': '8JG'}
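
The same dict idea can be wrapped in a function that returns whole postcodes instead of a mapping from first part to second part. This is a minimal sketch, not part of the answer above: first_per_prefix is a hypothetical helper name, and it assumes every entry is a non-empty string and that the interpreter is Python 3.7 or later, where dicts preserve insertion order.

```python
def first_per_prefix(postcodes):
    """Return the first postcode seen for each first part, in input order."""
    seen = {}
    for postcode in postcodes:
        prefix = postcode.split()[0]  # text before the first whitespace
        # setdefault stores the postcode only if this prefix is not yet present
        seen.setdefault(prefix, postcode)
    return list(seen.values())


print(first_per_prefix(["DD1 1DB", "DD1 5PH", "DD10 8JG", "DD10 9LJ"]))
```

Keeping the full postcode as the dict value avoids having to rejoin the two halves afterwards.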
