Remove duplicates from an array based on the first 4 letters in Python

Question

I have a list of postcodes, e.g.:

DD1 1DB
DD1 5PH
DD10 8JG
DD10 9LJ

I would like to keep only the first entry for each distinct first part of the postcode (the part before the space). For the list above, I need to keep:

DD1 1DB
DD10 8JG

I am using pandas and imported file.csv, which contains a POSTCODES column, as:

import pandas

df = pandas.read_csv('file.csv')
pc = df.POSTCODES

Now I am completely stuck. I managed to get it working in Excel (is that the better option?), but I would like to learn Python, so I decided to see whether I can do it in Python.

Answer 1

You could use df['POSTCODES'].str[:4] to obtain the first four characters, and use the duplicated method to identify duplicates:

In [89]: df.loc[~df['POSTCODES'].str[:4].duplicated(keep='first')]
Out[89]: 
  POSTCODES
0   DD1 1DB
2  DD10 8JG

Since duplicated(keep='first') marks every repeat as True, the rows we wish to keep are marked False. To select those False rows with df.loc, the ~ operator inverts the boolean Series.
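To see each step of the mask, here is a minimal self-contained sketch. It builds the DataFrame inline from the four sample postcodes instead of reading file.csv, and the names prefix, is_dup and result are introduced purely for illustration.

```python
import pandas as pd

# Sample data from the question, built inline instead of reading file.csv.
df = pd.DataFrame({'POSTCODES': ['DD1 1DB', 'DD1 5PH', 'DD10 8JG', 'DD10 9LJ']})

# First four characters of each postcode.
# Note that 'DD1 1DB' yields 'DD1 ' (including the space).
prefix = df['POSTCODES'].str[:4]

# True for every repeat of a prefix already seen, False for its first occurrence.
is_dup = prefix.duplicated(keep='first')

# Invert the mask so that only the first occurrences are selected.
result = df.loc[~is_dup]
print(result)
```

One caveat with a fixed-width slice: it works here because every first part is three or four characters long. For a shorter first part (a hypothetical 'D1 1DB', say), the slice would also pick up a character from the second half, so splitting on the space is likely the more robust choice.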

Answer 2

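Split each postcode on the space and drop the rows whose first part has already been seen (df is the DataFrame from the question):

df[~df.POSTCODES.str.split(' ', expand=True)[0].duplicated()]

Or, as piRSquared suggests in the comments:

df[~df.POSTCODES.str.split().str[0].duplicated()]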

Output:

  POSTCODES
0   DD1 1DB
2  DD10 8JG
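The two split variants can be checked side by side on the sample data. This is only a sketch: the DataFrame is built inline, and outward_expand, outward and deduped are illustrative names that do not appear in the original answer.

```python
import pandas as pd

# Sample data from the question, built inline instead of reading file.csv.
df = pd.DataFrame({'POSTCODES': ['DD1 1DB', 'DD1 5PH', 'DD10 8JG', 'DD10 9LJ']})

# Variant 1: split on a single space into separate columns and take column 0.
outward_expand = df['POSTCODES'].str.split(' ', expand=True)[0]

# Variant 2: split on any whitespace into lists and take the first element.
outward = df['POSTCODES'].str.split().str[0]

# Keep the rows whose first part has not appeared earlier in the column.
deduped = df[~outward.duplicated()]
print(deduped)
```

Both variants produce the same first parts for this data. They could differ for values with leading or repeated spaces, since split() with no argument splits on runs of whitespace.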

Answer 3

In[24]: f = '''\
   ...: DD1 1DB
   ...: DD1 5PH
   ...: DD10 8JG
   ...: DD10 9LJ'''.split('\n')
In[25]: d = {}
   ...: for line in f:
   ...:     left, right = line.split()
   ...:     if left not in d:
   ...:         d[left] = right
   ...: 
In[26]: d
Out[26]: {'DD1': '1DB', 'DD10': '8JG'}
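If the full postcodes are wanted rather than a dict of halves, the same loop can store the whole string. A minimal sketch in plain Python; the function name first_per_prefix is hypothetical, and it assumes Python 3.7+ so that the dict preserves insertion order.

```python
def first_per_prefix(postcodes):
    """Return the first postcode seen for each distinct part before the space."""
    seen = {}
    for code in postcodes:
        left = code.split()[0]
        # setdefault stores the value only when the key is not present yet,
        # so later postcodes with the same first part are ignored.
        seen.setdefault(left, code)
    return list(seen.values())


postcodes = ['DD1 1DB', 'DD1 5PH', 'DD10 8JG', 'DD10 9LJ']
kept = first_per_prefix(postcodes)
print(kept)
```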

3 answers

Answer 1
4 2017-06-13 20:28:24

Answer 2
2 2017-06-13 20:27:57

Answer 3
0 2017-06-13 20:37:51
