Remove duplicates in a list of lists based on the fourth item in each sublist

I have a list of lists that looks like:

c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv'],
     ...]

c has about 30,000 interior lists. What I'd like to do is eliminate duplicates based on the 4th item of each interior list, keeping the first list seen for each value. So the list of lists above would look like:

c = [['470', '4189.0', 'asdfgw', 'fds'], ['470', '4189.0', 'qwer', 'dsfs fdv'], ...]

Here is what I have so far:

d = [] #list that will contain condensed c
d.append(c[0]) #append first element, so I can compare lists
for bact in c: #c is my list of lists with 30,000 interior list
    for items in d:
        if bact[3] != items[3]:
            d.append(bact)  

I think this should work, but it just runs and runs. I let it run for 30 minutes, then killed it. I don't think the program should take so long, so I'm guessing there is something wrong with my logic.

I have a feeling that creating a whole new list of lists is pretty stupid. Any help would be much appreciated, and please feel free to nitpick as I am learning. Also, please correct my vocabulary if it is incorrect.

I'd do it like this:

seen = set()
cond = [x for x in c if x[3] not in seen and not seen.add(x[3])]

Explanation:

seen is a set that keeps track of the fourth elements already encountered across the sublists. cond is the condensed list. If x[3] (where x is a sublist in c) is not in seen, x is added to cond and x[3] is added to seen.

seen.add(x[3]) returns None, so not seen.add(x[3]) is always True, but that part is only evaluated if x[3] not in seen is True, since Python uses short-circuit evaluation. When the second condition is evaluated, it always yields True and has the side effect of adding x[3] to seen. Here's another example of what's happening (print returns None and has the "side effect" of printing something):

>>> False and not print('hi')
False
>>> True and not print('hi')
hi
True
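As a quick check, here is the same one-liner run on the three sample rows from the question. This is only a sketch on that tiny sample, not on the full 30,000-row list; the first row seen for each fourth element is kept and the original order is preserved.

```python
# Sample rows copied from the question.
c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv']]

seen = set()
# Keep x only if its fourth element is new; seen.add(...) returns None,
# so "not seen.add(...)" is True and merely records the element.
cond = [x for x in c if x[3] not in seen and not seen.add(x[3])]

for row in cond:
    print(row)
# ['470', '4189.0', 'asdfgw', 'fds']
# ['470', '4189.0', 'qwer', 'dsfs fdv']
```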

You have a significant logic flaw in your current code:

for items in d:
    if bact[3] != items[3]:
        d.append(bact)  

This adds bact to d once for every item in d that doesn't match. Each new fourth value can therefore roughly double the length of d, which is why the loop never seems to finish. For a minimal fix, you need to switch to:

for items in d:
    if bact[3] == items[3]:
        break
else:
    d.append(bact)  

to add bact once, and only if no item in d matches. I suspect this will mean your code runs in more sensible time.
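To make the difference concrete, here is a small sketch that wraps both loops in functions and runs them on ten rows whose fourth elements are all distinct. The function names flawed_dedupe and fixed_dedupe and the test rows are made up for illustration; the loop bodies are the ones shown above.

```python
def flawed_dedupe(rows):
    """The question's loop: appends once per non-matching item in d."""
    d = [rows[0]]
    for bact in rows:
        for items in d:
            if bact[3] != items[3]:
                d.append(bact)
    return d


def fixed_dedupe(rows):
    """The for/else fix: appends once, only if nothing in d matched."""
    d = [rows[0]]
    for bact in rows:
        for items in d:
            if bact[3] == items[3]:
                break
        else:  # runs only when the inner loop finished without break
            d.append(bact)
    return d


# Ten illustrative rows, each with a different fourth element.
rows = [['x', 'y', 'z', 'key%d' % i] for i in range(10)]

print(len(flawed_dedupe(rows)))  # 512
print(len(fixed_dedupe(rows)))   # 10
```

With the flawed loop, every new fourth value doubles the length of d, so even a few dozen distinct values are already too many to process.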


On top of that, one obvious performance improvement (a speed boost, albeit at the cost of memory usage) would be to keep a set of the fourth elements you've seen so far. Lookups on a set use hashes, so the membership test (marked with a comment in the code) will be much quicker.

d = []
seen = set()
for bact in c:
    if bact[3] not in seen: # membership test
        seen.add(bact[3])
        d.append(bact)
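If the same pattern is needed for other columns, the loop can be wrapped in a small generator. This is a sketch, not part of the original answer: the name unique_by and its key parameter are hypothetical, and it assumes the key values are hashable (strings are).

```python
from operator import itemgetter


def unique_by(rows, key):
    """Yield the first row seen for each distinct key(row), in order."""
    seen = set()
    for row in rows:
        k = key(row)
        if k not in seen:  # hash-based membership test
            seen.add(k)
            yield row


# Sample rows copied from the question.
c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv']]

d = list(unique_by(c, key=itemgetter(3)))
print(d)
# [['470', '4189.0', 'asdfgw', 'fds'], ['470', '4189.0', 'qwer', 'dsfs fdv']]
```

Because it is a generator, the rows are produced lazily; wrapping the call in list() gives the same result as the explicit loop.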

Use pandas. I assume you have better column names as well.

c = [['470', '4189.0', 'asdfgw', 'fds'],
     ['470', '4189.0', 'qwer', 'fds'],
     ['470', '4189.0', 'qwer', 'dsfs fdv']]
import pandas as pd
df = pd.DataFrame(c, columns=['col_1', 'col_2', 'col_3', 'col_4'])
df.drop_duplicates('col_4', inplace=True)
print(df)

  col_1   col_2   col_3     col_4
0   470  4189.0  asdfgw       fds
2   470  4189.0    qwer  dsfs fdv
