Removing duplicates without set()

Question

I have a .txt file of IPs, times, search queries, and websites accessed. I used a for loop to split each line into the elements of a list, then placed all of these lists into a larger list.

When printed, it may look like this:

['4.16.159.114', '08:13:37', 'french-english dictionary', 'humanities.uchicago.edu/forms_unrest/FR-ENG.html\n']
['4.16.186.203', '00:13:54', 's.e.t.i.', 'www.seti.net/\n']
['4.16.189.59', '05:48:58', 'which is better http upload or ftp upload', 'www.ewebtribe.com/htmlhelp/uploading.htm\n']
['4.16.189.59', '06:50:49', 'cgi perl tutorial', 'www.cgi101.com/class/\n']
['4.16.189.59', '07:16:28', 'cgi perl tutorial', 'www.free-ed.net/fr03/lfc/course%20030207_01/\n']

The code that gets me here is below. It just reads the data from a text file, puts it into a list, and then writes it to another text file.

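import io

# The input file is Latin-1 encoded; the results go to a separate work file.
f = io.open(r'C:\Users\Ryan Asher\Desktop\%23AlltheWeb_2001.txt', encoding="Latin-1")
p = io.open(r'C:\Users\Ryan Asher\Desktop\workfile.txt', 'w')

sweet = []

for line in f:
    x = line.split("     ")
    # Tab-separated fields: IP, time, search query, website
    lbreak = x[0].split("\t")
    sweet.append(lbreak)

for item in sweet:
    p.write("%s\n" % item)

# Close both files so the output is flushed to disk.
f.close()
p.close()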

My issue is with the third element of each list within the sweet list, i.e. index [2], which is the search query (french-english dictionary, seti, etc.). I do not want duplicate queries in the 'sweet' list.

So where 'cgi perl tutorial' appears twice, I need to get rid of the later 'cgi perl tutorial' entry and leave only the first one in the sweet list.

I don't think I can use set for this, because I only want the de-duplication to apply to the search queries at index [2]. I don't want it to remove entries that merely share the same IP or one of the other fields.

Answer 1

Add lbreak[2] to a set, and only append a line whose lbreak[2] is not already in the set, something like:

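sweet = []
seen = set()  # search queries that have already been appended

for line in f:
    x = line.split("     ")
    lbreak = x[0].split("\t")
    # Keep only the first row for each distinct query.
    if lbreak[2] not in seen:
        sweet.append(lbreak)
        seen.add(lbreak[2])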
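
For anyone who wants to try this without the original file, here is a self-contained sketch of the same idea. It assumes tab-separated lines like the sample in the question. The names dedupe_by_query, key_index and sample_lines are made up for illustration, and unlike the original loop it strips the trailing newline and skips lines with too few fields.

```python
def dedupe_by_query(lines, key_index=2):
    """Keep only the first row for each distinct value at key_index."""
    seen = set()       # key values already encountered
    unique_rows = []   # rows kept, in their original order
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= key_index:
            continue   # skip blank or malformed lines (an added assumption)
        key = fields[key_index]
        if key not in seen:
            seen.add(key)
            unique_rows.append(fields)
    return unique_rows


# Hypothetical sample data, copied from the rows shown in the question.
sample_lines = [
    "4.16.189.59\t05:48:58\twhich is better http upload or ftp upload\twww.ewebtribe.com/htmlhelp/uploading.htm\n",
    "4.16.189.59\t06:50:49\tcgi perl tutorial\twww.cgi101.com/class/\n",
    "4.16.189.59\t07:16:28\tcgi perl tutorial\twww.free-ed.net/fr03/lfc/course%20030207_01/\n",
]

rows = dedupe_by_query(sample_lines)
for row in rows:
    print(row)
```

Only the query goes into the set, so rows sharing an IP are left alone; the second 'cgi perl tutorial' row is the only one dropped.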

Answer 2

Use a dict, with the query as the key and the entire list as the value. Something like this (untested):

sweet = {}

for line in f:
    ...
    query = lbreak[2]
    if query not in sweet:
        sweet[query] = lbreak

If you wanted the last instance of each query instead of the first, you could just drop the if and do the assignment unconditionally.
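
A runnable sketch of the dict approach, covering both the first-instance and last-instance variants. It assumes the rows have already been split into lists as in the question. The function name index_by_query, its keep parameter and sample_rows are hypothetical, introduced only for this example.

```python
def index_by_query(rows, keep="first"):
    """Map each search query (row[2]) to one full row.

    keep="first" retains the first row seen for a query;
    keep="last" overwrites it so the last row wins.
    """
    by_query = {}
    for row in rows:
        query = row[2]
        if keep == "last" or query not in by_query:
            by_query[query] = row
    return by_query


# Hypothetical sample rows, taken from the output shown in the question.
sample_rows = [
    ["4.16.189.59", "05:48:58", "which is better http upload or ftp upload",
     "www.ewebtribe.com/htmlhelp/uploading.htm"],
    ["4.16.189.59", "06:50:49", "cgi perl tutorial", "www.cgi101.com/class/"],
    ["4.16.189.59", "07:16:28", "cgi perl tutorial",
     "www.free-ed.net/fr03/lfc/course%20030207_01/"],
]

first = index_by_query(sample_rows)
last = index_by_query(sample_rows, keep="last")

# The de-duplicated rows are the dict's values.
for row in first.values():
    print(row)
```

On Python 3.7 and later, dicts preserve insertion order, so first.values() yields the rows in order of first appearance. On older versions the order is not guaranteed, and the set-based answer is the safer choice if order matters.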

Answer 1: score 3, accepted, 2016-09-01 03:12:38

Answer 2: score 1, 2016-09-01 03:09:49