Removing duplicates without set()

I have a .txt file of IPs, times, search queries, and websites accessed. I used a for loop to split each line into the elements of a list, and then placed all of these lists into a larger list.

When printed, it may look like this:

['4.16.159.114', '08:13:37', 'french-english dictionary', 'humanities.uchicago.edu/forms_unrest/FR-ENG.html\n']
['4.16.186.203', '00:13:54', 's.e.t.i.', 'www.seti.net/\n']
['4.16.189.59', '05:48:58', 'which is better http upload or ftp upload', 'www.ewebtribe.com/htmlhelp/uploading.htm\n']
['4.16.189.59', '06:50:49', 'cgi perl tutorial', 'www.cgi101.com/class/\n']
['4.16.189.59', '07:16:28', 'cgi perl tutorial', 'www.free-ed.net/fr03/lfc/course%20030207_01/\n']

My code for getting this far looks like this. It just reads the data from a text file, puts it into a list, and then writes the list to another text file.

import io

f = io.open(r'C:\Users\Ryan Asher\Desktop\%23AlltheWeb_2001.txt', encoding="Latin-1")
p = io.open(r'C:\Users\Ryan Asher\Desktop\workfile.txt', 'w')

sweet = [] 

for line in f:
    x = line.split("     ")
    lbreak = x[0].split("\t")
    sweet.append(lbreak)

for item in sweet:
    p.write("%s\n" % item)

My issue is with the third element of each list within the sweet list, i.e. index [2], which is the search query (french-english dictionary, seti, etc.). I do not want duplicate queries in the sweet list.

So where 'cgi perl tutorial' appears twice, I need to get rid of the later 'cgi perl tutorial' entry and keep only the first one in the sweet list.

I don't think I can use set for this, because I only want it to apply to the third element (the search query), and I don't want it to get rid of entries that share the same IP or any of the other fields.

Add lbreak[2] to a set, and only append a line whose lbreak[2] is not already in the set, something like:

sweet = [] 
seen = set()

for line in f:
    x = line.split("     ")
    lbreak = x[0].split("\t")
    if lbreak[2] not in seen:
        sweet.append(lbreak)
        seen.add(lbreak[2])
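
To try the same idea without the input file, here is a minimal self-contained sketch. It uses rows copied from the question's printed output (with the trailing newlines stripped); the function name dedupe_by_query, the key_index parameter, and the length check for malformed rows are assumptions added for illustration and are not part of the original code.

```python
# Sample rows taken from the question's printed output (newlines stripped).
rows = [
    ['4.16.159.114', '08:13:37', 'french-english dictionary',
     'humanities.uchicago.edu/forms_unrest/FR-ENG.html'],
    ['4.16.186.203', '00:13:54', 's.e.t.i.', 'www.seti.net/'],
    ['4.16.189.59', '05:48:58', 'which is better http upload or ftp upload',
     'www.ewebtribe.com/htmlhelp/uploading.htm'],
    ['4.16.189.59', '06:50:49', 'cgi perl tutorial', 'www.cgi101.com/class/'],
    ['4.16.189.59', '07:16:28', 'cgi perl tutorial',
     'www.free-ed.net/fr03/lfc/course%20030207_01/'],
]


def dedupe_by_query(rows, key_index=2):
    """Keep only the first row for each distinct value at key_index.

    Hypothetical helper: the name and the key_index parameter are
    illustrative, not from the original answer.
    """
    seen = set()      # queries encountered so far
    unique = []       # rows kept, in their original order
    for row in rows:
        if len(row) <= key_index:
            continue  # assumption: silently skip rows with too few fields
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


sweet = dedupe_by_query(rows)
for item in sweet:
    print(item)
```

The set only ever holds the query strings, so rows that share an IP or a time are left alone; only a repeated query causes a row to be dropped.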

Use a dict, with the query as the key and the entire list as the value. Something like this (untested):

sweet = {}

for line in f:
    ...
    query = lbreak[2]
    if query not in sweet:
        sweet[query] = lbreak

If you wanted the last instance of each query instead of the first, you could just drop the if and do the assignment unconditionally.
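
The two dict variants can be compared side by side in a short sketch. The function names first_per_query and last_per_query and the sample rows are assumptions chosen for illustration; the rows are copied from the question's printed output. It also relies on dicts preserving insertion order, which Python guarantees from version 3.7 onward.

```python
# Sample rows from the question; two of them share the same query.
rows = [
    ['4.16.186.203', '00:13:54', 's.e.t.i.', 'www.seti.net/'],
    ['4.16.189.59', '06:50:49', 'cgi perl tutorial', 'www.cgi101.com/class/'],
    ['4.16.189.59', '07:16:28', 'cgi perl tutorial',
     'www.free-ed.net/fr03/lfc/course%20030207_01/'],
]


def first_per_query(rows, key_index=2):
    """Map each query to the first row that contains it (hypothetical helper)."""
    result = {}
    for row in rows:
        query = row[key_index]
        if query not in result:   # keep the earliest row only
            result[query] = row
    return result


def last_per_query(rows, key_index=2):
    """Map each query to the last row that contains it (hypothetical helper)."""
    result = {}
    for row in rows:
        result[row[key_index]] = row   # later rows overwrite earlier ones
    return result


first = first_per_query(rows)
last = last_per_query(rows)

# list(d.values()) turns the dict back into a list of rows.
for item in first.values():
    print(item)
```

Compared with the set approach, the dict keeps the query and its row together, which may be convenient if you later need to look up a row by its query.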
