Removing duplicates without set()
I have a .txt file of IPs, times, search queries, and websites accessed. I used a for loop to split each line into the fields of a list, then placed all of these lists into a larger list.
When printed, it looks like this:
['4.16.159.114', '08:13:37', 'french-english dictionary', 'humanities.uchicago.edu/forms_unrest/FR-ENG.html\n']
['4.16.186.203', '00:13:54', 's.e.t.i.', 'www.seti.net/\n']
['4.16.189.59', '05:48:58', 'which is better http upload or ftp upload', 'www.ewebtribe.com/htmlhelp/uploading.htm\n']
['4.16.189.59', '06:50:49', 'cgi perl tutorial', 'www.cgi101.com/class/\n']
['4.16.189.59', '07:16:28', 'cgi perl tutorial', 'www.free-ed.net/fr03/lfc/course%20030207_01/\n']
My code for getting this far is below. It reads the data from a text file, puts it into a list, and then writes the list to another text file.
import io

f = io.open(r'C:\Users\Ryan Asher\Desktop\%23AlltheWeb_2001.txt', encoding="Latin-1")
p = io.open(r'C:\Users\Ryan Asher\Desktop\workfile.txt', 'w')
sweet = []
for line in f:
    x = line.split(" ")
    lbreak = x[0].split("\t")  # [IP, time, search query, website]
    sweet.append(lbreak)
for item in sweet:
    p.write("%s\n" % item)
My issue is with the third element of each inner list in sweet (index [2]), which is the search query (french-english dictionary, seti, etc.). I do not want the same query to appear more than once in the sweet list.
So where 'cgi perl tutorial' appears twice, I need to drop the later entry and keep only the first one in the sweet list.
I don't think I can use set for this, because I only want it to apply to the search query at index [2]; I don't want to remove rows that merely share an IP or one of the other fields.
Add lbreak[2] to a set, and only append a line whose lbreak[2] is not already in the set, something like:
sweet = []
seen = set()
for line in f:
    x = line.split(" ")
    lbreak = x[0].split("\t")
    if lbreak[2] not in seen:
        sweet.append(lbreak)
        seen.add(lbreak[2])
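To try the idea without the original log file, here is a minimal self-contained sketch. The function name `dedupe_by_query`, the `key_index` parameter, and the length check are illustrative assumptions, not part of the original code; the sample rows are taken from the question with the trailing newlines removed.

```python
def dedupe_by_query(rows, key_index=2):
    """Keep only the first row seen for each value at key_index."""
    seen = set()
    unique = []
    for row in rows:
        # Assumption: rows too short to contain the key field are skipped.
        if len(row) <= key_index:
            continue
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    ['4.16.189.59', '05:48:58', 'which is better http upload or ftp upload',
     'www.ewebtribe.com/htmlhelp/uploading.htm'],
    ['4.16.189.59', '06:50:49', 'cgi perl tutorial', 'www.cgi101.com/class/'],
    ['4.16.189.59', '07:16:28', 'cgi perl tutorial',
     'www.free-ed.net/fr03/lfc/course%20030207_01/'],
]

result = dedupe_by_query(rows)
for row in result:
    print(row)
```

Only the query field goes into the set, so rows that share an IP are kept; the second 'cgi perl tutorial' row is the only one dropped.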
Use a dict, with the query as the key and the entire list as the value. Something like this (untested):
sweet = {}
for line in f:
    ...
    query = lbreak[2]
    if query not in sweet:
        sweet[query] = lbreak
If you wanted the last instance of each query instead of the first, you could just drop the if and do the assignment unconditionally.
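As a rough sketch of both variants on in-memory data: the helper names `first_by_query` and `last_by_query` are made up for illustration, and the rows are assumed to be already split into [IP, time, query, website]. This relies on dicts preserving insertion order, which is guaranteed from Python 3.7; on older versions `collections.OrderedDict` would be needed to keep the rows in file order.

```python
def first_by_query(rows, key_index=2):
    """Map each query to the first row that contains it."""
    first = {}
    for row in rows:
        # setdefault only stores the row if the key is not present yet.
        first.setdefault(row[key_index], row)
    return first


def last_by_query(rows, key_index=2):
    """Map each query to the last row that contains it."""
    last = {}
    for row in rows:
        # Unconditional assignment overwrites any earlier row.
        last[row[key_index]] = row
    return last


sample = [
    ['4.16.186.203', '00:13:54', 's.e.t.i.', 'www.seti.net/'],
    ['4.16.189.59', '06:50:49', 'cgi perl tutorial', 'www.cgi101.com/class/'],
    ['4.16.189.59', '07:16:28', 'cgi perl tutorial',
     'www.free-ed.net/fr03/lfc/course%20030207_01/'],
]

first_rows = list(first_by_query(sample).values())
last_rows = list(last_by_query(sample).values())
print(first_rows)
print(last_rows)
```

Calling `list(d.values())` turns the dict back into a list of rows if the rest of the code expects one. In the "last" variant, each query keeps the position where it first appeared, even though the stored row is the later one.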