
Python sort unique list of lists' items

I can't seem to find a question on SO about my particular problem, so forgive me if this has been asked before!

Anyway, I'm writing a script to loop through a set of URLs and give me a list of unique URLs with unique parameters.

The trouble I'm having is actually comparing the parameters to eliminate duplicates. It's a bit hard to explain, so some examples are probably in order:

Say I have a list of URLs like this:

  • hxxp://www.somesite.com/page.php?id=3&title=derp
  • hxxp://www.somesite.com/page.php?id=4&title=blah
  • hxxp://www.somesite.com/page.php?id=3&c=32&title=thing
  • hxxp://www.somesite.com/page.php?b=33&id=3

I have it parsing each URL into a list of parameter names, so eventually I have a list of lists like this:

sort = [['id', 'title'], ['id', 'c', 'title'], ['b', 'id']]
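
For readers following along, here is a rough sketch of how a list like the one above might be built. It is not the asker's actual code: the helper name `param_names` and the variable `param_lists` are made up for illustration, and it assumes Python 3's `urllib.parse` and that only parameter names (not values) matter.

```python
from urllib.parse import urlsplit, parse_qsl

def param_names(url):
    """Return the query parameter names of a URL, in order of appearance."""
    query = urlsplit(url).query
    # keep_blank_values=True so that parameters like "id=" are not dropped
    return [name for name, _ in parse_qsl(query, keep_blank_values=True)]

urls = [
    "hxxp://www.somesite.com/page.php?id=3&title=derp",
    "hxxp://www.somesite.com/page.php?id=4&title=blah",
    "hxxp://www.somesite.com/page.php?id=3&c=32&title=thing",
    "hxxp://www.somesite.com/page.php?b=33&id=3",
]

# Collect each distinct list of names once, keeping first-seen order
seen = set()
param_lists = []
for url in urls:
    names = param_names(url)
    key = tuple(names)  # lists are unhashable, so use a tuple as the set key
    if key not in seen:
        seen.add(key)
        param_lists.append(names)
```

Note that this only drops exact repeats of the same name list; it does not yet handle the subset case the question is about.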

I need to figure out a way to end up with just 2 lists in my list at that point:

new = [['id', 'c', 'title'], ['b', 'id']]

Right now I've got a bit of code that sorts it out a little. I know I'm close, and I've been banging my head against this for a couple of days now :(. Any ideas?

Thanks in advance! :)

EDIT: Sorry for not being clear! This script is aimed at finding unique entry points for web applications post-spidering. Basically, if a URL has 3 unique entry points:

['id', 'c', 'title']

I'd prefer that to the same link with 2 unique entry points, such as:

['id', 'title']

So I need my new list of lists to eliminate the one with 2 and prefer the one with 3, but ONLY if all of the smaller list's variables are in the larger one. If it's still unclear let me know, and thank you for the quick responses! :)

I'll assume that subsets are considered "duplicates" (non-commutatively, of course)...

Start by converting each query into a set and ordering them all from largest to smallest. Then add each query to a new list if it isn't a subset of an already-added query. Since any set is a subset of itself, this logic covers exact duplicates:

a = []
for q in sorted((set(q) for q in sort), key=len, reverse=True):
    if not any(q.issubset(Q) for Q in a):
        a.append(q)
a = [list(q) for q in a] # Back to lists, if you want
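
One thing to be aware of with the snippet above: converting sets back to lists does not guarantee the original parameter order. If that matters, a variation along these lines might work. It is only a sketch of the same idea; the function name `drop_subsets` is made up here, and it keeps the original lists and builds sets just for the comparison.

```python
def drop_subsets(param_lists):
    """Keep only lists whose names are not all contained in a larger kept list."""
    kept = []
    # Largest first, so a list can only be a subset of something already kept.
    # sorted() is stable, so equal-sized lists stay in their original order.
    for names in sorted(param_lists, key=lambda n: len(set(n)), reverse=True):
        if not any(set(names) <= set(other) for other in kept):
            kept.append(names)
    return kept

sort = [['id', 'title'], ['id', 'c', 'title'], ['b', 'id']]
new = drop_subsets(sort)
print(new)  # [['id', 'c', 'title'], ['b', 'id']]
```

Like the original, this compares every candidate against every kept list, so it is quadratic in the number of distinct parameter lists, which should be fine for the output of a spider run.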
