
Checking if element in list by substring

I have a list of urls (unicode), and there is a lot of repetition. For example, the urls http://www.myurlnumber1.com and http://www.myurlnumber1.com/foo+%bar%baz%qux lead to the same place.

So I need to weed out all of those duplicates.

My first idea was to check if the element's substring is in the list, like so:

for url in list:
    if url[:30] not in list:
        print(url)

However, it tries to match the literal url[:30] against each list element and obviously returns all of them, since no element exactly matches url[:30].

Is there an easy way to solve this problem?

EDIT:

Often the host and path in the urls stay the same, but the parameters are different. For my purposes, urls with the same hostname and path but different parameters are still the same url and constitute duplicates.

If you consider any urls with the same netloc to be the same, you can parse them with urllib.parse:

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

u = "http://www.myurlnumber1.com/foo+%bar%baz%qux"

print(urlparse(u).netloc)

Which would give you:

www.myurlnumber1.com

So to get unique netlocs you could do something like:

unique  = {urlparse(u).netloc for u in urls}

If you wanted to keep the url scheme:

urls  = ["http://www.myurlnumber1.com/foo+%bar%baz%qux", "http://www.myurlnumber1.com"]

unique = {"{}://{}".format(u.scheme, u.netloc) for u in map(urlparse, urls)}
print(unique)

This presumes they all have schemes, and that you don't have both http and https urls for the same netloc that you consider to be the same.
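The "they all have schemes" assumption matters because urlparse treats a scheme-less string as a bare path. A minimal sketch, reusing the example hostname from the question with a shortened path:

```python
from urllib.parse import urlparse

with_scheme = urlparse("http://www.myurlnumber1.com/foo")
without_scheme = urlparse("www.myurlnumber1.com/foo")

# With a scheme, the host lands in netloc.
print(repr(with_scheme.netloc))

# Without a scheme, netloc is empty and the host ends up inside path.
print(repr(without_scheme.netloc))
print(repr(without_scheme.path))
```

So a set built from scheme and netloc alone would collapse every scheme-less url into the same entry.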

If you also want to add the path:

unique = {(u.netloc, u.path) for u in map(urlparse, urls)}

The table of attributes is listed in the docs:

Attribute   Index   Value                                 Value if not present
scheme      0       URL scheme specifier                  scheme parameter
netloc      1       Network location part                 empty string
path        2       Hierarchical path                     empty string
params      3       Parameters for last path element      empty string
query       4       Query component                       empty string
fragment    5       Fragment identifier                   empty string
username            User name                             None
password            Password                              None
hostname            Host name (lower case)                None
port                Port number as integer, if present    None
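To see how the attributes in the table map onto a concrete url, here is a small sketch. The url itself is made up for illustration so that most fields are populated:

```python
from urllib.parse import urlparse

# Hypothetical url, chosen only to fill in as many fields as possible.
parts = urlparse("http://user:secret@www.Example.com:8080/a/b;p?x=1&y=2#top")

# The first six attributes are also reachable by index.
for index, name in enumerate(("scheme", "netloc", "path", "params", "query", "fragment")):
    print(index, name, repr(parts[index]))

# The remaining attributes are derived from netloc and have no index.
print(parts.username, parts.password, parts.hostname, parts.port)
```

Note that hostname is lower-cased while netloc keeps the original casing, which can matter when you pick a key for de-duplication.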

You just need to use whatever you consider to be the unique parts.

In [1]: from urllib.parse import  urlparse

In [2]: urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz", "www.url.com/baz-qux",  "www.url.com/foo-bar?t=baz"]


In [3]: unique = {"".join((u.netloc, u.path)) for u in map(urlparse, urls)}

In [4]: print(unique)
{'www.url.com/baz-qux', 'www.url.com/foo-bar'}
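The set above only holds the netloc + path keys and loses both the original urls and their order. If you want to keep the first full url seen for each key, one possible sketch follows; the helper name dedupe_by_host_and_path is hypothetical, not part of urllib:

```python
from urllib.parse import urlparse

def dedupe_by_host_and_path(urls):
    """Return the first url seen for each netloc + path key, in input order."""
    seen = set()
    kept = []
    for url in urls:
        parts = urlparse(url)
        # Same key as in the session above: query and fragment are ignored.
        key = parts.netloc + parts.path
        if key not in seen:
            seen.add(key)
            kept.append(url)
    return kept

urls = ["http://www.url.com/foo-bar", "http://www.url.com/foo-bar?t=baz",
        "www.url.com/baz-qux", "www.url.com/foo-bar?t=baz"]
print(dedupe_by_host_and_path(urls))
# ['http://www.url.com/foo-bar', 'www.url.com/baz-qux']
```

Concatenating netloc and path is what lets the scheme-less entries match their http counterparts here, since for those the host ends up at the start of path.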

You can try adding another for loop, if you are fine with that. Something like:

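kept = []  # urls whose 30-character prefix has not been seen before
for url in urls:
    for other in kept:
        if other[:30] == url[:30]:
            break  # prefix already seen, so this is a duplicate
    else:
        kept.append(url)
        print(url)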

That will compare every url with the others to check for sameness. That's just an example, I'm sure you could make it more robust.
