
Remove duplicates and substrings from a list of strings

Let's say that I have a list given by:

a = [
    'www.google.com',
    'google.com',
    'tvi.pt',
    'ubs.ch',
    'google.it',
    'www.google.com'
]

I want to remove the duplicates and the substrings to keep a list like:

b = [
    'www.google.com',
    'tvi.pt',
    'ubs.ch',
    'google.it'
]

Do you know an efficient way to do that?

The goal is to keep the longer string, which is why www.google.com is preferred over google.com.

This solution can be edited to better suit your needs. Edit the get_domain function to change the grouping condition* and the choose_item function to change how the best item of each group is chosen.

from itertools import groupby

a = ['www.google.com', 'google.com', 'tvi.pt', 'ubs.ch', 'google.it', 'www.google.com']

def get_domain(url):
    # Example: 'www.google.com' -> 'google.com'
    return '.'.join(url.split('.')[-2:])

def choose_item(iterable):
    # Ex. input: ['www.google.com', 'google.com', 'www.google.com']
    # Ex. output: 'www.google.com' (the longest string)
    return max(iterable, key=len)

results = []
for domain, grp in groupby(sorted(a, key=get_domain), key=get_domain):
    results.append(choose_item(grp))

print(results)

Output:

['www.google.com', 'google.it', 'tvi.pt', 'ubs.ch']
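The same grouping can also be done in a single pass with a dict, avoiding the sort that groupby requires. A minimal sketch (my own variant, not from the original answer) that reuses the same last-two-labels heuristic and keeps the longest URL seen for each key:

```python
def get_domain(url):
    # Same naive heuristic as above: the last two dot-separated labels.
    return '.'.join(url.split('.')[-2:])

def longest_per_domain(urls):
    best = {}
    for url in urls:
        key = get_domain(url)
        # Replace the stored URL only if the new one is strictly longer.
        if key not in best or len(url) > len(best[key]):
            best[key] = url
    return list(best.values())

a = ['www.google.com', 'google.com', 'tvi.pt', 'ubs.ch', 'google.it', 'www.google.com']
print(longest_per_domain(a))
# -> ['www.google.com', 'tvi.pt', 'ubs.ch', 'google.it']
```

Because dict keys are insertion-ordered, the result preserves the order in which each domain was first seen.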

*Another answer suggests the tld library.

If what you are looking for is a list of unique first-level domains, given an arbitrary list of URLs, take a look at the tld module. It will make things easier for you.

Based on the documentation, here is a snippet that you can adapt for your needs:

from tld import get_fld

urls = [
    'www.google.com',
    'google.com',
    'tvi.pt',
    'ubs.ch',
    'google.it',
    'www.google.com'
]

unique_domains = list({
    get_fld(url, fix_protocol=True) for url in urls
})

The code above will set unique_domains to:

['ubs.ch', 'google.it', 'tvi.pt', 'google.com']

You can remove the duplicates as follows:

f = list(dict.fromkeys(a))

This will filter out the duplicate 'www.google.com' but not the substrings. This would need more clarification, as Captain Caveman wrote in his comment.
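To illustrate: dict.fromkeys keeps only the first occurrence of each exact string (dict keys are unique and insertion-ordered in Python 3.7+), so the substring 'google.com' survives alongside 'www.google.com':

```python
a = ['www.google.com', 'google.com', 'tvi.pt', 'ubs.ch', 'google.it', 'www.google.com']

# Drops only exact duplicates, preserving the original order.
f = list(dict.fromkeys(a))
print(f)
# -> ['www.google.com', 'google.com', 'tvi.pt', 'ubs.ch', 'google.it']
```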

def remove_duplicates_and_substrings(items):
    output = []
    for i in items:
        if i not in output:
            if not any(i in s for s in output):
                output.append(i)
    return output

It may not be the best approach, but it does exactly what you want it to do. It works by first checking that the string from the input list is not already in the output list. Then it checks whether the string is a substring of one of the output strings. If neither is the case, it adds the string to the output list.
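One caveat (my observation, not from the original answer): the loop only drops a string when a longer superstring is already in output, so if 'google.com' appears before 'www.google.com' in the input, both are kept. Visiting the longest strings first makes the result independent of input order; a sketch of that variant:

```python
def remove_substrings_any_order(items):
    output = []
    # Longest first: any substring of a kept string is visited only
    # after its superstring is already in output.
    for i in sorted(items, key=len, reverse=True):
        # 'i in s' is also True when i == s, so this covers exact duplicates.
        if not any(i in s for s in output):
            output.append(i)
    return output

print(remove_substrings_any_order(
    ['google.com', 'www.google.com', 'tvi.pt', 'ubs.ch', 'google.it']
))
# -> ['www.google.com', 'google.it', 'tvi.pt', 'ubs.ch']
```

Note that the result is ordered by length rather than by the input order.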
