
How to Compare Substrings of Two Lists in Python

I have two lists, shortened for this example:

l1 = ['Chase Bank', 'Bank of America']

l2 = ['Chase Mobile: Bank & Invest', 'Elevations Credit Union Mobile']

I am trying to generate a list of the items in l1 that are not in l2. In this case, 'Bank of America' would be the only item returned.

'Chase Bank' (from l1) and 'Chase Mobile: Bank & Invest' (from l2) count as the same because they both contain the keyword 'Chase', so 'Chase Bank' wouldn't go into the exclusion list. But 'Bank of America' should go into the list, even though 'Bank' appears in both 'Bank of America' and 'Bank & Invest'.

I have tried using set, a plain for loop with if/in, and any with a list comprehension. I have also tried regex, but matching the pattern of substrings from one list to the other is proving to be very difficult for me.

Is this possible with Python, or should I broaden my approach?

Use a list comprehension and re.sub to remove all undesired substrings from the elements of your first list. Here, I remove bank, case-insensitively, together with any whitespace before and after it. Then use another list comprehension to drop every element whose shortened form is found inside an element of the second list. Use enumerate to get both the index and the element from the list. Also, wrap the second list in a set, which is optional and makes the code faster for long and/or repetitive lists.

import re

lst1 = ['Chase Bank', 'Chase bank', 'Bank of America']
lst2 = ['Chase Mobile: Bank & Invest', 'Elevations Credit Union Mobile']
lst1_short = [re.sub(r'(?i)\s*\bbank\b\s*', '', s) for s in lst1]
print(lst1_short)
# ['Chase', 'Chase', 'of America']

lst1 = [s for i, s in enumerate(lst1) if
      not any(x for x in set(lst2) if lst1_short[i] in x)]
print(lst1)
# ['Bank of America']

Note: you can extend your list of stop words (here, only bank) using regular expressions. For example:

re.sub(r'(?i)\s*\b(bank|credit union|institution for savings)\b\s*', '', s)
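
As a minimal sketch of how the extended stop-word pattern could be packaged for reuse, the block below builds the regex from a list and applies the same two steps as above. The names STOP_WORDS, strip_stop_words and not_in_other are hypothetical, as is the extra 'Elevations Credit Union' entry added to the sample data; none of them come from the original answer.

```python
import re

# Hypothetical stop-word list; extend it to suit your data.
STOP_WORDS = ["bank", "credit union", "institution for savings"]

# Escape each entry so that any regex metacharacters are matched literally.
STOP_PATTERN = re.compile(
    r"\s*\b(?:" + "|".join(re.escape(w) for w in STOP_WORDS) + r")\b\s*",
    flags=re.IGNORECASE,
)

def strip_stop_words(name):
    """Return name with every stop word (and the whitespace around it) removed."""
    return STOP_PATTERN.sub("", name)

def not_in_other(names, others):
    """Keep the names whose stripped form is not a substring of any entry in others."""
    kept = []
    for name in names:
        key = strip_stop_words(name)
        # An empty key would be a substring of everything, so treat it as "no match".
        if not key or not any(key in other for other in others):
            kept.append(name)
    return kept

lst1 = ['Chase Bank', 'Chase bank', 'Bank of America', 'Elevations Credit Union']
lst2 = ['Chase Mobile: Bank & Invest', 'Elevations Credit Union Mobile']
print(not_in_other(lst1, lst2))
# ['Bank of America']
```

Compiling the pattern once avoids rebuilding it for every element. The empty-key guard is a design choice of this sketch: a name that consists only of stop words is kept rather than silently matched against everything.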

You should try something like this:

l1 = ['Chase Bank', 'Bank of America']
l2 = ['Chase Mobile: Bank & Invest', 'Elevations Credit Union Mobile']

def similar_substrings(l1, l2):
    # Split every string into its space-separated words.
    words1 = [s.split(" ") for s in l1]
    words2 = [s.split(" ") for s in l2]
    words_in = []

    for string, words in zip(l1, words1):
        for other_words in words2:
            # A match means every word of the l1 string occurs in the l2 string.
            if all(word in other_words for word in words):
                words_in.append(string)
                break  # one match is enough; avoids duplicate entries

    return words_in

print(similar_substrings(l1, l2))
# ['Chase Bank']

This only checks whether every word of a string from l1 appears in a string from l2, but you can modify it pretty easily to check the inclusion in both directions.
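
The question asks for the items of l1 that have no counterpart in l2, which is the complement of what similar_substrings collects. A minimal sketch of that variant, using set inclusion on the words, might look like the following. The helper names words_of and not_contained are hypothetical, and punctuation is left attached to words (for example 'Mobile:'), which is an assumption that may not suit real data.

```python
def words_of(s):
    """Split a string on whitespace and return its words as a set."""
    return set(s.split())

def not_contained(l1, l2):
    """Return the strings of l1 whose words are not all found in any single string of l2."""
    word_sets = [words_of(s) for s in l2]
    # `a <= b` on sets is the subset test: every word of a is also in b.
    return [s for s in l1 if not any(words_of(s) <= ws for ws in word_sets)]

l1 = ['Chase Bank', 'Bank of America']
l2 = ['Chase Mobile: Bank & Invest', 'Elevations Credit Union Mobile']
print(not_contained(l1, l2))
# ['Bank of America']
```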

You can do it with a list comprehension:

l2_chase = any('Chase' in j for j in l2)
[i for i in l1 if not ('Chase' in i and l2_chase)]

Output:

['Bank of America']
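
The snippet above hardcodes the single keyword 'Chase'. If several keywords are known in advance, the same idea could be generalized roughly as follows. This is only a sketch: the function name exclude_by_keywords and the keyword list are hypothetical and would have to be supplied by you.

```python
def exclude_by_keywords(l1, l2, keywords):
    """Drop the entries of l1 that share one of the given keywords with some entry of l2."""
    # Keywords that occur in at least one entry of l2.
    present = [k for k in keywords if any(k in j for j in l2)]
    # Keep only the l1 entries that contain none of those keywords.
    return [i for i in l1 if not any(k in i for k in present)]

l1 = ['Chase Bank', 'Bank of America']
l2 = ['Chase Mobile: Bank & Invest', 'Elevations Credit Union Mobile']
keywords = ['Chase', 'Elevations']  # hypothetical keyword list
print(exclude_by_keywords(l1, l2, keywords))
# ['Bank of America']
```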
