
Find all pairs of strings in two lists that contain no common characters

I have two lists of strings and want to find all pairs of strings between them that contain no common characters. For example:

list1 = ['abc', 'cde']
list2 = ['aij', 'xyz', 'abc']

desired output = [('abc', 'xyz'), ('cde', 'aij'), ('cde', 'xyz')]

I need this to be as efficient as possible because I am processing lists with millions of strings. At the moment, my code follows this general pattern:

output = []

for str1 in list1:
    for str2 in list2:
        if len(set(str1) & set(str2)) == 0:
            output.append((str1, str2))

This is O(n^2) and takes many hours to run. Does anyone have suggestions for how to speed this up? Perhaps there is a way to take advantage of the characters in each string being sorted?

Thank you very much in advance!

Here's another tack, focusing on lowering the set operations to bit twiddling and on grouping words that represent the same set of letters:

import collections
import string


def build_index(words):
    index = collections.defaultdict(list)
    for word in words:
        # chi is a 26-bit mask with one bit set per distinct lowercase letter in word
        chi = sum(1 << string.ascii_lowercase.index(letter) for letter in set(word))
        index[chi].append(word)
    return index


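def disjoint_pairs(words1, words2):
    index1 = build_index(words1)
    index2 = build_index(words2)
    for chi1, group1 in index1.items():
        for chi2, group2 in index2.items():
            # A non-zero AND means the two letter sets share at least one letter.
            if chi1 & chi2:
                continue
            # Every word in group1 is disjoint from every word in group2.
            for word1 in group1:
                for word2 in group2:
                    yield word1, word2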


print(list(disjoint_pairs(["abc", "cde"], ["aij", "xyz", "abc"])))
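To make the bit trick concrete, here is a minimal sketch of how a single mask is built and compared. The helper name letter_mask is only for illustration and is not part of the answer's code, and it assumes the words contain only lowercase ASCII letters.

```python
import string


def letter_mask(word):
    # One bit per distinct letter: 'a' -> bit 0, 'b' -> bit 1, ..., 'z' -> bit 25.
    # Assumption: word contains only lowercase ASCII letters.
    mask = 0
    for letter in set(word):
        mask |= 1 << string.ascii_lowercase.index(letter)
    return mask


print(bin(letter_mask("abc")))                        # 0b111
print(letter_mask("abc") & letter_mask("xyz") == 0)   # True: no shared letter
print(letter_mask("abc") & letter_mask("cde") == 0)   # False: both contain 'c'
print(letter_mask("cab") == letter_mask("abc"))       # True: same letter set, same group
```

Words with the same letter set land in the same group, so the mask comparison is done once per pair of groups rather than once per pair of words.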

Try this and tell me if there is any improvement:

import itertools

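# set(a).isdisjoint(b) also works when a word repeats a letter (e.g. 'aab'),
# which a length comparison on the concatenated strings would wrongly reject
[(a, b) for a, b in itertools.product(list1, list2) if set(a).isdisjoint(b)]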

Output:

[('abc', 'xyz'), ('cde', 'aij'), ('cde', 'xyz')]

It's tricky to analyze the running time of this algorithm, but it's what I'd try first. The idea is that, given a letter, we can split the problem into three subproblems: (words without the letter, words without the letter), (words without the letter, words with the letter), (words with the letter, words without the letter). The fourth combination, (words with the letter, words with the letter), is dropped because every such pair shares the letter. The code below chooses this letter (the "pivot") to maximize the number of pairs eliminated. In the base case, no pair can be eliminated, and we just output all pairs.

Python 3, optimized for readability over running time.

import collections


def frequencies(words):
    return collections.Counter(letter for word in words for letter in set(word))


def partition(pivot, words):
    return (
        [word for word in words if pivot not in word],
        [word for word in words if pivot in word],
    )


def disjoint_pairs(words1, words2):
    freq1 = frequencies(words1)
    freq2 = frequencies(words2)
    pivots = set(freq1.keys()) & set(freq2.keys())
    if pivots:
        pivot = max(pivots, key=lambda letter: freq1[letter] * freq2[letter])
        no1, yes1 = partition(pivot, words1)
        no2, yes2 = partition(pivot, words2)
        yield from disjoint_pairs(no1, no2)
        yield from disjoint_pairs(no1, yes2)
        yield from disjoint_pairs(yes1, no2)
    else:
        for word1 in words1:
            for word2 in words2:
                yield (word1, word2)


print(list(disjoint_pairs(["abc", "cde"], ["aij", "xyz", "abc"])))
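As a rough illustration of the pivot choice, the score freq1[letter] * freq2[letter] is exactly the number of (word with the letter, word with the letter) pairs, which are the pairs that get discarded without being checked. The sketch below recomputes those scores for the example lists; the names words1, words2 and scores are only for illustration.

```python
import collections

words1 = ["abc", "cde"]
words2 = ["aij", "xyz", "abc"]


def frequencies(words):
    # Number of words that contain each letter (each word counts once per letter).
    return collections.Counter(letter for word in words for letter in set(word))


freq1 = frequencies(words1)
freq2 = frequencies(words2)

# Score of a candidate pivot = number of pairs eliminated by splitting on it.
scores = {
    letter: freq1[letter] * freq2[letter]
    for letter in set(freq1) & set(freq2)
}
print(sorted(scores.items()))  # [('a', 2), ('b', 1), ('c', 2)]
```

Here 'a' and 'c' tie with two eliminated pairs each, so either may be picked as the first pivot.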

You can use recursion with a generator:

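from functools import reduce

list1 = ['abc', 'cde']
list2 = ['aij', 'xyz', 'abc']


def pairs(d, c=()):
    # d: lists still to choose from; c: strings chosen so far
    if not d:
        # keep the combination only if the chosen strings have an empty intersection
        if not reduce(lambda x, y: set(x) & set(y), c):
            yield c
    else:
        # recurse lazily instead of building an intermediate list
        for k in d[0]:
            yield from pairs(d[1:], c + (k,))

print(list(pairs([list1, list2])))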

Output:

[('abc', 'xyz'), ('cde', 'aij'), ('cde', 'xyz')]

This answer uses functools.reduce in order to handle cases where the number of input lists is greater than two. That way, the set intersection of all the elements in a candidate combination can be computed more easily.
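For what it's worth, here is a small sketch of what that reduce call computes for three strings. The variable names are only for illustration. Note that it is the intersection common to all of the strings, so with three or more lists a combination is rejected only when some character appears in every chosen string, not when any two of them overlap.

```python
from functools import reduce


def common_chars(strings):
    # Intersection of the character sets of all strings, folded left to right.
    return reduce(lambda x, y: set(x) & set(y), strings)


shared_by_all = common_chars(['abc', 'cde', 'cxy'])
nothing_shared = common_chars(['abc', 'cde', 'xyz'])
pairwise_overlap_only = common_chars(['abc', 'ade', 'dxy'])

print(shared_by_all)          # {'c'}
print(nothing_shared)         # set()
print(pairwise_overlap_only)  # set(), even though 'ade' and 'dxy' share 'd'
```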

I was thinking about how to exploit the fact that the characters within each string are ordered and came up with the following crude idea:

Step 1: Sort the second list, but only by the first character of each string:

from itertools import product

list1 = ['abc', 'cde']
list2 = ['aij', 'xyz', 'abc']
list2 = sorted(list2, key=(lambda s: s[0]))

Step 2: For each character c in the alphabet, find the index of the first element s in list2 whose first character s[0] is larger than c. (Here I assume that all characters in the strings are lowercase letters of the alphabet!)

alphabet = 'abcdefghijklmnopqrstuvwxyz'
bounds = {}
for c in alphabet:
    bounds[c] = len(list2)
    for i, s in enumerate(list2):
        if c < s[0]:
            bounds[c] = i
            break
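As a possible alternative to the nested scan in Step 2, the same bounds can be looked up with binary search from the standard library's bisect module. This is only a sketch under the same assumptions (non-empty strings, lowercase letters, list2 already sorted by first character); the name first_chars is introduced here for illustration.

```python
from bisect import bisect_right

list2 = sorted(['aij', 'xyz', 'abc'], key=lambda s: s[0])
alphabet = 'abcdefghijklmnopqrstuvwxyz'

# First characters of list2, in the same (sorted) order as list2 itself.
first_chars = [s[0] for s in list2]

# bisect_right returns the index of the first entry strictly greater than c,
# or len(list2) if there is none, which is what the explicit loop computes.
bounds = {c: bisect_right(first_chars, c) for c in alphabet}

print(bounds['a'], bounds['c'], bounds['x'])  # 2 2 3
```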

Step 3: With that preparation, the iteration over the two lists can be optimised a little. For an element str1 from list1 you most likely don't have to check every element of list2. Because the characters within each string are sorted, the last character of str1 is its largest and the first character of each string in list2 is its smallest. Every string in list2 from index bounds[str1[-1]] onwards therefore contains only characters larger than anything in str1, so it is guaranteed to be disjoint from str1.

output = []
for str1 in list1:
    last_char = str1[-1]
    for str2 in list2[:bounds[last_char]]:
        if len(set(str1) & set(str2)) == 0:
            output.append((str1, str2))
    output += [*product([str1], list2[bounds[last_char]:])]
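Putting the three steps together, one way to sanity-check the idea is to wrap it in a function and compare it with the brute-force loop on random input. The sketch below does that; the function names and the random test data are made up for illustration, bisect stands in for the precomputed bounds table, and it assumes every string is non-empty with its characters in ascending order.

```python
from bisect import bisect_right
import random


def brute_force(list1, list2):
    # Reference implementation: check every pair.
    return [(a, b) for a in list1 for b in list2 if set(a).isdisjoint(b)]


def disjoint_pairs_sorted(list1, list2):
    # Assumption: each string is non-empty and its characters are sorted.
    list2 = sorted(list2, key=lambda s: s[0])
    first_chars = [s[0] for s in list2]
    output = []
    for str1 in list1:
        # Index of the first string in list2 that starts after str1's largest character.
        bound = bisect_right(first_chars, str1[-1])
        # Strings before the bound may overlap with str1 and must be checked.
        for str2 in list2[:bound]:
            if set(str1).isdisjoint(str2):
                output.append((str1, str2))
        # Strings from the bound onwards are disjoint from str1 by construction.
        output.extend((str1, str2) for str2 in list2[bound:])
    return output


def random_word(rng):
    letters = rng.sample('abcdefghijklmnopqrstuvwxyz', rng.randint(1, 4))
    return ''.join(sorted(letters))


rng = random.Random(0)
words1 = [random_word(rng) for _ in range(50)]
words2 = [random_word(rng) for _ in range(50)]

# Both approaches should find the same pairs (order may differ).
print(sorted(disjoint_pairs_sorted(words1, words2)) == sorted(brute_force(words1, words2)))
```

Whether this is actually faster will depend on the data: when most strings in list2 start with an early letter, nearly all of them still fall before the bound and have to be checked.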

Caveats: I don't know if that actually helps! You have to sort list2 and still have to collect the guaranteed combinations (this part: output += [*product([str1], list2[bounds[last_char]:])] ), and I don't know how much that costs.

PS: The code here is only intended to illustrate the idea (and I hope it doesn't contain any errors, it's late here). An actual implementation would look different.
