Check whether a string contains a certain number of words from another string

Question

Say we have a string 1 ABCDEF and a string 2 BDE (the letters are just for demonstration; in reality they are words). Now I would like to find out whether any n consecutive "words" from string 2 occur in string 1. To convert a string to "words", I'd use string.split().

For example, for n equal to 2, I would like to check whether BD or DE occurs, in this order, in string 1. BD does not occur in this order in the string, but DE does.

Does anyone see a Pythonic way of doing this?

I do have a solution for n equal to 2, but I realized that I need it for arbitrary n. It is also not particularly beautiful:

def string_contains_words_of_string(words_str, words_to_check_str):
    words = words_str.split()
    words_to_check = words_to_check_str.split()

    found_word_index = None
    for word in words:
        start = 0 if found_word_index is None else found_word_index + 1
        for i, word_to_check in enumerate(words_to_check[start:]):
            if word_to_check == word:
                if found_word_index is not None:
                    return True
                found_word_index = i
                break
            else:
                found_word_index = None
    return False

Answer 1

This is easy with a regex:

>>> import re
>>> st1='A B C D E F'
>>> st2='B D E'
>>> n=2
>>> pat=r'(?=({}))'.format(r'\s+'.join(r'\w+' for i in range(n)))
>>> print([(s, s in st1) for s in re.findall(pat, st2)])
[('B D', False), ('D E', True)]

The key is to use a zero-width lookahead to find overlapping matches in the string. So:

>>> re.findall('(?=(\\w+\\s+\\w+))', 'B D E')
['B D', 'D E']

Now build that for n repetitions of the word matched by \w+ with:

>>> n=2
>>> r'(?=({}))'.format(r'\s+'.join(r'\w+' for i in range(n)))
'(?=(\\w+\\s+\\w+))'

Now, since you have two strings, use Python's in operator to pair each match s from the regex with whether it occurs in the target string.

Of course, if you want a non-regex way to do this, just produce the substrings n words at a time:

>>> li=st2.split()
>>> n=2
>>> [(s, s in st1) for s in (' '.join(li[i:i+n]) for i in range(len(li)-n+1))]
[('B D', False), ('D E', True)]
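The comprehension above can be wrapped into a function that answers the original yes/no question for arbitrary n. This is only a sketch: the name has_common_run and its parameters are made up for illustration, and it compares tuples of words instead of substrings, on the assumption that partial-word hits (such as 'D E' inside 'AD EF') should not count.

```python
def has_common_run(words_str, words_to_check_str, n):
    """Return True if any n consecutive words of words_to_check_str
    appear, adjacent and in the same order, in words_str.

    Hypothetical helper, not part of the original answer.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    words = words_str.split()
    to_check = words_to_check_str.split()
    # All runs of n consecutive words in the first string, as tuples.
    runs = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    # any() stops at the first run of the second string found in the first.
    return any(tuple(to_check[i:i + n]) in runs
               for i in range(len(to_check) - n + 1))


print(has_common_run('A B C D E F', 'B D E', 2))  # True ('D E')
print(has_common_run('A B C D E F', 'B D E', 3))  # False
print(has_common_run('AD EF', 'D E', 2))          # False, no partial-word match
```

If either string has fewer than n words, the ranges are empty and the function simply returns False.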

And if you want the index (with either method), you can use str.find:

>>> [(s, st1.find(s)) for s in (' '.join(li[i:i+n]) for i in range(len(li)-n+1)) 
...     if s in st1]
[('D E', 6)]

For a regex that goes word by word, make sure you use a word boundary anchor:

>>> st='wordW wordX wordY wordZ'
>>> re.findall(r'(?=(\b\w+\s\b\w+))', st)
['wordW wordX', 'wordX wordY', 'wordY wordZ']
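Putting the pieces together, the word-boundary pattern can be generated for any n and the membership test can be made boundary-aware as well. A rough sketch follows; overlapping_runs and occurs_as_words are hypothetical helper names, and it assumes that words consist of \w characters only.

```python
import re


def overlapping_runs(text, n):
    """Return every run of n words in text, overlapping runs included
    (hypothetical helper)."""
    # \b\w+ repeated n times, joined by \s+; the lookahead lets matches overlap.
    pattern = r'(?=({}))'.format(r'\s+'.join([r'\b\w+'] * n))
    return re.findall(pattern, text)


def occurs_as_words(run, text):
    """True if the words of run occur consecutively in text,
    starting and ending on word boundaries."""
    body = r'\s+'.join(re.escape(word) for word in run.split())
    return re.search(r'\b{}\b'.format(body), text) is not None


st1 = 'A B C D E F'
st2 = 'B D E'
print([(s, occurs_as_words(s, st1)) for s in overlapping_runs(st2, 2)])
# [('B D', False), ('D E', True)]
print(occurs_as_words('D E', 'AD EF'))  # False, unlike 'D E' in 'AD EF'
```

The plain in test treats 'D E' as a substring of 'AD EF'; anchoring both ends with \b avoids that kind of partial-word hit.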

Answer 2

You could build n-grams like so:

a = 'this is an example, whatever'.split()
b = 'this is another example, whatever'.split()

def ngrams(string, n):
    return set(zip(*[string[i:] for i in range(n)]))

def common_ngrams(string1, string2, n):
    return ngrams(string1, n) & ngrams(string2, n)

Results:

print(common_ngrams(a, b, 2))
{('this', 'is'), ('example,', 'whatever')}

print(common_ngrams(a, b, 1))
{('this',), ('is',), ('example,',), ('whatever',)}

Note that the tricky bit is the zip call in the ngrams function:

zip(*[string[i:] for i in range(n)])

This is essentially the same as

zip(string, string[1:], string[2:])

for n = 3.

Also note that we're using sets of tuples; this is the best option performance-wise, since set intersection relies on hash lookups.
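To tie this back to the question, here is a small sketch that shows what the zip trick produces and then reduces the intersection to a yes/no answer. The wrapper name shares_ngram is an assumption made for illustration; ngrams is repeated so that the snippet runs on its own.

```python
def ngrams(words, n):
    # Zip the word list with copies of itself shifted by 1, 2, ..., n - 1.
    return set(zip(*[words[i:] for i in range(n)]))


def shares_ngram(str1, str2, n):
    """Hypothetical wrapper: True if both strings share a run of n words."""
    # isdisjoint avoids building the full intersection set.
    return not ngrams(str1.split(), n).isdisjoint(ngrams(str2.split(), n))


print(sorted(ngrams('A B C D'.split(), 3)))
# [('A', 'B', 'C'), ('B', 'C', 'D')]
print(shares_ngram('A B C D E F', 'B D E', 2))  # True
print(shares_ngram('A B C D E F', 'B D E', 3))  # False
```

zip stops at the shortest of its arguments, which is why no explicit bounds check is needed for the last window.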

Answer 3

Let's say you have two strings (this can just as easily be solved for strings whose words contain more than one letter each):

a = 'this is a beautiful day'
b = 'this day is awful'

Then, to get all the words of b that also belong to a, you write

x = [x for x in b.split() if x in a.split()]

Now x contains (after one line of code)

['this', 'day', 'is']

Then you check whether the consecutive combinations of x (from 0 up to len(x)) occur in b:

for i in range(len(x)):
    for j in range(i + 1, len(x) + 1):  # start at i + 1 so the slice is never empty
        word = ' '.join(x[i:j])
        if word in b:
            print(word)

The example prints the (order-preserving) combinations of b's words that are also present in a in the same order (it takes a small tweak in the if statement of the nested for loop).

Answer 4

The longest common substring algorithm will work here if you pass in split lists instead of plain strings, with the added bonus that it will also give the longest common run of characters if you pass in the unsplit strings.

def longest_common_substring(s1, s2):
    # m[x][y] is the length of the common run ending at s1[x - 1] and s2[y - 1]
    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x  # remember where the longest run ends in s1
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]
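For the original question only the length of the longest shared run matters: if it is at least n, some n consecutive words are shared. Below is a sketch of that check, using the same dynamic-programming idea but keeping only the previous row of the table. The names longest_common_run and contains_n_consecutive are hypothetical.

```python
def longest_common_run(seq1, seq2):
    """Length of the longest run of consecutive items shared by both
    sequences (hypothetical helper)."""
    prev = [0] * (len(seq2) + 1)
    longest = 0
    for item1 in seq1:
        curr = [0] * (len(seq2) + 1)
        for y, item2 in enumerate(seq2, start=1):
            if item1 == item2:
                # Extend the run that ended at the previous item of each sequence.
                curr[y] = prev[y - 1] + 1
                longest = max(longest, curr[y])
        prev = curr
    return longest


def contains_n_consecutive(str1, str2, n):
    # Split into words first so that runs are counted in words, not characters.
    return longest_common_run(str1.split(), str2.split()) >= n


print(longest_common_run('A B C D E F'.split(), 'B D E'.split()))  # 2
print(contains_n_consecutive('A B C D E F', 'B D E', 2))  # True
print(contains_n_consecutive('A B C D E F', 'B D E', 3))  # False
```

Keeping a single row brings the memory use down from len(s1) * len(s2) to len(s2), at the cost of no longer being able to recover the run itself.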


Answer 1
2 accepted 2014-04-11 17:54:15

Answer 2
1 2014-04-11 17:37:33

Answer 3
0 2014-04-11 17:28:29

Answer 4
0 2014-04-11 18:02:34
