Check whether a string contains a certain number of words from another string

Question

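Suppose we have a string 1, ABCDEF, and a string 2, BDE (the letters are only for demonstration; in reality they are words). Now I would like to find out whether any n consecutive "words" from string 2 occur in string 1. To convert a string into "words", I would use string.split().

For example, for n equal to 2, I would like to check whether BD or DE occurs, in that order, in string 1. BD does not occur in this order in string 1, but DE does.

Does anyone see a pythonic way of doing this?

I do have a solution for n equal to 2, but realized that I need it for arbitrary n. It is also not particularly pretty: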

def string_contains_words_of_string(words_str, words_to_check_str):
    words = words_str.split()
    words_to_check = words_to_check_str.split()

    found_word_index = None
    for word in words:
        start = 0 if found_word_index is None else found_word_index + 1
        for i, word_to_check in enumerate(words_to_check[start:]):
            if word_to_check == word:
                if found_word_index is not None:
                    return True
                found_word_index = i
                break
            else:
                found_word_index = None
    return False

Answer 1

This is easy with a regular expression:

>>> import re
>>> st1='A B C D E F'
>>> st2='B D E'
>>> n=2
>>> pat=r'(?=({}))'.format(r'\s+'.join(r'\w+' for i in range(n)))
>>> print([(s, s in st1) for s in re.findall(pat, st2)])
[('B D', False), ('D E', True)]

The key is to use a zero-width lookahead to find overlapping matches in the string. So:

>>> re.findall('(?=(\\w+\\s+\\w+))', 'B D E')
['B D', 'D E']

Now build the pattern by repeating \w+ (one word) n times, joined by \s+:

>>> n=2
>>> r'(?=({}))'.format(r'\s+'.join(r'\w+' for i in range(n)))
'(?=(\\w+\\s+\\w+))'

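Now, since you have two strings, use Python's in operator to produce a tuple of each regex match s from the second string together with the result of testing whether s occurs in the target string.

Of course, if you want a non-regex way to do this, just produce the substrings of n words each: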

>>> li=st2.split()
>>> n=2
>>> [(s, s in st1) for s in (' '.join(li[i:i+n]) for i in range(len(li)-n+1))]
[('B D', False), ('D E', True)]
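The same idea can be wrapped in a function that returns a single yes/no answer for arbitrary n. This is only a sketch: the name shares_n_consecutive_words is made up for illustration, it assumes n >= 1, and it compares lists of words rather than substrings of characters, so 'D E' is not mistakenly found inside 'AD EF'.

```python
def shares_n_consecutive_words(st1, st2, n):
    """Return True if any n consecutive words of st2 occur, in order, in st1."""
    words1 = st1.split()
    words2 = st2.split()
    # Every run of n consecutive words in st1, stored as hashable tuples.
    runs1 = {tuple(words1[i:i + n]) for i in range(len(words1) - n + 1)}
    # Check each run of n consecutive words in st2 against that set.
    return any(tuple(words2[i:i + n]) in runs1
               for i in range(len(words2) - n + 1))

print(shares_n_consecutive_words('A B C D E F', 'B D E', 2))  # True
print(shares_n_consecutive_words('A B C D E F', 'B D E', 3))  # False
```

If n is larger than the number of words in either string, the ranges are empty and the function returns False.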

If you want the index (with either method), you can use str.find:

>>> [(s, st1.find(s)) for s in (' '.join(li[i:i+n]) for i in range(len(li)-n+1)) 
...     if s in st1]
[('D E', 6)]

For a word-by-word regex on real words, make sure to use word boundary anchors:

>>> st='wordW wordX wordY wordZ'
>>> re.findall(r'(?=(\b\w+\s\b\w+))', st)
['wordW wordX', 'wordX wordY', 'wordY wordZ']
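Putting the pieces together for arbitrary n and multi-letter words might look like the sketch below. The helper names word_runs and contains_run are hypothetical, and the second helper uses re.search with word boundaries instead of the plain in operator, which is a deliberate change so that a run cannot match in the middle of a longer word.

```python
import re

def word_runs(text, n):
    """Return every run of n consecutive words in text, overlaps included."""
    # \b stops a match from starting in the middle of a word; the
    # lookahead consumes nothing, so overlapping runs are all reported.
    pattern = r'(?=(\b{}))'.format(r'\s+'.join([r'\w+'] * n))
    return re.findall(pattern, text)

def contains_run(st1, st2, n):
    """Return True if any n-word run of st2 occurs in st1 on word boundaries."""
    for run in word_runs(st2, n):
        if re.search(r'\b{}\b'.format(re.escape(run)), st1):
            return True
    return False

print(word_runs('wordW wordX wordY wordZ', 3))
print(contains_run('A B C D E F', 'B D E', 2))  # True
```

Note that the run is matched with the exact whitespace it had in the second string, so the two strings are assumed to separate words the same way.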

Answer 2

You can build n-grams like this:

a = 'this is an example, whatever'.split()
b = 'this is another example, whatever'.split()

def ngrams(string, n):
    return set(zip(*[string[i:] for i in range(n)]))

def common_ngrams(string1, string2, n):
    return ngrams(string1, n) & ngrams(string2, n)

Result:

print(common_ngrams(a, b, 2))
{('this', 'is'), ('example,', 'whatever')}

print(common_ngrams(a, b, 1))
{('this',), ('is',), ('example,',), ('whatever',)}

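Note that the tricky bit is the zip call in the ngrams function:

zip(*[string[i:] for i in range(n)])

which is basically the same as

zip(string, string[1:], string[2:])

for n = 3.

Also note that we are using tuples: they are hashable, so they can go into a set, and that keeps the intersection fast...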
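To see what the zip call produces, and how it answers the original question, here is a small sketch. The function has_common_ngram is a hypothetical wrapper that is not part of the answer, and ngrams is repeated (with its parameter renamed to words) so the snippet runs on its own.

```python
def ngrams(words, n):
    """Return the set of all runs of n consecutive items in words."""
    return set(zip(*[words[i:] for i in range(n)]))

words = 'A B C D E F'.split()
# Three copies of the list, each shifted one position further to the left.
shifted = [words[i:] for i in range(3)]
# zip stops at the shortest copy, so only complete trigrams are produced.
trigrams = list(zip(*shifted))
print(trigrams)

def has_common_ngram(text1, text2, n):
    """Return True if the two texts share at least one run of n words."""
    return bool(ngrams(text1.split(), n) & ngrams(text2.split(), n))

print(has_common_ngram('A B C D E F', 'B D E', 2))  # True
print(has_common_ngram('A B C D E F', 'B D E', 3))  # False
```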

Answer 3

Suppose you have two strings (the same approach works just as easily when each word has more than one letter)

a = 'this is a beautiful day'
b = 'this day is awful'

Then, to get all the words of b that also belong to a, you write

x = [x for x in b.split() if x in a.split()]

Now x contains (after one line of code)

['this', 'day', 'is']

Then check whether the sequential combinations of x (from 0 to len(x)) belong to b

for i in range(len(x)):
    for j in range(i, len(x)+1):
        word = ' '.join(x[i:j])
        if word in b:
            print(word)

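The example prints the (order-preserving) combinations of b's words that also appear in the same order in a (a small adjustment is needed in the if statement inside the nested for)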
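One possible reading of that adjustment is sketched below; it is an interpretation, not the answerer's code. The function name common_runs is made up, the run is tested against both a and b, the inner loop starts at i + 1 so the empty slice is never tested, and the strings are padded with spaces so that matching happens on whole words only.

```python
def common_runs(a, b):
    """Runs of b's shared words that occur, in order, in both a and b."""
    words_a = a.split()
    shared = [w for w in b.split() if w in words_a]
    # Pad with spaces so that 'is' cannot match inside 'this'.
    padded_a = ' ' + ' '.join(words_a) + ' '
    padded_b = ' ' + ' '.join(b.split()) + ' '
    found = []
    for i in range(len(shared)):
        # Start j at i + 1 so that the slice always holds at least one word.
        for j in range(i + 1, len(shared) + 1):
            run = ' ' + ' '.join(shared[i:j]) + ' '
            if run in padded_a and run in padded_b:
                found.append(run.strip())
    return found

print(common_runs('this is a beautiful day', 'this day is awful'))
print(common_runs('this is a beautiful day', 'well this is awful'))
```

To answer the original question for a given n, it would be enough to check whether any returned run has at least n words.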

Answer 4

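The longest common substring algorithm will work here if you pass in split lists rather than plain strings. If you pass in unsplit strings, it compares character by character and returns the longest common run of characters instead.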

def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]
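A usage sketch for the original question follows. The wrapper has_run_of_length is hypothetical, and the function above is repeated (with range in place of xrange) only so that the snippet runs on its own. The idea is that the two strings share n consecutive words exactly when the longest common run of words has length at least n.

```python
def longest_common_substring(s1, s2):
    # m[x][y] holds the length of the common run ending at s1[x-1] and s2[y-1].
    m = [[0] * (1 + len(s2)) for _ in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]

def has_run_of_length(st1, st2, n):
    """Return True if st1 and st2 share at least n consecutive words."""
    return len(longest_common_substring(st1.split(), st2.split())) >= n

print(longest_common_substring('A B C D E F'.split(), 'B D E'.split()))  # ['D', 'E']
print(has_run_of_length('A B C D E F', 'B D E', 2))  # True
print(has_run_of_length('A B C D E F', 'B D E', 3))  # False
```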

4 solutions

Solution 1
2 Accepted 2014-04-11 17:54:15

Solution 2
1 2014-04-11 17:37:33

Solution 3
0 2014-04-11 17:28:29

Solution 4
0 2014-04-11 18:02:34
