Check if string contains a certain amount of words of another string

Say we have a string 1 A B C D E F and a string 2 B D E (the letters are just for demo; in reality they are words). Now I would like to find out whether any n consecutive "words" from string 2 appear in string 1. To convert the string to "words", I'd use string.split().

For example, for n equals 2, I would like to check whether B D or D E appears, in this order, in string 1. B D does not appear in this order in the string, but D E does.

Does anyone see a pythonic way of doing this?

I do have a solution for n equals 2, but realized that I need it for arbitrary n. Also, it is not particularly beautiful:

def string_contains_words_of_string(words_str, words_to_check_str):
    words = words_str.split()
    words_to_check = words_to_check_str.split()

    found_word_index = None
    for word in words:
        start = 0 if found_word_index is None else found_word_index + 1
        for i, word_to_check in enumerate(words_to_check[start:]):
            if word_to_check == word:
                if found_word_index is not None:
                    return True
                found_word_index = i
                break
            else:
                found_word_index = None
    return False

This is easy with a regex:

>>> import re
>>> st1='A B C D E F'
>>> st2='B D E'
>>> n=2
>>> pat=r'(?=({}))'.format(r'\s+'.join(r'\w+' for i in range(n)))
>>> print([(s, s in st1) for s in re.findall(pat, st2)])
[('B D', False), ('D E', True)]

The key is to use a zero-width lookahead to find overlapping matches in the string. So:

>>> re.findall('(?=(\\w+\\s+\\w+))', 'B D E')
['B D', 'D E']
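
For contrast, a plain pattern consumes the text it matches, so the second pair is never seen. A minimal sketch of the difference (the variable names are illustrative and not part of the original answer):

```python
import re

text = 'B D E'

# Without the lookahead, the first match consumes 'B D', so the scan
# resumes after 'D' and 'D E' can no longer be matched.
consuming = re.findall(r'\w+\s+\w+', text)

# The lookahead itself matches zero characters, so the scan advances one
# position at a time and the capture group reports every overlapping pair.
overlapping = re.findall(r'(?=(\w+\s+\w+))', text)

print(consuming)    # ['B D']
print(overlapping)  # ['B D', 'D E']
```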

Now build that for n repetitions of the word found by \w+ with:

>>> n=2
>>> r'(?=({}))'.format(r'\s+'.join(r'\w+' for i in range(n)))
'(?=(\\w+\\s+\\w+))'

Now, since you have two strings, use Python's in operator to pair each match s from the regex with the result of testing whether s occurs in the target string.
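
Put together, the steps might be wrapped in a small helper for arbitrary n. This is only a sketch: the name runs_in_target is hypothetical, and the \b anchor is an added assumption so that multi-letter words are only matched from their first character.

```python
import re

def runs_in_target(source, target, n):
    """Pair every run of n consecutive words in `source` with a flag
    telling whether that run occurs as a substring of `target`."""
    # Build \w+\s+\w+... with n word parts inside a zero-width lookahead.
    # The \b anchor (an assumption, see above) keeps runs from starting
    # in the middle of a word.
    pattern = r'(?=(\b{}))'.format(r'\s+'.join([r'\w+'] * n))
    return [(run, run in target) for run in re.findall(pattern, source)]

print(runs_in_target('B D E', 'A B C D E F', 2))
# [('B D', False), ('D E', True)]
```

Wrapping the result in any(found for _, found in ...) would turn it into a plain yes/no answer.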


Of course, if you want a non-regex way to do this, just produce the substrings of n consecutive words:

>>> li=st2.split()
>>> n=2
>>> [(s, s in st1) for s in (' '.join(li[i:i+n]) for i in range(len(li)-n+1))]
[('B D', False), ('D E', True)]

And if you want the index (with either method), you can use str.find:

>>> [(s, st1.find(s)) for s in (' '.join(li[i:i+n]) for i in range(len(li)-n+1)) 
...     if s in st1]
[('D E', 6)]
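
One caveat: s in st1 compares characters, so 'D E' would also be found inside 'AD EF'. If whole words matter, the windows can be compared as tuples instead. A minimal sketch, assuming a yes/no answer is enough (the function name contains_n_consecutive is hypothetical):

```python
def contains_n_consecutive(words_str, words_to_check_str, n):
    """Return True if any n consecutive words of words_to_check_str
    appear, in the same order and adjacent, in words_str."""
    if n < 1:
        raise ValueError('n must be at least 1')
    words = words_str.split()
    to_check = words_to_check_str.split()
    # Every window of n adjacent words from the first string, as tuples
    # so they can be stored in a set.
    windows = {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return any(tuple(to_check[i:i + n]) in windows
               for i in range(len(to_check) - n + 1))

print(contains_n_consecutive('A B C D E F', 'B D E', 2))  # True
print(contains_n_consecutive('A B C D E F', 'B D E', 3))  # False
print(contains_n_consecutive('AD EF', 'D E', 2))          # False
```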

For a regex that goes word by word, make sure you use a word boundary anchor:

>>> st='wordW wordX wordY wordZ'
>>> re.findall(r'(?=(\b\w+\s\b\w+))', st)
['wordW wordX', 'wordX wordY', 'wordY wordZ']

You could build n-grams like so:

a = 'this is an example, whatever'.split()
b = 'this is another example, whatever'.split()

def ngrams(string, n):
    return set(zip(*[string[i:] for i in range(n)]))

def common_ngrams(string1, string2, n):
    return ngrams(string1, n) & ngrams(string2, n)

Results:

print(common_ngrams(a, b, 2))
{('this', 'is'), ('example,', 'whatever')}

print(common_ngrams(a, b, 1))
{('this',), ('is',), ('example,',), ('whatever',)}

Note that the tricky bit is the zip call in the ngrams function:

zip(*[string[i:] for i in range(n)])

This is essentially the same as

zip(string, string[1:], string[2:])

for n = 3.

Also note that we're using sets of tuples, which is a good choice performance-wise: tuples are hashable, so the intersection is a fast set operation.
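
To see what the zip call produces, here is a small sketch for n = 3 (the variable names are illustrative). Each shifted slice starts one word later, and zip stops at the shortest slice, so only complete trigrams come out:

```python
words = 'this is an example, whatever'.split()

# Written out by hand for n = 3.
trigrams = list(zip(words, words[1:], words[2:]))

# The general form used in ngrams() builds the same shifted slices.
general = list(zip(*[words[i:] for i in range(3)]))

print(trigrams)
# [('this', 'is', 'an'), ('is', 'an', 'example,'), ('an', 'example,', 'whatever')]
print(trigrams == general)  # True
```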

Let's say you have two strings (the same approach works just as well when the words are longer than one letter each):

a = 'this is a beautiful day'
b = 'this day is awful'

Then, to get all the words of b that also belong to a, you write

x = [x for x in b.split() if x in a.split()]

Now x contains (after one line of code)

['this', 'day', 'is']

Then you check whether the consecutive runs of x (slices taken from index 0 up to len(x)) appear in b:

for i in range(len(x)):
    for j in range(i + 1, len(x)+1):  # start at i + 1 so the slice is never empty
        word = ' '.join(x[i:j])
        if word in b:
            print(word)

The example prints the (order-preserving) combinations of b's words that also appear in a in the same order (this takes a small tweak to the if statement in the nested for loop).
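
The answer does not spell out the tweak. One possible reading, shown here only as a sketch, is to test each run against a instead of b. The function name common_runs is hypothetical, and the test is still a character-level substring check, as in the loop above.

```python
def common_runs(a, b):
    """Collect runs of b's shared words that also occur, in order, in a.
    Assumes the intended tweak is testing `run in a` rather than `run in b`."""
    a_words = a.split()
    # Words of b that also occur in a, kept in b's order.
    shared = [w for w in b.split() if w in a_words]
    runs = []
    for i in range(len(shared)):
        for j in range(i + 1, len(shared) + 1):
            run = ' '.join(shared[i:j])
            if run in a:  # the assumed tweak
                runs.append(run)
    return runs

print(common_runs('this is a beautiful day', 'this day is awful'))
# ['this', 'day', 'is']
```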

The longest common substring algorithm will work here if you pass in split lists instead of plain strings, with the added bonus that it also returns the longest common run of characters if you pass in the unsplit strings.

def longest_common_substring(s1, s2):
    m = [[0] * (1 + len(s2)) for i in range(1 + len(s1))]
    longest, x_longest = 0, 0
    for x in range(1, 1 + len(s1)):
        for y in range(1, 1 + len(s2)):
            if s1[x - 1] == s2[y - 1]:
                m[x][y] = m[x - 1][y - 1] + 1
                if m[x][y] > longest:
                    longest = m[x][y]
                    x_longest = x
            else:
                m[x][y] = 0
    return s1[x_longest - longest: x_longest]
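
To answer the original question with this idea, the length of the longest common run can be compared with n. The following is a sketch rather than the answer's own code: longest_common_run is a hypothetical Python 3 variant of the same dynamic programme that keeps only the previous row of the table.

```python
def longest_common_run(seq1, seq2):
    """Return the longest contiguous run of items shared by both
    sequences (lists of words, or plain strings)."""
    prev = [0] * (len(seq2) + 1)  # run lengths ending at the previous item of seq1
    best_len, best_end = 0, 0
    for x in range(1, len(seq1) + 1):
        cur = [0] * (len(seq2) + 1)
        for y in range(1, len(seq2) + 1):
            if seq1[x - 1] == seq2[y - 1]:
                # Extend the run that ended one item earlier in both sequences.
                cur[y] = prev[y - 1] + 1
                if cur[y] > best_len:
                    best_len, best_end = cur[y], x
        prev = cur
    return seq1[best_end - best_len:best_end]

run = longest_common_run('A B C D E F'.split(), 'B D E'.split())
print(run)            # ['D', 'E']
print(len(run) >= 2)  # True: at least n = 2 consecutive words are shared
```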
