Check if a string contains all words of another string in the same order (Python)?

I want to check if a string contains all of a substring's words and retains their order. At the moment I am using the following code. However, it is very basic, seems inefficient, and there is likely a much better way of doing it. I'd really appreciate it if you could tell me what a more efficient solution would be. Sorry for the noob question; I am new to programming and wasn't able to find a good solution.

def check(main, sub_split):
    n=0
    while n < len(sub_split):
        result = True
        if sub_split[n] in main:
            the_start =  main.find(sub_split[n])
            main = main[the_start:]

        else:
            result=False
        n += 1
    return result

a = "I believe that the biggest castle in the world is Prague Castle "
b= "the biggest castle".split(' ')

print(check(a, b))

Update: Interesting. First of all, thank you all for your answers. Also, thank you for pointing out some of the spots that my code missed. I have been trying the different solutions posted here and in the links; I will add an update on how they compare and accept an answer then.

Update: Again, thank you all for the great solutions; every one of them was a major improvement over my code. I checked the suggestions against my requirements for 100000 checks and got the following results, by suggestion:

Padraic Cunningham - consistently under 0.4 secs (though it gives some false positives when searching for only full words)
galaxyan - 0.65 secs; 0.75 secs
friendly dog - 0.70 secs
John1024 - 1.3 secs (highly accurate, but seems to take extra time)

You can simplify your search by passing the index of the previous match + 1 to find; you don't need to slice anything:

def check(main, sub_split):
    ind = -1
    for word in sub_split:
        ind = main.find(word, ind+1)
        if ind == -1:
            return False
    return True

a = "I believe that the biggest castle in the world is Prague Castle "
b= "the biggest castle".split(' ')

print(check(a, b))

If ind is ever -1, there is no match after the previous position, so you return False. If you get through all the words, then all of them are in the string in order.
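
One thing to keep in mind with str.find is that it matches substrings, not whole words. The following is a small sketch of that behaviour; the sample sentences are made up for illustration and are not from the question.

```python
def check(main, sub_split):
    # Same find-based approach as above: search for each word
    # starting just past the previous match.
    ind = -1
    for word in sub_split:
        ind = main.find(word, ind + 1)
        if ind == -1:
            return False
    return True

# "the" is found inside "there", so this reports a match even
# though "the" never appears as a standalone word.
substring_hit = check("there is a castle", ["the", "castle"])

# Words in the wrong order are still rejected.
wrong_order = check("castle of the king", ["the", "castle"])

print(substring_hit, wrong_order)
```

If partial matches like this are unwanted, a word-based comparison is the safer choice.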

For exact words, you could do something similar with lists:

def check(main, sub_split):
    lst, ind = main.split(), -1
    for word in sub_split:
        try:
            ind = lst.index(word, ind + 1)
        except ValueError:
            return False
    return True
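
A more compact variant of the same idea, offered as a sketch rather than as part of the original answer: consume a single iterator over the words of main, so that each lookup resumes where the previous one stopped. The name contains_in_order is made up for this example.

```python
def contains_in_order(main, sub_split):
    # One shared iterator: `word in words` advances it until a match
    # is found, so the next search continues after that position.
    words = iter(main.split())
    return all(word in words for word in sub_split)

a = "I believe that the biggest castle in the world is Prague Castle "
in_order = contains_in_order(a, "the biggest castle".split())
out_of_order = contains_in_order(a, "castle biggest the".split())

print(in_order, out_of_order)
```

Like the list version, this compares whole whitespace-separated tokens, so it is case-sensitive and treats attached punctuation as part of the word.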

And to handle punctuation, you could first strip it off:

from string import punctuation

def check(main, sub_split):
    ind = -1
    lst = [w.strip(punctuation) for w in main.split()]
    for word in (w.strip(punctuation) for w in sub_split):
        try:
            ind = lst.index(word, ind + 1)
        except ValueError:
            return False
    return True

Of course, some words are valid with punctuation, but that is more a job for nltk; or you may actually want to find matches that include any punctuation.

Let's define your a string and reformat your b string into a regex:

>>> import re
>>> a = "I believe that the biggest castle in the world is Prague Castle "
>>> b = r'\b' + r'\b.*\b'.join(re.escape(word) for word in "the biggest castle".split(' ')) + r'\b'

This tests to see if the words in b appear in the same order in a:

>>> bool(re.search(b, a))
True

Caveat: If speed is important, a non-regex approach may be faster.
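
To check the speed caveat on your own data, one option is a small timeit harness. This is only a sketch under assumptions: check_find and check_regex are illustrative names, the find-based function is the one from the other answer, and the regex is precompiled, which the original snippet does not do.

```python
import re
import timeit

a = "I believe that the biggest castle in the world is Prague Castle "
words = "the biggest castle".split(' ')

def check_find(main, sub_split):
    # Non-regex approach: repeated str.find from the previous match.
    ind = -1
    for word in sub_split:
        ind = main.find(word, ind + 1)
        if ind == -1:
            return False
    return True

# Precompile so the loop measures matching, not pattern construction.
pattern = re.compile(
    r'\b' + r'\b.*\b'.join(re.escape(w) for w in words) + r'\b')

def check_regex(main):
    return bool(pattern.search(main))

n = 100000  # the question mentions 100000 checks
t_find = timeit.timeit(lambda: check_find(a, words), number=n)
t_regex = timeit.timeit(lambda: check_regex(a), number=n)
print("find:  %.3f s" % t_find)
print("regex: %.3f s" % t_regex)
```

Absolute numbers depend on the machine, the Python version, and the input strings, so treat the output as a relative comparison only. The two functions are also not equivalent: the regex version matches whole words, while the find version matches substrings.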

How it works

The key thing here is the reformulation of the string into a regex:

>>> b = r'\b' + r'\b.*\b'.join(re.escape(word) for word in "the biggest castle".split(' ')) + r'\b'
>>> print(b)
\bthe\b.*\bbiggest\b.*\bcastle\b

\b matches only at word boundaries. This means, for example, that the word the will never be confused with the word there. Further, this regex requires that all the words be present in the target string in the same order.

If a contains a match to the regex b, then re.search(b, a) returns a match object. Otherwise, it returns None. Thus, bool(re.search(b, a)) returns True only if a match was found.
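
The steps above can be wrapped in a small helper. This is a minimal sketch: the function name contains_in_order and the sample sentences are assumptions made for illustration, not part of the original answer.

```python
import re

def contains_in_order(text, phrase):
    # Escape each word, then require the words in order with anything
    # in between, anchoring every word on both sides with \b.
    words = phrase.split()
    pattern = r'\b' + r'\b.*\b'.join(re.escape(w) for w in words) + r'\b'
    return bool(re.search(pattern, text))

# Whole words in the right order match.
whole_word = contains_in_order("the biggest castle", "the castle")

# "the" inside "there" is not at a word boundary, so no match.
no_partial = contains_in_order("there is a castle", "the castle")

print(whole_word, no_partial)
```

Note that . does not match newlines by default, so for multi-line text you would likely need to pass re.DOTALL to re.search.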

Example with punctuation

Because word boundaries treat punctuation as non-word characters, this approach is not confused by punctuation:

>>> a = 'From here, I go there.'
>>> b = 'here there'
>>> b = r'\b' + r'\b.*\b'.join(re.escape(word) for word in b.split(' ')) + r'\b'
>>> bool(re.search(b, a))
True

If you just want to check whether any word is contained in the other string, there is no need to check all of them. You just need to find one and return True.
Checking membership in a set is faster: O(1) on average.

a = "I believe that the biggest castle in the world is Prague Castle "
b = "the biggest castle"

def check(a,b):
    setA,lstB = set( a.split() ), b.split() 
    if len(setA) < len(lstB): return False 
    for item in lstB:
        if item in setA:
            return True
    return False

print(check(a, b))

If you do not care about speed:

def check(a,b):
    setA,lstB = set( a.split() ), b.split() 
    return len(setA) >= len(lstB) and any( 1 for item in lstB if item in setA) 

For speed and time complexity: link
