
python: find first string in string

Given a string and a list of substrings, I want the first position at which any of the substrings occurs in the string. If no substring occurs, return 0. I want to ignore case.

Is there something more pythonic than:

given = 'Iamfoothegreat'
targets = ['foo', 'bar', 'grea', 'other']
res = len(given)
for t in targets:
    i = given.lower().find(t)
    if i > -1 and i < res:
        res = i

if res == len(given):
    result = 0
else:
    result = res

That code works, but seems inefficient.

I would not return 0, as that could be a valid start index; use -1, None, or some other value that cannot be a real index. You can simply use try/except and return the index:

def get_ind(s, targ):
    s = s.lower()
    for t in targ:  # iterate over the parameter, not the global `targets`
        try:
            return s.index(t.lower())
        except ValueError:
            pass
    return None # -1, False ...

If you also want to ignore case in the input string, set s = s.lower() before the loop, as the function above does.
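To illustrate why 0 is a poor "not found" value, here is a minimal sketch. The helper name `first_index_or_zero` is hypothetical and only mimics the convention from the question:

```python
def first_index_or_zero(given, targets):
    """Return the earliest match position, or 0 when nothing matches."""
    given = given.lower()
    hits = [i for i in (given.find(t.lower()) for t in targets) if i != -1]
    return min(hits) if hits else 0

# A real match at index 0 and "no match at all" are indistinguishable.
print(first_index_or_zero('foothegreat', ['foo']))   # prints 0 (match at index 0)
print(first_index_or_zero('nothing here', ['foo']))  # prints 0 (no match)
```

Returning None or -1 instead lets the caller tell the two cases apart.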

You could also do something like:

def get_ind_next(s, targ):
   s = s.lower() 
   return next((s.index(t) for t in map(str.lower,targ) if t in s), None)

But that does, at worst, two lookups for each substring, as opposed to one with try/except. It does at least also short-circuit on the first match.
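If the double lookup is a concern and you would rather avoid exceptions too, `str.find` gives a single lookup per target. This is only a sketch, and the name `get_ind_find` is made up for illustration:

```python
def get_ind_find(s, targ):
    """Return the index of the first target (in list order) found in s, else None."""
    s = s.lower()
    for t in targ:
        i = s.find(t.lower())  # one scan per target; -1 means "not found"
        if i != -1:
            return i           # short-circuit on the first target that matches
    return None
```

Like the versions above, this returns the position of the first target that matches in list order, which is not necessarily the smallest position.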

If you actually want the min of all matches, then change to:

def get_ind(s, targ):
    s = s.lower()
    mn = float("inf")
    for t in targ:
        try:
            i = s.index(t.lower()) 
            if i < mn:
                mn = i 
        except ValueError:
            pass
    return mn   

def get_ind_next(s, targ):
   s = s.lower()
   return min((s.index(t) for t in map(str.lower, targ) if t in s), default=None)

The default=None argument only works in Python >= 3.4, so if you are using Python 2 you will have to change the logic slightly.
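One possible way to change the logic, sketched here under the assumption that catching the ValueError raised by `min()` on an empty sequence is acceptable (the name `get_ind_next_compat` is hypothetical):

```python
def get_ind_next_compat(s, targ):
    """Smallest match position, or None; avoids min(..., default=...)."""
    s = s.lower()
    try:
        return min(s.index(t) for t in map(str.lower, targ) if t in s)
    except ValueError:
        # min() raises ValueError when the generator yields nothing
        return None
```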

Timings, Python 3:

In [29]: s = "hello world" * 5000
In [30]:  s += "grea" + s
In [25]: %%timeit
   ....: targ = [re.escape(x) for x in targets]
   ....: pattern = r"%(pattern)s" % {'pattern' : "|".join(targ)}
   ....: firstMatch = next(re.finditer(pattern, s, re.IGNORECASE),None)
   ....: if firstMatch:
   ....:     pass
   ....: 
100 loops, best of 3: 5.11 ms per loop
In [18]: timeit get_ind_next(s, targets)
1000 loops, best of 3: 691 µs per loop

In [19]: timeit get_ind(s, targets)
1000 loops, best of 3: 627 µs per loop

In [20]:  timeit  min([s.lower().find(x.lower()) for x in targets if x.lower() in s.lower()] or [0])
1000 loops, best of 3: 1.03 ms per loop

In [21]: s = 'Iamfoothegreat'
In [22]: targets = ['bar', 'grea', 'other','foo']
In [23]: get_ind_next(s, targets) == get_ind(s, targets) ==  min([s.lower().find(x.lower()) for x in targets if x.lower() in s.lower()] or [0])
Out[24]: True

Python 2:

In [13]: s = "hello world" * 5000
In [14]:  s += "grea" + s

In [15]: targets = ['foo', 'bar', 'grea', 'other']
In [16]: timeit get_ind(s, targets)
1000 loops, best of 3: 322 µs per loop

In [17]:  timeit  min([s.lower().find(x.lower()) for x in targets if x.lower() in s.lower()] or [0])
1000 loops, best of 3: 710 µs per loop

In [18]: get_ind(s, targets) ==  min([s.lower().find(x.lower()) for x in targets if x.lower() in s.lower()] or [0])
Out[18]: True

You can also combine the first version with min:

def get_ind(s, targ):
    s,mn = s.lower(), None
    for t in targ:
        try:
            mn = s.index(t.lower())
            yield mn
        except ValueError:
            pass
    yield mn

This does the same job; it is just a bit nicer and maybe slightly faster:

In [45]: min(get_ind(s, targets))
Out[45]: 55000

In [46]: timeit min(get_ind(s, targets))
1000 loops, best of 3: 317 µs per loop

Use regex

Another option is to use a regex, since I think the Python regex implementation is very fast. My regex version is:

import re

given = 'IamFoothegreat'
targets = ['foo', 'bar', 'grea', 'other']

# escape each target so regex metacharacters are matched literally
targets = [re.escape(x) for x in targets]
pattern = r"%(pattern)s" % {'pattern' : "|".join(targets)}
# the first item from finditer is the leftmost match, or None if there is none
firstMatch = next(re.finditer(pattern, given, re.IGNORECASE), None)
if firstMatch:
    print(firstMatch.start())
    print(firstMatch.group())

Output is

3
Foo

If nothing is found, nothing is printed. The check on firstMatch should be self-explanatory: it is None when nothing is found.
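The same regex idea can be wrapped in a function. This sketch swaps `next(re.finditer(...))` for `re.search`, which also returns the leftmost match; the function name `first_match` and the tuple return value are choices made for illustration:

```python
import re

def first_match(given, targets):
    """Return (start, matched_text) for the leftmost match, or None."""
    if not targets:
        return None  # an empty alternation would match the empty string at 0
    pattern = "|".join(re.escape(t) for t in targets)
    m = re.search(pattern, given, re.IGNORECASE)
    return (m.start(), m.group()) if m else None

print(first_match('IamFoothegreat', ['foo', 'bar', 'grea', 'other']))  # (3, 'Foo')
print(first_match('nothing', ['foo', 'bar']))                          # None
```

Note that `group()` returns the text as it appears in the input, so its case follows `given`, not the target.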

A more conventional, not really pythonic approach

This gives you the matched string, too

given = 'Iamfoothegreat'.lower()
targets = ['foo', 'bar', 'grea', 'other']

dct = {'pos': -1, 'string': None}

for t in targets:
    i = given.find(t)
    # keep the earliest hit; a 'pos' of -1 means nothing has been found yet
    if i > -1 and (i < dct['pos'] or dct['pos'] == -1):
        dct['pos'] = i
        dct['string'] = t

print(dct)

Output is:

{'pos': 3, 'string': 'foo'}

If no element is found:

{'pos': -1, 'string': None}

Performance comparison of both

With this string and pattern:

given = "hello world" * 5000
given += "grea" + given
targets = ['foo', 'bar', 'grea', 'other']

1000 loops with timeit:

regex approach: 4.08629107475 sec for 1000
normal approach: 1.80048894882 sec for 1000

10 loops, now with a much bigger target list (targets * 1000):

normal approach: 4.06895017624 for 10
regex approach: 34.8153910637 for 10
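The harness used for these numbers is not shown, so the following is only a guess at how such a comparison could be set up with `timeit`. The function names and the precompiled pattern are assumptions, and timings will differ by machine and Python version:

```python
import re
import timeit

given = "hello world" * 5000
given += "grea" + given
targets = ['foo', 'bar', 'grea', 'other']

def normal_approach():
    # lower once, then one find() per target
    lowered = given.lower()
    hits = [i for i in (lowered.find(t) for t in targets) if i != -1]
    return min(hits) if hits else -1

# assumption: the pattern is compiled once, outside the timed function
pattern = re.compile("|".join(re.escape(t) for t in targets), re.IGNORECASE)

def regex_approach():
    m = pattern.search(given)
    return m.start() if m else -1

# both approaches should agree before their speed is compared
assert normal_approach() == regex_approach()

print("normal approach:", timeit.timeit(normal_approach, number=10))
print("regex approach: ", timeit.timeit(regex_approach, number=10))
```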

You could use the following:

answer = min([given.lower().find(x.lower()) for x in targets 
    if x.lower() in given.lower()] or [0])

Demo 1

given = 'Iamfoothegreat'
targets = ['foo', 'bar', 'grea', 'other']

answer = min([given.lower().find(x.lower()) for x in targets 
    if x.lower() in given.lower()] or [0])
print(answer)

Output

3

Demo 2

given = 'this is a different string'
targets = ['foo', 'bar', 'grea', 'other']

answer = min([given.lower().find(x.lower()) for x in targets 
    if x.lower() in given.lower()] or [0])
print(answer)

Output

0

I also think that the following solution is quite readable:

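given = 'the string'
targets = ('foo', 'bar', 'grea', 'other')  # startswith needs a tuple, not a list

given = given.lower()

# check every position until some target starts there
for i in range(len(given)):
    if given.startswith(targets, i):
        print(i)
        break
else:
    # the loop ended without a break: no target occurs anywhere
    print(-1)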
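The loop above relies on `str.startswith` accepting a tuple of prefixes and a start offset. Wrapped in a function it might look like this; the name `first_pos_startswith` is hypothetical, and lowering the targets is an addition for case-insensitivity:

```python
def first_pos_startswith(given, targets):
    """Return the first index where any target starts, or -1 if none does."""
    given = given.lower()
    prefixes = tuple(t.lower() for t in targets)  # a list here raises TypeError
    for i in range(len(given)):
        if given.startswith(prefixes, i):
            return i
    return -1

print(first_pos_startswith('Iamfoothegreat', ['foo', 'bar', 'grea', 'other']))  # 3
print(first_pos_startswith('the string', ['foo', 'bar', 'grea', 'other']))      # -1
```

Because it scans positions from left to right, the first hit is already the minimum, so no `min` is needed.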

Your code is fairly good, but you can make it a little more efficient by moving the .lower conversion out of the loop: there's no need to repeat it for each target substring. The code can be condensed a little using list comprehensions, although that doesn't necessarily make it faster. I use a nested list comp to avoid calling given.find(t) twice for each t.

I've wrapped my code in a function for easier testing.

def min_match(given, targets):
    given = given.lower()
    a = [i for i in [given.find(t) for t in targets] if i > -1]
    return min(a) if a else None

targets = ['foo', 'bar', 'grea', 'othe']

data = (
    'Iamfoothegreat', 
    'IAMFOOTHEGREAT', 
    'Iamfothgrease',
    'Iamfothgret',
)

for given in data:
    print(given, min_match(given, targets))    

Output

Iamfoothegreat 3
IAMFOOTHEGREAT 3
Iamfothgrease 7
Iamfothgret None

Try this:

def getFirst(given,targets):
    try:
        return min([i for x in targets for i in [given.find(x)] if not i == -1])
    except ValueError:
        return 0
