
python: find first string in string

Given a string and a list of substrings, I want to find the first position at which any of the substrings occurs in the string. If no substring occurs, return 0. I want to ignore case.

Is there something more pythonic than:

given = 'Iamfoothegreat'
targets = ['foo', 'bar', 'grea', 'other']
res = len(given)
for t in targets:
    i = given.lower().find(t)
    if i > -1 and i < res:
        res = i

if res == len(given):
    result = 0
else:
    result = res

That code works, but seems inefficient.

I would not return 0, as that could be a valid start index. Use -1, None or some other value that cannot be an index. You can simply use a try/except and return the index:

def get_ind(s, targ):
    s = s.lower() 
    for t in targ:
        try:            
            return s.index(t.lower())
        except ValueError:
            pass
    return None # -1, False ...

Calling s = s.lower() once before the loop, as above, makes the search ignore case in the input string as well.
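To illustrate why 0 is a poor "not found" value, here is a minimal sketch. The function names first_hit_or_zero and first_hit_or_none are hypothetical and only serve the comparison; they are not part of the answer above.

```python
def first_hit_or_zero(s, targets):
    # Follows the question's convention: 0 means "nothing found".
    s = s.lower()
    hits = [i for i in (s.find(t.lower()) for t in targets) if i != -1]
    return min(hits) if hits else 0


def first_hit_or_none(s, targets):
    # Same search, but None can never be confused with a real index.
    s = s.lower()
    hits = [i for i in (s.find(t.lower()) for t in targets) if i != -1]
    return min(hits) if hits else None


# A match at the very start and "no match" give the same answer here:
print(first_hit_or_zero('foobar', ['foo']))
print(first_hit_or_zero('xyz', ['foo']))

# With None as the sentinel the two cases can be told apart:
print(first_hit_or_none('foobar', ['foo']))
print(first_hit_or_none('xyz', ['foo']))
```

Callers of the second version should test the result with "is None" rather than truthiness, because a legitimate index of 0 is falsy.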

You could also do something like:

def get_ind_next(s, targ):
   s = s.lower() 
   return next((s.index(t) for t in map(str.lower,targ) if t in s), None)

But that is doing at worst two lookups for each substring as opposed to one with a try/except. It will at least also short circuit on the first match.
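Note that short-circuiting on the first match returns the index of the first target in list order that occurs, which is not necessarily the smallest index in the string. A small sketch of the difference, using hypothetical helper names first_listed_index and earliest_index:

```python
def first_listed_index(s, targets):
    """Index of the first target (in list order) that occurs in s, else None."""
    s = s.lower()
    for t in targets:
        i = s.find(t.lower())
        if i != -1:
            return i  # short-circuits: later targets are never searched
    return None


def earliest_index(s, targets):
    """Smallest index at which any target occurs in s, else None."""
    s = s.lower()
    hits = [i for i in (s.find(t.lower()) for t in targets) if i != -1]
    return min(hits) if hits else None


given = 'Iamfoothegreat'
print(first_listed_index(given, ['grea', 'foo']))  # 9, because 'grea' is listed first
print(earliest_index(given, ['grea', 'foo']))      # 3, because 'foo' occurs earlier
```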

If you actually want the min of all then change to:

def get_ind(s, targ):
    s = s.lower()
    mn = float("inf")
    for t in targ:
        try:
            i = s.index(t.lower()) 
            if i < mn:
                mn = i 
        except ValueError:
            pass
    return mn   

def get_ind_next(s, targ):
   s = s.lower()
   return min((s.index(t) for t in map(str.lower, targ) if t in s), default=None)

The default argument of min() only exists in Python >= 3.4, so if you are using Python 2 you will have to change the logic slightly.
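One possible way to do without default is to rely on min() raising ValueError for an empty sequence. This is only a sketch, and the name get_ind_min_compat is made up for illustration:

```python
def get_ind_min_compat(s, targ):
    s = s.lower()
    try:
        # s.index cannot raise here because of the "t in s" filter;
        # the ValueError below comes from min() on an empty generator.
        return min(s.index(t) for t in map(str.lower, targ) if t in s)
    except ValueError:
        return None


print(get_ind_min_compat('Iamfoothegreat', ['foo', 'bar', 'grea', 'other']))  # 3
print(get_ind_min_compat('nothing here', ['foo', 'bar']))                     # None
```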

Timings python3:

In [29]: s = "hello world" * 5000
In [30]:  s += "grea" + s
In [25]: %%timeit
   ....: targ = [re.escape(x) for x in targets]
   ....: pattern = r"%(pattern)s" % {'pattern' : "|".join(targ)}
   ....: firstMatch = next(re.finditer(pattern, s, re.IGNORECASE),None)
   ....: if firstMatch:
   ....:     pass
   ....: 
100 loops, best of 3: 5.11 ms per loop
In [18]: timeit get_ind_next(s, targets)
1000 loops, best of 3: 691 µs per loop

In [19]: timeit get_ind(s, targets)
1000 loops, best of 3: 627 µs per loop

In [20]:  timeit  min([s.lower().find(x.lower()) for x in targets if x.lower() in s.lower()] or [0])
1000 loops, best of 3: 1.03 ms per loop

In [21]: s = 'Iamfoothegreat'
In [22]: targets = ['bar', 'grea', 'other','foo']
In [23]: get_ind_next(s, targets) == get_ind(s, targets) ==  min([s.lower().find(x.lower()) for x in targets if x.lower() in s.lower()] or [0])
Out[24]: True

Python2:

In [13]: s = "hello world" * 5000
In [14]:  s += "grea" + s

In [15]: targets = ['foo', 'bar', 'grea', 'other']
In [16]: timeit get_ind(s, targets)
1000 loops, best of 3: 322 µs per loop

In [17]:  timeit  min([s.lower().find(x.lower()) for x in targets if x.lower() in s.lower()] or [0])
1000 loops, best of 3: 710 µs per loop

In [18]: get_ind(s, targets) ==  min([s.lower().find(x.lower()) for x in targets if x.lower() in s.lower()] or [0])
Out[18]: True

You can also combine the first with min:

def get_ind(s, targ):
    s,mn = s.lower(), None
    for t in targ:
        try:
            mn = s.index(t.lower())
            yield mn
        except ValueError:
            pass
    yield mn

Which does the same job, it is just a bit nicer and maybe slightly faster:

In [45]: min(get_ind(s, targets))
Out[45]: 55000

In [46]: timeit min(get_ind(s, targets))
1000 loops, best of 3: 317 µs per loop

Use regex

Another option is to use a regex, because I think the Python regex implementation is very fast. My regex version is:

import re

given = 'IamFoothegreat'
targets = ['foo', 'bar', 'grea', 'other']

targets = [re.escape(x) for x in targets]    
pattern = r"%(pattern)s" % {'pattern' : "|".join(targets)}
firstMatch = next(re.finditer(pattern, given, re.IGNORECASE),None)
if firstMatch:
    print(firstMatch.start())
    print(firstMatch.group())

Output is

3
Foo

If nothing is found, nothing is printed. In that case firstMatch is None, so checking for a missing match is straightforward.
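The regex approach can be wrapped in a function that makes the "nothing found" case explicit. This is a sketch only: the name first_match is hypothetical, and it uses re.search, which also returns the leftmost match, in place of next(re.finditer(...)).

```python
import re


def first_match(given, targets):
    """Return (start, matched_text) for the leftmost target, or None."""
    if not targets:
        # An empty alternation would match the empty string at index 0.
        return None
    pattern = "|".join(re.escape(t) for t in targets)
    m = re.search(pattern, given, re.IGNORECASE)
    if m is None:
        return None
    # group() returns the text as it appears in given, not as written in targets.
    return m.start(), m.group()


print(first_match('IamFoothegreat', ['foo', 'bar', 'grea', 'other']))  # (3, 'Foo')
print(first_match('nothing here', ['foo', 'bar']))                     # None
```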

A more conventional approach, not really Pythonic

This also gives you the matched string:

given = 'Iamfoothegreat'.lower()  # lowercase once so the search ignores case
targets = ['foo', 'bar', 'grea', 'other']

# 'pos' stays -1 and 'string' stays None when no target is found
dct = {'pos': -1, 'string': None}

for t in targets:
    i = given.find(t)
    # keep the match with the smallest index seen so far
    if i > -1 and (i < dct['pos'] or dct['pos'] == -1):
        dct['pos'] = i
        dct['string'] = t

print(dct)

Output is:

{'pos': 3, 'string': 'foo'}

If element is not found:

{'pos': -1, 'string': None}

Performance comparison of both

With this string and these targets:

given = "hello world" * 5000
given += "grea" + given
targets = ['foo', 'bar', 'grea', 'other']

1000 loops with timeit:

regex approach: 4.08629107475 sec for 1000
normal approach: 1.80048894882 sec for 1000

10 loops. Now with much bigger targets (targets * 1000):

normal approach: 4.06895017624 sec for 10
regex approach: 34.8153910637 sec for 10
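The absolute numbers depend on the machine and the Python version. A comparison of this kind could be reproduced roughly as follows with the timeit module. The function names regex_approach, find_approach and compare are assumptions made for this sketch, not the code that produced the figures above.

```python
import re
import timeit


def regex_approach(given, targets):
    pattern = "|".join(re.escape(t) for t in targets)
    m = re.search(pattern, given, re.IGNORECASE)
    return m.start() if m else -1


def find_approach(given, targets):
    given = given.lower()
    hits = [i for i in (given.find(t.lower()) for t in targets) if i != -1]
    return min(hits) if hits else -1


given = "hello world" * 5000
given += "grea" + given
targets = ['foo', 'bar', 'grea', 'other']


def compare(number=10):
    # Time each approach on the same input and print the total seconds.
    for fn in (regex_approach, find_approach):
        secs = timeit.timeit(lambda: fn(given, targets), number=number)
        print("%s: %.4f sec for %d" % (fn.__name__, secs, number))


if __name__ == "__main__":
    compare()
```

Both functions should agree on the result before their timings are compared; here the first occurrence of 'grea' is at index 55000.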

You could use the following:

answer = min([given.lower().find(x.lower()) for x in targets 
    if x.lower() in given.lower()] or [0])

Demo 1

given = 'Iamfoothegreat'
targets = ['foo', 'bar', 'grea', 'other']

answer = min([given.lower().find(x.lower()) for x in targets 
    if x.lower() in given.lower()] or [0])
print(answer)

Output

3

Demo 2

given = 'this is a different string'
targets = ['foo', 'bar', 'grea', 'other']

answer = min([given.lower().find(x.lower()) for x in targets 
    if x.lower() in given.lower()] or [0])
print(answer)

Output

0

I also think that the following solution is quite readable:

given = 'the string'
targets = ('foo', 'bar', 'grea', 'other')

given = given.lower()

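for i in range(len(given)):
    # str.startswith accepts a tuple of prefixes and a start offset
    if given.startswith(targets, i):
        print(i)
        break
else:
    # the loop finished without break: no target occurs anywhere
    print(-1)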

Your code is fairly good, but you can make it a little more efficient by moving the .lower conversion out of the loop: there's no need to repeat it for each target substring. The code can be condensed a little using list comprehensions, although that doesn't necessarily make it faster. I use a nested list comp to avoid calling given.find(t) twice for each t.

I've wrapped my code in a function for easier testing.

def min_match(given, targets):
    given = given.lower()
    a = [i for i in [given.find(t) for t in targets] if i > -1]
    return min(a) if a else None

targets = ['foo', 'bar', 'grea', 'othe']

data = (
    'Iamfoothegreat', 
    'IAMFOOTHEGREAT', 
    'Iamfothgrease',
    'Iamfothgret',
)

for given in data:
    print(given, min_match(given, targets))    

output

Iamfoothegreat 3
IAMFOOTHEGREAT 3
Iamfothgrease 7
Iamfothgret None

Try this:

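def getFirst(given, targets):
    given = given.lower()  # the question asks to ignore case
    try:
        return min(i for i in (given.find(x.lower()) for x in targets) if i != -1)
    except ValueError:
        # min() raises ValueError when no target was found
        return 0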
