
Find the nth occurrence of substring in a string

This seems like it should be pretty trivial, but I am new to Python and want to do it in the most Pythonic way.

I want to find the index corresponding to the nth occurrence of a substring within a string.

There's got to be something equivalent to what I WANT to do, which is:

mystring.find("substring", 2nd)

How can you achieve this in Python?

Here's a more Pythonic version of the straightforward iterative solution:

def find_nth(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+len(needle))
        n -= 1
    return start

Example:

>>> find_nth("foofoofoofoo", "foofoo", 2)
6

If you want to find the nth overlapping occurrence of needle, you can increment by 1 instead of len(needle), like this:

def find_nth_overlapping(haystack, needle, n):
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start+1)
        n -= 1
    return start

Example:

>>> find_nth_overlapping("foofoofoofoo", "foofoo", 2)
3
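
To make the difference between the two variants concrete, here is a small self-contained check. It folds both functions into one with an `overlapping` flag (a hypothetical parameter added for this sketch only, not part of the answer above) and also shows what happens once n exceeds the number of occurrences.

```python
def find_nth(haystack, needle, n, overlapping=False):
    # `overlapping` is a hypothetical flag introduced for this sketch.
    step = 1 if overlapping else len(needle)
    start = haystack.find(needle)
    while start >= 0 and n > 1:
        start = haystack.find(needle, start + step)
        n -= 1
    return start

text = "foofoofoofoo"
print(find_nth(text, "foofoo", 2))                    # 6
print(find_nth(text, "foofoo", 2, overlapping=True))  # 3
print(find_nth(text, "foofoo", 3))                    # -1
print(find_nth(text, "foofoo", 3, overlapping=True))  # 6
print(find_nth(text, "foofoo", 4, overlapping=True))  # -1
```

Both variants count from 1 and return -1 when there is no nth occurrence, just like str.find() does for a missing substring.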

This is easier to read than Mark's version, and it doesn't require the extra memory of the splitting version or importing the regular expression module. It also adheres to a few of the rules in the Zen of Python, unlike the various re approaches:

  1. Simple is better than complex.
  2. Flat is better than nested.
  3. Readability counts.

Mark's iterative approach would be the usual way, I think.

Here's an alternative with string-splitting, which can often be useful for finding-related processes:

def findnth(haystack, needle, n):
    parts= haystack.split(needle, n+1)
    if len(parts)<=n+1:
        return -1
    return len(haystack)-len(parts[-1])-len(needle)
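
One thing that is easy to miss: this split-based findnth() counts from zero, so n=0 is the first occurrence. A quick sketch to illustrate (the sample string is just an assumption for the demo):

```python
def findnth(haystack, needle, n):
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

text = "foo bar bar bar"
print(findnth(text, "bar", 0))  # 4  (first occurrence)
print(findnth(text, "bar", 1))  # 8  (second occurrence)
print(findnth(text, "bar", 2))  # 12 (third occurrence)
print(findnth(text, "bar", 3))  # -1 (there are only three)
```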

And here's a quick (and somewhat dirty, in that you have to choose some chaff that can't match the needle) one-liner:

'foo bar bar bar'.replace('bar', 'XXX', 1).find('bar')

This will find the second occurrence of substring in string.

def find_2nd(string, substring):
   return string.find(substring, string.find(substring) + 1)

Edit: I haven't thought much about the performance, but a quick recursion can help with finding the nth occurrence:

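def find_nth(string, substring, n):
    if n == 1:
        return string.find(substring)
    previous = find_nth(string, substring, n - 1)
    if previous == -1:
        # Fewer than n - 1 occurrences, so there is no nth one either
        return -1
    return string.find(substring, previous + 1)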

Understanding that regex is not always the best solution, I'd probably use one here:

>>> import re
>>> s = "ababdfegtduab"
>>> [m.start() for m in re.finditer(r"ab",s)]
[0, 2, 11]
>>> [m.start() for m in re.finditer(r"ab",s)][2] #index 2 is third occurrence 
11
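
Two caveats with the regex route, sketched below: a needle containing regex metacharacters should go through re.escape(), and re.finditer() only reports non-overlapping matches unless the pattern is wrapped in a zero-width lookahead. The helper name `match_starts` and its `overlapping` flag are made up for this illustration.

```python
import re

def match_starts(haystack, needle, overlapping=False):
    # re.escape makes the needle match literally, even if it contains
    # regex metacharacters such as "." or "*".
    pattern = re.escape(needle)
    if overlapping:
        # A zero-width lookahead lets matches start inside earlier ones.
        pattern = "(?=" + pattern + ")"
    return [m.start() for m in re.finditer(pattern, haystack)]

print(match_starts("foofoofoofoo", "foofoo"))                    # [0, 6]
print(match_starts("foofoofoofoo", "foofoo", overlapping=True))  # [0, 3, 6]
print(match_starts("a.b.c", "."))                                # [1, 3]
```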

I'm offering some benchmarking results comparing the most prominent approaches presented so far, namely @bobince's findnth() (based on str.split()) vs. @tgamblin's or @Mark Byers' find_nth() (based on str.find()). I will also compare with a C extension (_find_nth.so) to see how fast we can go. Here is find_nth.py:

def findnth(haystack, needle, n):
    parts= haystack.split(needle, n+1)
    if len(parts)<=n+1:
        return -1
    return len(haystack)-len(parts[-1])-len(needle)

def find_nth(s, x, n=0, overlap=False):
    l = 1 if overlap else len(x)
    i = -l
    for c in xrange(n + 1):
        i = s.find(x, i + l)
        if i < 0:
            break
    return i

Of course, performance matters most if the string is large, so suppose we want to find the 1000001st newline ('\n') in a 1.3 GB file called 'bigfile'. To save memory, we would like to work on an mmap.mmap object representation of the file:

In [1]: import _find_nth, find_nth, mmap

In [2]: f = open('bigfile', 'r')

In [3]: mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

There is already the first problem with findnth(), since mmap.mmap objects don't support split(). So we actually have to copy the whole file into memory:

In [4]: %time s = mm[:]
CPU times: user 813 ms, sys: 3.25 s, total: 4.06 s
Wall time: 17.7 s

Ouch! Fortunately s still fits in the 4 GB of memory of my Macbook Air, so let's benchmark findnth():

In [5]: %timeit find_nth.findnth(s, '\n', 1000000)
1 loops, best of 3: 29.9 s per loop

Clearly a terrible performance. Let's see how the approach based on str.find() does:

In [6]: %timeit find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 774 ms per loop

Much better! Clearly, findnth()'s problem is that it is forced to copy the string during split(), which is already the second time we copied the 1.3 GB of data around after s = mm[:]. Here comes the second advantage of find_nth(): we can use it on mm directly, such that zero copies of the file are required:

In [7]: %timeit find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 1.21 s per loop

There appears to be a small performance penalty operating on mm vs. s, but this illustrates that find_nth() can get us an answer in 1.2 s compared to findnth's total of 47 s.
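
The session above is Python 2. For anyone trying this on Python 3, here is a rough sketch of the same idea under a few assumptions: range() replaces xrange(), the file is opened in binary mode, and the needle is a bytes object because mmap objects work on bytes. The tiny temporary file is only a stand-in for 'bigfile'.

```python
import mmap
import os
import tempfile

def find_nth(s, x, n=0, overlap=False):
    # Same logic as find_nth() above, with range() instead of xrange().
    step = 1 if overlap else len(x)
    i = -step
    for _ in range(n + 1):
        i = s.find(x, i + step)
        if i < 0:
            break
    return i

# Build a small stand-in for 'bigfile' (hypothetical test data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"line0\nline1\nline2\n")
    path = tmp.name

try:
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # n is zero-based here, so n=2 asks for the third newline.
            third_newline = find_nth(mm, b"\n", 2)
finally:
    os.remove(path)

print(third_newline)  # 17
```

Because mmap objects provide their own find() method, the function runs on the mapping directly without copying the file into a string first.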

I found no cases where the str.find() based approach was significantly worse than the str.split() based approach, so at this point, I would argue that @tgamblin's or @Mark Byers' answer should be accepted instead of @bobince's.
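
If you want to repeat the comparison without IPython or a 1.3 GB file, something along these lines should do. It is only a sketch: the synthetic input and the function names `findnth_split` and `find_nth_find` are assumptions for the demo, and the timings will differ from the numbers quoted here.

```python
import timeit

def findnth_split(haystack, needle, n):
    parts = haystack.split(needle, n + 1)
    if len(parts) <= n + 1:
        return -1
    return len(haystack) - len(parts[-1]) - len(needle)

def find_nth_find(s, x, n=0):
    i = -len(x)
    for _ in range(n + 1):
        i = s.find(x, i + len(x))
        if i < 0:
            break
    return i

# Synthetic input (an assumption, not the 1.3 GB file used above).
text = "some line of text\n" * 200000
n = 100000

# Both variants use a zero-based n and should agree on the result.
assert findnth_split(text, "\n", n) == find_nth_find(text, "\n", n)

for func in (findnth_split, find_nth_find):
    seconds = min(timeit.repeat(lambda: func(text, "\n", n), number=1, repeat=3))
    print(f"{func.__name__}: {seconds:.4f} s")
```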

In my testing, the version of find_nth() above was the fastest pure Python solution I could come up with (very similar to @Mark Byers' version). Let's see how much better we can do with a C extension module. Here is _find_nthmodule.c:

#include <Python.h>
#include <string.h>

off_t _find_nth(const char *buf, size_t l, char c, int n) {
    off_t i;
    for (i = 0; i < l; ++i) {
        if (buf[i] == c && n-- == 0) {
            return i;
        }
    }
    return -1;
}

off_t _find_nth2(const char *buf, size_t l, char c, int n) {
    const char *b = buf - 1;
    do {
        /* search only the bytes that remain after the previous hit */
        b = memchr(b + 1, c, l - (size_t)(b + 1 - buf));
        if (!b) return -1;
    } while (n--);
    return b - buf;
}

/* mmap_object is private in mmapmodule.c - replicate beginning here */
typedef struct {
    PyObject_HEAD
    char *data;
    size_t size;
} mmap_object;

typedef struct {
    const char *s;
    size_t l;
    char c;
    int n;
} params;

int parse_args(PyObject *args, params *P) {
    PyObject *obj;
    const char *x;

    if (!PyArg_ParseTuple(args, "Osi", &obj, &x, &P->n)) {
        return 1;
    }
    PyTypeObject *type = Py_TYPE(obj);

    if (type == &PyString_Type) {
        P->s = PyString_AS_STRING(obj);
        P->l = PyString_GET_SIZE(obj);
    } else if (!strcmp(type->tp_name, "mmap.mmap")) {
        mmap_object *m_obj = (mmap_object*) obj;
        P->s = m_obj->data;
        P->l = m_obj->size;
    } else {
        PyErr_SetString(PyExc_TypeError, "Cannot obtain char * from argument 0");
        return 1;
    }
    P->c = x[0];
    return 0;
}

static PyObject* py_find_nth(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth(P.s, P.l, P.c, P.n));
    } else {
        return NULL;    
    }
}

static PyObject* py_find_nth2(PyObject *self, PyObject *args) {
    params P;
    if (!parse_args(args, &P)) {
        return Py_BuildValue("i", _find_nth2(P.s, P.l, P.c, P.n));
    } else {
        return NULL;    
    }
}

static PyMethodDef methods[] = {
    {"find_nth", py_find_nth, METH_VARARGS, ""},
    {"find_nth2", py_find_nth2, METH_VARARGS, ""},
    {0}
};

PyMODINIT_FUNC init_find_nth(void) {
    Py_InitModule("_find_nth", methods);
}

Here is the setup.py file:

from distutils.core import setup, Extension
module = Extension('_find_nth', sources=['_find_nthmodule.c'])
setup(ext_modules=[module])

Install as usual with python setup.py install. The C code plays at an advantage here since it is limited to finding single characters, but let's see how fast this is:

In [8]: %timeit _find_nth.find_nth(mm, '\n', 1000000)
1 loops, best of 3: 218 ms per loop

In [9]: %timeit _find_nth.find_nth(s, '\n', 1000000)
1 loops, best of 3: 216 ms per loop

In [10]: %timeit _find_nth.find_nth2(mm, '\n', 1000000)
1 loops, best of 3: 307 ms per loop

In [11]: %timeit _find_nth.find_nth2(s, '\n', 1000000)
1 loops, best of 3: 304 ms per loop

Clearly quite a bit faster still. Interestingly, there is no difference on the C level between the in-memory and mmapped cases. It is also interesting to see that _find_nth2(), which is based on string.h's memchr() library function, loses out against the straightforward implementation in _find_nth(): the additional "optimizations" in memchr() are apparently backfiring...

In conclusion, the implementation in findnth() (based on str.split()) is really a bad idea, since (a) it performs terribly for larger strings due to the required copying, and (b) it doesn't work on mmap.mmap objects at all. The implementation in find_nth() (based on str.find()) should be preferred in all circumstances (and therefore be the accepted answer to this question).

There is still quite a bit of room for improvement, since the C extension ran almost a factor of 4 faster than the pure Python code, indicating that there might be a case for a dedicated Python library function.

Simplest way?

text = "This is a test from a test ok" 

firstTest = text.find('test')

print(text.find('test', firstTest + 1))

I'd probably do something like this, using the find function that takes an index parameter:

def find_nth(s, x, n):
    # Start one needle-length before 0 so the first search begins at index 0
    i = -len(x)
    for _ in range(n):
        i = s.find(x, i + len(x))
        if i == -1:
            break
    return i

print(find_nth('bananabanana', 'an', 3))

It's not particularly Pythonic I guess, but it's simple. You could do it using recursion instead:

def find_nth(s, x, n, i = 0):
    i = s.find(x, i)
    if n == 1 or i == -1:
        return i 
    else:
        return find_nth(s, x, n - 1, i + len(x))

print(find_nth('bananabanana', 'an', 3))

It's a functional way to solve it, but I don't know if that makes it more Pythonic.

This will give you a list of the starting indices of the matches in yourstring:

import re
indices = [s.start() for s in re.finditer(':', yourstring)]

Then your nth entry would be:

n = 2
nth_entry = indices[n-1]

Of course you have to be careful with the index bounds. You can get the number of matches in yourstring like this:

num_instances = len(indices)
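
Putting the bounds check and the lookup together could look roughly like this. The wrapper name `nth_index` and the choice to return -1 for an out-of-range n are assumptions for the sketch, picked to mirror str.find():

```python
import re

def nth_index(yourstring, pattern, n):
    # n is one-based, matching nth_entry = indices[n-1] above.
    indices = [m.start() for m in re.finditer(pattern, yourstring)]
    if 1 <= n <= len(indices):
        return indices[n - 1]
    return -1  # fewer than n matches, or n < 1

print(nth_index("a:b:c", ":", 2))  # 3
print(nth_index("a:b:c", ":", 3))  # -1
```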

Here is another approach using re.finditer. The difference is that this only looks into the haystack as far as necessary:

import re
from itertools import dropwhile
needle = 'an'
haystack = 'bananabanana'
n = 2  # zero-based: 2 selects the third occurrence
next(dropwhile(lambda x: x[0] < n, enumerate(re.finditer(needle, haystack))))[1].start()
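
The same lazy idea can also be written with itertools.islice, which skips the first n matches and stops right after the one we want. This is just an alternative sketch, not the snippet's author's code: the function name `find_nth_lazy` is made up, and returning -1 for a missing match is an assumption.

```python
import re
from itertools import islice

def find_nth_lazy(haystack, needle, n):
    # n is zero-based here, as in the snippet above.
    matches = re.finditer(re.escape(needle), haystack)
    # islice skips the first n matches and yields the next one, if any.
    match = next(islice(matches, n, n + 1), None)
    return match.start() if match is not None else -1

print(find_nth_lazy("bananabanana", "an", 2))  # 7
print(find_nth_lazy("bananabanana", "an", 4))  # -1
```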

Here's another re + itertools version that should work when searching for either a str or a RegexpObject. I will freely admit that this is likely over-engineered, but for some reason it entertained me.

import itertools
import re

def find_nth(haystack, needle, n = 1):
    """
    Find the starting index of the nth occurrence of ``needle`` in \
    ``haystack``.

    If ``needle`` is a ``str``, this will perform an exact substring
    match; if it is a ``RegexpObject``, this will perform a regex
    search.

    If ``needle`` doesn't appear in ``haystack``, return ``-1``. If
    ``needle`` doesn't appear in ``haystack`` ``n`` times,
    return ``-1``.

    Arguments
    ---------
    * ``needle`` the substring (or a ``RegexpObject``) to find
    * ``haystack`` is a ``str``
    * an ``int`` indicating which occurrence to find; defaults to ``1``

    >>> find_nth("foo", "o", 1)
    1
    >>> find_nth("foo", "o", 2)
    2
    >>> find_nth("foo", "o", 3)
    -1
    >>> find_nth("foo", "b")
    -1
    >>> import re
    >>> either_o = re.compile("[oO]")
    >>> find_nth("foo", either_o, 1)
    1
    >>> find_nth("FOO", either_o, 1)
    1
    """
    if (hasattr(needle, 'finditer')):
        matches = needle.finditer(haystack)
    else:
        matches = re.finditer(re.escape(needle), haystack)
    start_here = itertools.dropwhile(lambda x: x[0] < n, enumerate(matches, 1))
    try:
        return next(start_here)[1].start()
    except StopIteration:
        return -1

Building on modle13's answer, but without the re module dependency.

def iter_find(haystack, needle):
    return [i for i in range(0, len(haystack)) if haystack[i:].startswith(needle)]

I kinda wish this was a builtin string method.

>>> iter_find("http://stackoverflow.com/questions/1883980/", '/')
[5, 6, 24, 34, 42]
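
Note that slicing haystack[i:] at every position copies the tail of the string each time. A generator built on str.find() avoids that and can stop early. A minimal sketch (the name `iter_find_lazy` is made up for this example):

```python
def iter_find_lazy(haystack, needle):
    # Yield the start index of every (possibly overlapping) occurrence.
    i = haystack.find(needle)
    while i != -1:
        yield i
        i = haystack.find(needle, i + 1)

url = "http://stackoverflow.com/questions/1883980/"
print(list(iter_find_lazy(url, "/")))  # [5, 6, 24, 34, 42]
```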

For the special case where you search for the nth occurrence of a character (i.e. a substring of length 1), the following function works by building a list of all positions of occurrences of the given character:

def find_char_nth(string, char, n):
    """Find the nth occurrence of a character within a string."""
    return [i for i, c in enumerate(string) if c == char][n-1]

If there are fewer than n occurrences of the given character, it will give IndexError: list index out of range.

This is derived from @Zv_oDD's answer and simplified for the case of a single character.

>>> s="abcdefabcdefababcdef"
>>> j=0
>>> for n,i in enumerate(s):
...   if s[n:n+2] =="ab":
...     print(n,i)
...     j=j+1
...     if j==2: print("2nd occurrence at index position: ",n)
...
0 a
6 a
2nd occurrence at index position:  6
12 a
14 a

Providing another "tricky" solution, which uses split and join.

In your example, we can use

len("substring".join([s for s in ori.split("substring")[:2]]))

# return -1 if the nth substr (0-indexed) does not exist, else return its index
def find_nth(s, substr, n):
    i = -1
    while n >= 0:
        n -= 1
        i = s.find(substr, i + 1)
        if i == -1:
            break
    return i

Solution without using loops and recursion.

Pass the required pattern to re.compile and set the variable n to the desired occurrence; the last statement prints the starting index of that occurrence of the pattern in the given string. The iterator returned by finditer is converted to a list and indexed directly, so n is zero-based here (n=2 selects the third occurrence).

import re
n=2
sampleString="this is history"
pattern=re.compile("is")
matches=pattern.finditer(sampleString)
print(list(matches)[n].span()[0])

Here is my solution for finding the nth occurrence of b in string a:

from functools import reduce


def findNth(a, b, n):
    return reduce(lambda x, y: -1 if y > x + 1 else a.find(b, x + 1), range(n), -1)

It is pure Python and iterative. For n = 0 or an n that is too large, it returns -1. It is a one-liner and can be used directly. Here is an example:

>>> reduce(lambda x, y: -1 if y > x + 1 else 'bibarbobaobaotang'.find('b', x + 1), range(4), -1)
7
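
The reduce() call is fairly dense, so here is what I believe is an equivalent explicit loop, written out only to show how the lambda's two branches behave (the name `find_nth_loop` is made up for this sketch):

```python
def find_nth_loop(a, b, n):
    x = -1
    for y in range(n):
        # After y successful steps x is at least y - 1, so y > x + 1
        # can only be true once an earlier find() has returned -1.
        x = -1 if y > x + 1 else a.find(b, x + 1)
    return x

print(find_nth_loop("bibarbobaobaotang", "b", 4))  # 7
print(find_nth_loop("bibarbobaobaotang", "b", 0))  # -1
print(find_nth_loop("bibarbobaobaotang", "b", 9))  # -1
```

Since each step searches from x + 1, overlapping occurrences are counted.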

The replace one-liner is great, but it only works because XXX and bar have the same length.

A good and general def would be:

def findN(s, sub, N, replaceString="XXX"):
    i = s.replace(sub, replaceString, N - 1).find(sub)
    if i == -1:
        return -1
    # Undo the length change caused by the N-1 replacements
    return i - (len(replaceString) - len(sub)) * (N - 1)

Def:

def get_first_N_words(mytext, mylen = 3):
    mylist = list(mytext.split())
    if len(mylist)>=mylen: return ' '.join(mylist[:mylen])

To use:

get_first_N_words('  One Two Three Four ' , 3)

Output:

'One Two Three'

Avoid a failure or incorrect output when the requested occurrence is higher than the actual number of occurrences. For example, if you ask for the 3rd occurrence of 'o' in the string 'overflow' (it has only 2 occurrences), the code below prints a message indicating that the occurrence value has been exceeded:

Input Occurrence entered has exceeded the actual count of Occurrence.

def check_nth_occurrence(string, substr, n):

    # Count the occurrences of substr (str.count is non-overlapping)
    cnt = string.count(substr)

    # Check if the Occurrence input has exceeded the actual count of Occurrence
    if n > cnt:
        print('Input Occurrence entered has exceeded the actual count of Occurrence.')
        return None

    # Get the index value for the first occurrence of substr
    index = string.find(substr)

    # Get the index value for the nth occurrence of substr
    while index >= 0 and n > 1:
        index = string.find(substr, index + len(substr))
        n -= 1
    return index

Here's a simple and fun way to do it:

def index_of_nth(text, substring, n) -> int:
    index = 0
    for _ in range(n):
        index = text.index(substring, index) + 1
    return index - 1
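
Keep in mind that str.index() raises ValueError instead of returning -1, so this version throws when there are fewer than n occurrences. A short usage sketch (the sample string is just for illustration):

```python
def index_of_nth(text, substring, n) -> int:
    index = 0
    for _ in range(n):
        index = text.index(substring, index) + 1
    return index - 1

print(index_of_nth("bananabanana", "an", 3))  # 7

try:
    index_of_nth("bananabanana", "an", 5)
except ValueError:
    print("fewer than 5 occurrences")
```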

Just in case anyone wants to find the nth from the back:

def find_nth_reverse(haystack: str, needle: str, n: int) -> int:
    end = haystack.rfind(needle)

    while end >= 0 and n > 1:
        end = haystack.rfind(needle, 0, end)
        n -= 1

    return end
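
A quick usage sketch for searching from the back. Note that this sketch passes end itself as the upper bound to rfind(), so each step searches haystack[:end] and an occurrence directly adjacent to the previous hit is not skipped:

```python
def find_nth_reverse(haystack: str, needle: str, n: int) -> int:
    end = haystack.rfind(needle)
    while end >= 0 and n > 1:
        # Search only in haystack[:end], i.e. strictly before the last hit.
        end = haystack.rfind(needle, 0, end)
        n -= 1
    return end

print(find_nth_reverse("foo bar bar bar", "bar", 1))  # 12
print(find_nth_reverse("foo bar bar bar", "bar", 2))  # 8
print(find_nth_reverse("foo bar bar bar", "bar", 4))  # -1
print(find_nth_reverse("foofoo", "foo", 2))           # 0
```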

I used the findnth() function and ran into some issues, so I rewrote a faster version of the function (no list splitting):

def findnth(haystack, needle, n):
    if not needle in haystack or haystack.count(needle) < n:
        return -1

    last_index = 0
    cumulative_last_index = 0
    for i in range(0, n):
        last_index = haystack[cumulative_last_index:].find(needle)
        cumulative_last_index += last_index
        
        # if not last element, then jump over it
        if i < n-1:
            cumulative_last_index += len(needle)

    return cumulative_last_index

I solved it like this.

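from typing import Optional

def second_index(text: str, symbol: str) -> Optional[int]:
    """
    Return the index of the second occurrence of symbol in text,
    or None if there is no second occurrence.
    """
    first = text.find(symbol)
    if first == -1:
        return None
    result = text.find(symbol, first + 1)
    return result if result != -1 else None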

This is the answer you really want:

def Find(String, ToFind, Occurence=1):
    index = 0
    count = 0
    while index <= len(String) - len(ToFind):
        # Compare the slice starting at index with the target
        if String[index:index + len(ToFind)] == ToFind:
            count += 1
            if count == Occurence:
                return index
        index += 1
    # Return -1 rather than False, because False == 0 is also a valid index
    return -1

A simple solution for those with basic programming knowledge:

# Function to find the nth occurrence of a single character in a text
def findnth(text, substring, n):

    # variable to store current index in loop
    count = -1

    # n count
    occurrence = 0

    # loop through string
    for letter in text:

        # increment count
        count += 1

        # if current letter in loop matches the target character
        if letter == substring:

            # increment occurrence
            occurrence += 1

            # if this is the nth time the character is found
            if occurrence == n:

                # return its index
                return count

    # otherwise indicate there is no match
    return "No match"

# example of how to call function
print(findnth('C$100$150xx', "$", 2))

How about:

import os

c = os.getcwd().split('\\')
print('\\'.join(c[0:-2]))
