How to find all occurrences of a substring?

Python has string.find() and string.rfind() to get the index of a substring in a string.

I'm wondering whether there is something like string.find_all() which can return all found indexes (not only the first from the beginning or the first from the end).

For example:

string = "test test test test"

print(string.find('test'))   # 0
print(string.rfind('test'))  # 15

# this is the goal (str has no find_all method)
print(string.find_all('test'))  # [0, 5, 10, 15]

For counting the occurrences, see Count number of occurrences of a substring in a string.

There is no simple built-in string function that does what you're looking for, but you could use the more powerful regular expressions:

import re
[m.start() for m in re.finditer('test', 'test test test test')]
#[0, 5, 10, 15]

If you want to find overlapping matches, lookahead will do that:

[m.start() for m in re.finditer('(?=tt)', 'ttt')]
#[0, 1]

If you want a reverse find-all without overlaps, you can combine positive and negative lookahead into an expression like this:

search = 'tt'
[m.start() for m in re.finditer('(?=%s)(?!.{1,%d}%s)' % (search, len(search)-1, search), 'ttt')]
#[1]

re.finditer returns an iterator, so you could change the [] in the above to () to get a generator expression instead of a list, which will be more efficient if you're only iterating through the results once.
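
To make the lazy-iteration point concrete, here is a minimal sketch. The helper name iter_starts is made up for this illustration; it simply wraps the comprehension from above in a generator expression, so matches are only searched for as they are requested.

```python
import re

def iter_starts(pattern, text):
    """Lazily yield the start index of each non-overlapping match.

    iter_starts is a hypothetical helper name, not part of the re module.
    """
    # Parentheses instead of brackets: a generator expression, not a list
    return (m.start() for m in re.finditer(pattern, text))

starts = iter_starts('test', 'test test test test')
first = next(starts)   # only the first match has been located so far
rest = list(starts)    # consuming the generator finds the remaining ones
```

This is handy when you only need the first few hits of a long text; if you need to loop over the results more than once, build the list instead.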

>>> help(str.find)
Help on method_descriptor:

find(...)
    S.find(sub [,start [,end]]) -> int

Thus, we can build it ourselves:

def find_all(a_str, sub):
    start = 0
    while True:
        start = a_str.find(sub, start)
        if start == -1: return
        yield start
        start += len(sub) # use start += 1 to find overlapping matches

list(find_all('spam spam spam spam', 'spam')) # [0, 5, 10, 15]

No temporary strings or regexes required.
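
One caveat worth sketching: with an empty sub, len(sub) is 0, so a loop that advances by len(sub) never moves forward. The variant below is my own adaptation, not the answer's code; the name find_all_guarded, the overlapping parameter, and the choice to raise ValueError are all assumptions made for this illustration.

```python
def find_all_guarded(a_str, sub, overlapping=False):
    """Yield the start index of each occurrence of sub in a_str.

    Hypothetical variant: 'overlapping' and the ValueError are additions
    for illustration, not part of any built-in API.
    """
    if not sub:
        # An empty substring would otherwise never advance the search
        raise ValueError("sub must be a non-empty string")
    step = 1 if overlapping else len(sub)
    start = a_str.find(sub)
    while start != -1:
        yield start
        start = a_str.find(sub, start + step)

plain = list(find_all_guarded('spam spam spam spam', 'spam'))
overlaps = list(find_all_guarded('ttt', 'tt', overlapping=True))
```

Because this is a generator, the ValueError is raised when iteration starts, not when the function is called.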

Use re.finditer:

import re
sentence = input("Give me a sentence ")
word = input("What word would you like to find ")
for match in re.finditer(word, sentence):
    print (match.start(), match.end())

For word = "this" and sentence = "this is a sentence this this", this will yield the output:

(0, 4)
(19, 23)
(24, 28)
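
Since the word comes from user input here, it may contain characters such as "." or "(" that the regex engine treats as syntax. A small sketch of the same loop with re.escape applied first (the function name find_spans is only for illustration):

```python
import re

def find_spans(word, sentence):
    """Return (start, end) pairs for each literal occurrence of word."""
    # re.escape turns regex metacharacters in word into literals
    pattern = re.escape(word)
    return [(m.start(), m.end()) for m in re.finditer(pattern, sentence)]

spans = find_spans("this", "this is a sentence this this")
dotted = find_spans("a.b", "a.b axb a.b")  # '.' must not match 'x'
```

Without the escape, the pattern "a.b" would also match "axb" in the second call.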

Here's a (very inefficient) way to get all (i.e. even overlapping) matches:

>>> string = "test test test test"
>>> [i for i in range(len(string)) if string.startswith('test', i)]
[0, 5, 10, 15]

Again, old thread, but here's my solution using a generator and plain str.find.

def findall(p, s):
    '''Yields all the positions of
    the pattern p in the string s.'''
    i = s.find(p)
    while i != -1:
        yield i
        i = s.find(p, i+1)

Example

x = 'banananassantana'
[(i, x[i:i+2]) for i in findall('na', x)]

returns

[(2, 'na'), (4, 'na'), (6, 'na'), (14, 'na')]

You can use re.finditer() for non-overlapping matches.

>>> import re
>>> aString = 'this is a string where the substring "is" is repeated several times'
>>> print([(a.start(), a.end()) for a in list(re.finditer('is', aString))])
[(2, 4), (5, 7), (38, 40), (42, 44)]

but it won't find overlapping matches:

In [1]: aString="ababa"

In [2]: print([(a.start(), a.end()) for a in list(re.finditer('aba', aString))])
Output: [(0, 3)]
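
For the "ababa" case, one way around this is a zero-width lookahead, so each match consumes no characters and the next search starts one position later. A minimal sketch, with overlapping_spans as an illustrative name:

```python
import re

def overlapping_spans(sub, text):
    """Return (start, end) for every occurrence of sub, overlaps included."""
    # (?=...) matches at a position without consuming any characters,
    # so the end of each span has to be computed from len(sub)
    pattern = '(?=' + re.escape(sub) + ')'
    return [(m.start(), m.start() + len(sub)) for m in re.finditer(pattern, text)]

result = overlapping_spans('aba', 'ababa')
```

Note that m.end() would equal m.start() here, because a lookahead match is empty.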

Come, let us recurse together.

def locations_of_substring(string, substring):
    """Return a list of locations of a substring."""

    substring_length = len(substring)    
    def recurse(locations_found, start):
        location = string.find(substring, start)
        if location != -1:
            return recurse(locations_found + [location], location+substring_length)
        else:
            return locations_found

    return recurse([], 0)

print(locations_of_substring('this is a test for finding this and this', 'this'))
# prints [0, 27, 36]

No need for regular expressions this way.
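
A caveat on the recursive version: it uses one stack frame per match, and CPython does not eliminate tail calls, so a string with more matches than the recursion limit (1000 by default) raises RecursionError. Here is an iterative sketch of the same logic; the function name and the empty-substring guard are my own additions.

```python
def locations_of_substring_iter(string, substring):
    """Return a list of non-overlapping locations of substring in string."""
    if not substring:
        # Guard added for this sketch: an empty substring would loop forever
        return []
    locations = []
    start = string.find(substring)
    while start != -1:
        locations.append(start)
        # Continue just past the match that was found
        start = string.find(substring, start + len(substring))
    return locations

found = locations_of_substring_iter('this is a test for finding this and this', 'this')
many = locations_of_substring_iter('a' * 5000, 'a')  # well past the recursion limit
```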

If you're just looking for a single character, this would work:

string = "dooobiedoobiedoobie"
match = 'o'
from functools import reduce  # reduce is not a builtin in Python 3
reduce(lambda count, char: count + 1 if char == match else count, string, 0)
# produces 7

Also,

string = "test test test test"
match = "test"
len(string.split(match)) - 1
# produces 4

My hunch is that neither of these (especially #2) is terribly performant.
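
If counting is all you need, the built-in str.count covers both snippets without building intermediate lists. It counts non-overlapping occurrences:

```python
string = "dooobiedoobiedoobie"
char_count = string.count('o')      # single character

text = "test test test test"
word_count = text.count("test")     # whole substring, non-overlapping
```

This gives counts only, not positions, so it doesn't answer the original question by itself.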

This is an old thread, but I got interested and wanted to share my solution.

def find_all(a_string, sub):
    result = []
    k = 0
    while k < len(a_string):
        k = a_string.find(sub, k)
        if k == -1:
            return result
        else:
            result.append(k)
            k += 1 #change to k += len(sub) to not search overlapping results
    return result

It should return a list of positions where the substring was found. Please comment if you see an error or room for improvement.

This does the trick for me using re.finditer:

import re

text = 'This is sample text to test if this pythonic '\
       'program can serve as an indexing platform for '\
       'finding words in a paragraph. It can give '\
       'values as to where the word is located with the '\
       'different examples as stated'

#  find all occurrences of the word 'as' in the above text

find_the_word = re.finditer('as', text)

for match in find_the_word:
    print('start {}, end {}, search string \'{}\''.
          format(match.start(), match.end(), match.group()))

This thread is a little old but this worked for me:

numberString = "onetwothreefourfivesixseveneightninefiveten"
testString = "five"

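marker = 0
while marker < len(numberString):
    try:
        # look up the next occurrence once and reuse the result
        position = numberString.index(testString, marker)
        print(position)
        marker = position + 1
    except ValueError:
        # index() raises ValueError once there are no further matches
        print("No more occurrences")
        marker = len(numberString)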

You can try:

>>> string = "test test test test"
>>> for index, value in enumerate(string):
...     if string[index:index+len("test")] == "test":
...         print(index)
...
0
5
10
15

The solutions provided by others are all based on find() or other built-in methods.

What is the core basic algorithm to find all the occurrences of a substring in a string?

def find_all(string,substring):
    """
    Function: Returning all the index of substring in a string
    Arguments: String and the search string
    Return:Returning a list
    """
    length = len(substring)
    c=0
    indexes = []
    while c < len(string):
        if string[c:c+length] == substring:
            indexes.append(c)
        c=c+1
    return indexes
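
The loop above is the naive scan: it compares a slice of up to len(substring) characters at every position. A classic alternative is the Knuth-Morris-Pratt algorithm, which is a different technique from the one in this answer and avoids re-examining text characters. The sketch below is my own illustration of it (the name kmp_find_all is made up), not something str.find is documented to use.

```python
def kmp_find_all(text, pattern):
    """Return start indexes of all occurrences of pattern, overlaps included."""
    if not pattern:
        return []
    # failure[i]: length of the longest proper prefix of pattern[:i+1]
    # that is also a suffix of pattern[:i+1]
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    indexes = []
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]   # fall back without moving i backwards
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            indexes.append(i - k + 1)
            k = failure[k - 1]   # keep scanning, allowing overlaps
    return indexes
```

In practice the built-in methods are implemented in C and will usually be faster for everyday strings; this is mainly of interest for understanding the algorithm.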

You can also subclass str and use the function below as a method of the new class.

class newstr(str):
    def find_all(string,substring):
        """
        Function: Returning all the index of substring in a string
        Arguments: String and the search string
        Return:Returning a list
        """
        length = len(substring)
        c=0
        indexes = []
        while c < len(string):
            if string[c:c+length] == substring:
                indexes.append(c)
            c=c+1
        return indexes

Calling the method:

newstr.find_all('Do you find this answer helpful? then upvote this!','this')
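
For comparison, here is a sketch of the same idea used as a regular instance method, so it reads like the string.find_all() the question asked for. SearchableStr is a hypothetical class name chosen for this example.

```python
class SearchableStr(str):
    """A str subclass with a find_all method (illustrative only)."""

    def find_all(self, substring):
        # startswith(substring, i) checks for a match at offset i
        return [i for i in range(len(self)) if self.startswith(substring, i)]

text = SearchableStr('Do you find this answer helpful? then upvote this!')
positions = text.find_all('this')
```

Keep in mind that most str operations (slicing, upper(), concatenation) return a plain str, so the extra method is lost unless the result is wrapped again.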

When looking for a large number of keywords in a document, use flashtext:

from flashtext import KeywordProcessor
words = ['test', 'exam', 'quiz']
txt = 'this is a test'
kwp = KeywordProcessor()
kwp.add_keywords_from_list(words)
result = kwp.extract_keywords(txt, span_info=True)

Flashtext runs faster than regex on a large list of search words.
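
If adding a third-party package isn't an option, a standard-library sketch of the same multi-keyword search is a single alternation pattern. This is not a drop-in replacement: it matches anywhere, including inside longer words, and I make no claim about how its speed compares. The function name find_keywords is made up for this example.

```python
import re

def find_keywords(words, text):
    """Return (keyword, start, end) for each keyword hit in text."""
    # Longest alternatives first, so a longer keyword wins over its own prefix
    ordered = sorted(words, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(w) for w in ordered))
    return [(m.group(), m.start(), m.end()) for m in pattern.finditer(text)]

hits = find_keywords(['test', 'exam', 'quiz'], 'this is a test')
```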

This function does not look at every position inside the string, so it does not waste compute resources. My try:

def findAll(string,word):
    all_positions=[]
    next_pos=-1
    while True:
        next_pos=string.find(word,next_pos+1)
        if(next_pos<0):
            break
        all_positions.append(next_pos)
    return all_positions

To use it, call it like this:

result=findAll('this word is a big word man how many words are there?','word')

src = input() # we will find substring in this string
sub = input() # substring

res = []
pos = src.find(sub)
while pos != -1:
    res.append(pos)
    pos = src.find(sub, pos + 1)

You can try:

import re
str1 = "This dress looks good; you have good taste in clothes."
substr = "good"
result = [_.start() for _ in re.finditer(substr, str1)]
# result = [17, 32]

I think the cleanest solution is one without libraries or yield:

def find_all_occurrences(string, sub):
    index_of_occurrences = []
    current_index = 0
    while True:
        current_index = string.find(sub, current_index)
        if current_index == -1:
            return index_of_occurrences
        else:
            index_of_occurrences.append(current_index)
            current_index += len(sub)

find_all_occurrences(string, substr)

Note: the find() method returns -1 when it can't find anything.

The pythonic way (for a single character) would be:

mystring = 'Hello World, this should work!'
find_all = lambda c,s: [x for x in range(c.find(s), len(c)) if c[x] == s]

# s represents the search string
# c represents the character string

find_all(mystring,'o')    # will return all positions of 'o'
# [4, 7, 20, 26]

This is a solution to a similar question from HackerRank. I hope it helps.

import re
a = input()
b = input()
if b not in a:
    print((-1,-1))
else:
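    # collect the start index of every (possibly overlapping) match;
    # re.escape keeps regex metacharacters in b from being interpreted
    start_indc = [m.start() for m in re.finditer('(?=' + re.escape(b) + ')', a)]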
    for i in range(len(start_indc)):
        print((start_indc[i], start_indc[i]+len(b)-1))

Output:

aaadaa
aa
(0, 1)
(1, 2)
(4, 5)

If you only want to use numpy, here is a solution:

import numpy as np

S= "test test test test"
S2 = 'test'
inds = np.cumsum([len(k)+len(S2) for k in S.split(S2)[:-1]])- len(S2)
print(inds)

def find_index(string, let):
    enumerated = [place  for place, letter in enumerate(string) if letter == let]
    return enumerated

For example:

find_index("hey doode find d", "d") 

returns:

[4, 7, 13, 15]

Not exactly what OP asked, but you could also use the split function to get a list of the parts where the substring does not occur. OP didn't specify the end goal of the code, but if your goal is to remove the substrings anyway, then this could be a simple one-liner. There are probably more efficient ways to do this with larger strings; regular expressions would be preferable in that case.

# Extract all non-substrings
s = "an-example-string"
s_no_dash = s.split('-')
# >>> s_no_dash
# ['an', 'example', 'string']

# Or extract and join them into a sentence
s_no_dash2 = ' '.join(s.split('-'))
# >>> s_no_dash2
# 'an example string'
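
If removal or replacement really is the end goal, str.replace does it in one step, and re.split handles the case where there is more than one separator. A short sketch (the second sample string is made up for this illustration):

```python
import re

s = "an-example-string"
removed = s.replace('-', '')    # drop the substring entirely
spaced = s.replace('-', ' ')    # same result as ' '.join(s.split('-'))

# With several possible separators, a character class in re.split works
parts = re.split(r'[-_]', "an-example_string")
```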

Did a brief skim of other answers so apologies if this is already up there.

def count_substring(string, sub_string):
    c=0
    for i in range(len(string) - len(sub_string) + 1):
        if string[i:i+len(sub_string)] == sub_string:
            c+=1
    return c

if __name__ == '__main__':
    string = input().strip()
    sub_string = input().strip()
    
    count = count_substring(string, sub_string)
    print(count)
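
Note that a position-by-position count like this includes overlapping occurrences, whereas the built-in str.count does not. A small sketch of the difference using str.find, with count_overlapping as an illustrative name:

```python
def count_overlapping(string, sub_string):
    """Count occurrences of sub_string, including overlapping ones."""
    count = 0
    start = string.find(sub_string)
    while start != -1:
        count += 1
        # Advance by one character so overlapping matches are seen
        start = string.find(sub_string, start + 1)
    return count

with_overlaps = count_overlapping('ABCDCDC', 'CDC')
without_overlaps = 'ABCDCDC'.count('CDC')
```

Here 'CDC' occurs at index 2 and again at index 4, sharing the middle 'C', so the two counts differ.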

I ran into the same problem and did this:

hw = 'Hello oh World!'
list_hw = list(hw)
o_in_hw = []

while True:
    o = hw.find('o')
    if o != -1:
        o_in_hw.append(o)
        list_hw[o] = ' '
        hw = ''.join(list_hw)
    else:
        print(o_in_hw)
        break

I'm pretty new at coding, so you can probably simplify it (and if you plan to use it repeatedly, of course make it a function).

All in all, it works as intended for what I was doing.

Edit: Please note that this is for single characters only, and it will change your variable, so you have to copy the string into a new variable to preserve it. I didn't put that in the code because it's easy and the code is only meant to show how I made it work.

If you want to do it without re (regex), then:

find_all = lambda _str,_w : [ i for i in range(len(_str)) if _str.startswith(_w,i) ]

string = "test test test test"
print( find_all(string, 'test') ) # >>> [0, 5, 10, 15]

def find_index(word, letter):
    index_list=[]
    for i in range(len(word)) :
        if word[i]==letter:
            index_list.append(i)
    return index_list

index_of_e=find_index('Getacher','e')
print(index_of_e)  # will give [1, 6]

Here's a solution that I came up with, using an assignment expression (new feature since Python 3.8):

string = "test test test test"
phrase = "test"
start = -1
result = [(start := string.find(phrase, start + 1)) for _ in range(string.count(phrase))]

Output:

[0, 5, 10, 15]
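
A variation on the same idea, as a sketch: putting the assignment expression in a while condition avoids the separate string.count pass. Because it advances by one character, it also reports overlapping matches, where count (which is non-overlapping) would make the comprehension stop early. The function name is only for illustration, and Python 3.8+ is assumed.

```python
def find_all_walrus(string, phrase):
    """Return every index where phrase starts in string (overlaps included)."""
    result = []
    start = -1
    # Assign and test in one expression; stop when find() reports -1
    while (start := string.find(phrase, start + 1)) != -1:
        result.append(start)
    return result

indexes = find_all_walrus("test test test test", "test")
overlapping = find_all_walrus("ttt", "tt")
```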

To count the occurrences of each character in a given string and return them as a dictionary, e.g. for hello the result is {'h':1, 'e':1, 'l':2, 'o':1}:

def count(string):
   result = {}
   if(string):
     for i in string:
       result[i] = string.count(i)
     return result
   return {}

Or else you can do it like this:

from collections import Counter

def count(string):
    return Counter(string)
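
Counter gives the counts; if you want the positions of each character instead, which is closer to the original question, a similar one-pass sketch with defaultdict could look like this (char_positions is a made-up name):

```python
from collections import defaultdict

def char_positions(string):
    """Map each character to the list of indexes where it occurs."""
    positions = defaultdict(list)
    for index, char in enumerate(string):
        positions[char].append(index)
    return dict(positions)  # plain dict for easier printing and comparison

result = char_positions('hello')
```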

Try this, it worked for me!

x=input('enter the string')
y=input('enter the substring')
z,r=x.find(y),x.rfind(y)
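while z!=r:
        print(z,r,end=' ')
        # continue strictly between the two matches just reported
        z,r=x.find(y,z+len(y),r),x.rfind(y,z+len(y),r)
if z!=-1:
        print(z)  # one match left in the middle (or the only match)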

Please look at the code below:

#!/usr/bin/env python
# coding:utf-8
'''黄哥Python'''


def get_substring_indices(text, s):
    result = [i for i in range(len(text)) if text.startswith(s, i)]
    return result


if __name__ == '__main__':
    text = "How much wood would a wood chuck chuck if a wood chuck could chuck wood?"
    s = 'wood'
    print(get_substring_indices(text, s))

By slicing we find all the possible substrings, append them to a list, and find the number of times the target occurs using the count function:

s=input()
n=len(s)
l=[]
f=input()
print(s[0])
for i in range(0,n):
    for j in range(1,n+1):
        l.append(s[i:j])
if f in l:
    print(l.count(f))
