How to extract a substring between two markers?

Question

Let's say I have a string 'gfgfdAAA1234ZZZuijjk' and I want to extract just the '1234' part.

I only know the few characters directly before (AAA) and directly after (ZZZ) the part I am interested in (1234).

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

How can I do the same thing in Python?

Answer 1

Using regular expressions (documentation for further reference):

import re

text = 'gfgfdAAA1234ZZZuijjk'

m = re.search('AAA(.+?)ZZZ', text)
if m:
    found = m.group(1)

# found: 1234

or:

import re

text = 'gfgfdAAA1234ZZZuijjk'

try:
    found = re.search('AAA(.+?)ZZZ', text).group(1)
except AttributeError:
    # AAA, ZZZ not found in the original string
    found = '' # apply your error handling

# found: 1234
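If the markers may contain characters that are special in a regular expression (for example `(` or `.`), they need to be escaped before being put into the pattern. Here is a minimal sketch of a reusable helper. The function name `between` and its `default` parameter are made up for illustration and are not part of the answer above. It also uses `.*?` rather than `.+?`, so an empty middle part counts as a match.

```python
import re

def between(text, left, right, default=None):
    """Return the shortest text between left and right, or default."""
    # re.escape makes both markers match literally, even if they contain
    # regex metacharacters; re.DOTALL lets the capture span newlines.
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    m = re.search(pattern, text, re.DOTALL)
    return m.group(1) if m else default

print(between('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
print(between('price: (42) USD', '(', ')'))           # 42
print(between('no markers here', 'AAA', 'ZZZ'))       # None
```

Returning a default instead of raising keeps the call site free of try/except, at the cost of having to check the result.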

Answer 2

>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> start = s.find('AAA') + 3
>>> end = s.find('ZZZ', start)
>>> s[start:end]
'1234'
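Note that str.find returns -1 when a marker is missing, so the snippet above would quietly slice from the wrong position instead of failing. A small sketch that checks for this case is shown below. The name `between_find` is hypothetical, and returning None for a missing marker is just one possible choice.

```python
def between_find(s, left, right):
    """Slice out the text between left and right using str.find."""
    start = s.find(left)
    if start == -1:
        return None              # left marker not present
    start += len(left)           # works for markers of any length
    end = s.find(right, start)   # only look after the left marker
    if end == -1:
        return None              # right marker not present
    return s[start:end]

print(between_find('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
```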

Answer 3

Regular expression

import re
re.search(r"(?<=AAA).*?(?=ZZZ)", your_text).group(0)

The above as-is will fail with an AttributeError if there are no "AAA" and "ZZZ" in your_text.

String methods

your_text.partition("AAA")[2].partition("ZZZ")[0]

The above will return an empty string if "AAA" doesn't exist in your_text. If only "ZZZ" is missing, it returns everything after "AAA".

PS Python Challenge?

Answer 4

Surprised that nobody has mentioned this, which is my quick version for one-off scripts:

>>> x = 'gfgfdAAA1234ZZZuijjk'
>>> x.split('AAA')[1].split('ZZZ')[0]
'1234'
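The one-liner above raises IndexError when 'AAA' is missing, and returns everything after 'AAA' when only 'ZZZ' is missing. For anything beyond a one-off script, str.partition makes both cases detectable, because its middle element is empty when the separator was not found. This is a sketch under the assumption that both markers are non-empty strings. The name `between_partition` is made up for illustration.

```python
def between_partition(text, left, right):
    """Return the text between left and right, or None if either is absent."""
    # partition returns (head, separator, tail); separator is '' on a miss.
    _, found_left, rest = text.partition(left)
    if not found_left:
        return None
    middle, found_right, _ = rest.partition(right)
    if not found_right:
        return None
    return middle

print(between_partition('gfgfdAAA1234ZZZuijjk', 'AAA', 'ZZZ'))  # 1234
```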

Answer 5

You can do it using just one line of code, provided the part you want is the only run of digits in the string:

>>> import re
>>> re.findall(r'\d{1,5}', 'gfgfdAAA1234ZZZuijjk')
['1234']

Answer 6

import re
print(re.search('AAA(.*?)ZZZ', 'gfgfdAAA1234ZZZuijjk').group(1))

Answer 7

You can use the re module for that:

>>> import re
>>> re.compile(".*AAA(.*)ZZZ.*").match("gfgfdAAA1234ZZZuijjk").groups()
('1234',)

Answer 8

In Python, extracting a substring from a string can be done using the findall method in the regular expression (re) module.

>>> import re
>>> s = 'gfgfdAAA1234ZZZuijjk'
>>> ss = re.findall('AAA(.+)ZZZ', s)
>>> print(ss)
['1234']
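One caveat with the greedy `.+` used here: if the markers occur more than once, it captures everything from the first 'AAA' to the last 'ZZZ'. The short comparison below uses a made-up input string to illustrate the difference from the non-greedy `.+?`.

```python
import re

s = 'xAAA12ZZZyAAA34ZZZz'  # hypothetical input with two marked sections

greedy = re.findall('AAA(.+)ZZZ', s)   # .+ grabs as much as possible
lazy = re.findall('AAA(.+?)ZZZ', s)    # .+? stops at the nearest ZZZ

print(greedy)  # ['12ZZZyAAA34']
print(lazy)    # ['12', '34']
```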

Answer 9

With sed it is possible to do something like this with a string:

echo "$STRING" | sed -e "s|.*AAA\(.*\)ZZZ.*|\1|"

And this will give me 1234 as a result.

You could do the same with the re.sub function, using the same regex.

>>> re.sub(r'.*AAA(.*)ZZZ.*', r'\1', 'gfgfdAAA1234ZZZuijjk')
'1234'

In basic sed, a capturing group is written \(..\), but in Python it is written (..).
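As with sed, re.sub returns the input unchanged when the pattern does not match, so a failed extraction is easy to miss. One possible way to detect it is re.subn, which also reports how many substitutions were made. The helper name `sub_extract` below is hypothetical.

```python
import re

PATTERN = r'.*AAA(.*)ZZZ.*'

def sub_extract(text):
    """Return the captured group, or None if the pattern did not match."""
    # subn returns (new_string, number_of_substitutions_made).
    result, count = re.subn(PATTERN, r'\1', text)
    return result if count else None

print(sub_extract('gfgfdAAA1234ZZZuijjk'))  # 1234
print(sub_extract('no markers'))            # None
```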

Answer 10

text = 'I want to find a string between two substrings'
left = 'find a '
right = 'between two'

print(text[text.index(left)+len(left):text.index(right)])
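In the snippet above, the right marker is searched from the beginning of the text, so it can be found before the left marker. A variant that starts the second search after the left marker avoids this. This is only a sketch: the function name `between_index` is an assumption, and like the original it lets str.index raise ValueError when a marker is missing.

```python
def between_index(text, left, right):
    """Return the text between left and the next right after it."""
    start = text.index(left) + len(left)  # ValueError if left is missing
    end = text.index(right, start)        # search only after the left marker
    return text[start:end]

text = 'I want to find a string between two substrings'
print(repr(between_index(text, 'find a ', 'between two')))  # 'string '

# The right marker also appears before the left one here.
print(between_index('ZZZ first, then AAA1234ZZZ', 'AAA', 'ZZZ'))  # 1234
```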

Answer 11

>>> s = '/tmp/10508.constantstring'
>>> s.split('/tmp/')[1].split('constantstring')[0].strip('.')
'10508'

Answer 12

You can find the first occurrence of a substring with this function (by character index). You can also get what comes after a substring.

def FindSubString(strText, strSubString, Offset=None):
    try:
        Start = strText.find(strSubString)
        if Start == -1:
            return -1 # Not Found
        else:
            if Offset is None:
                Result = strText[Start+len(strSubString):]
            elif Offset == 0:
                return Start
            else:
                AfterSubString = Start+len(strSubString)
                Result = strText[AfterSubString:AfterSubString + int(Offset)]
            return Result
    except:
        return -1

# Example:

Text = "Thanks for contributing an answer to Stack Overflow!"
subText = "to"

print("Start of first substring in a text:")
start = FindSubString(Text, subText, 0)
print(start); print("")

print("Exact substring in a text:")
print(Text[start:start+len(subText)]); print("")

print("What is after substring \"%s\"?" %(subText))
print(FindSubString(Text, subText))

# Your answer:

Text = "gfgfdAAA1234ZZZuijjk"
subText1 = "AAA"
subText2 = "ZZZ"

AfterText1 = FindSubString(Text, subText1, 0) + len(subText1)
BeforeText2 = FindSubString(Text, subText2, 0)

print("\nYour answer:\n%s" %(Text[AfterText1:BeforeText2]))

Answer 13

Just in case somebody has to do the same thing that I did: I had to extract everything inside parentheses in a line. For example, if I have a line like 'US president (Barack Obama) met with ...' and I want to get only 'Barack Obama', this is the solution:

import re

regex = r'.*\((.*?)\).*'
matches = re.search(regex, line)
line = matches.group(1) + '\n'

I.e., you need to escape the parentheses with a backslash (\). This is more a question about regular expressions than about Python, though.

Also, in some cases you may see an r prefix before the regex definition. Without the r prefix, you need to double the backslashes, as in C. Here is more discussion on that.
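To make the point about the r prefix concrete, the two spellings below produce the same pattern string. The sample line is the one from this answer, and the variable names are only for illustration.

```python
import re

line = 'US president (Barack Obama) met with ...'

# Without the r prefix, every backslash has to be doubled.
plain = '\\((.*?)\\)'
# A raw string passes backslashes through to the regex engine unchanged.
raw = r'\((.*?)\)'

print(plain == raw)                   # True
print(re.search(raw, line).group(1))  # Barack Obama
```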

Answer 14

Using PyParsing:

import pyparsing as pp

word = pp.Word(pp.alphanums)

s = 'gfgfdAAA1234ZZZuijjk'
rule = pp.nestedExpr('AAA', 'ZZZ')
for match in rule.searchString(s):
    print(match)

which yields:

[['1234']]

Answer 15

A one-liner with Python 3.8:

text[text.find(start:='AAA')+len(start):text.find('ZZZ')]

Answer 16

Here's a solution without regex that also accounts for scenarios where the first substring contains the second substring. This function will only find a substring if the second marker is after the first marker.

def find_substring(string, start, end):
    len_until_end_of_first_match = string.find(start) + len(start)
    after_start = string[len_until_end_of_first_match:]
    return string[string.find(start) + len(start):len_until_end_of_first_match + after_start.find(end)]

Answer 17

Another way of doing it is using lists (supposing the substring you are looking for is made of digits only):

string = 'gfgfdAAA1234ZZZuijjk'
numbersList = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
output = []

for char in string:
    if char in numbersList: output.append(char)

print(f"output: {''.join(output)}")
### output: 1234

Answer 18

TypeScript. Gets the string in between two other strings.

Searches for the shortest string between prefixes and postfixes.

prefixes - string / array of strings / null (means search from the start).

postfixes - string / array of strings / null (means search until the end).

public getStringInBetween(str: string, prefixes: string | string[] | null, postfixes: string | string[] | null): string {
    if (typeof prefixes === 'string') {
        prefixes = [prefixes];
    }

    if (typeof postfixes === 'string') {
        postfixes = [postfixes];
    }

    if (!str || str.length < 1) {
        throw new Error(str + ' should contain ' + prefixes);
    }

    let start = prefixes === null ? { pos: 0, sub: '' } : this.indexOf(str, prefixes);
    const end = postfixes === null ? { pos: str.length, sub: '' } : this.indexOf(str, postfixes, start.pos + start.sub.length);

    let value = str.substring(start.pos + start.sub.length, end.pos);

    if (!value || value.length < 1) {
        throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
    }

    while (true) {
        try {
            start = this.indexOf(value, prefixes);
        } catch (e) {
            break;
        }
        value = value.substring(start.pos + start.sub.length);
        if (!value || value.length < 1) {
            throw new Error(str + ' should contain string in between ' + prefixes + ' and ' + postfixes);
        }
    }

    return value;
}

Answer 19

Also, you can find all combinations with the function below:

s = 'Part 1. Part 2. Part 3 then more text'

def find_all_places(text, word):
    # Collect the start index of every occurrence of word in text.
    word_places = []
    i = 0
    while True:
        word_place = text.find(word, i)
        if word_place < 0:
            break
        word_places.append(word_place)
        i = word_place + len(word)  # resume just past this occurrence
    return word_places

def find_all_combination(text, start, end):
    start_places = find_all_places(text, start)
    end_places = find_all_places(text, end)
    combination_list = []
    for start_place in start_places:
        for end_place in end_places:
            if start_place >= end_place:
                continue
            combination_list.append(text[start_place:end_place])
    return combination_list

print(find_all_combination(s, "Part", "Part"))

Result:

['Part 1. ', 'Part 1. Part 2. ', 'Part 2. ']

Answer 20

In case you want to look for multiple occurrences:

content ="Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"
strings = []
for c in content.split('Prefix_'):
    spos = c.find('_Suffix')
    if spos!=-1:
        strings.append( c[:spos])
print( strings )

Or, more concisely:

strings = [ c[:c.find('_Suffix')] for c in content.split('Prefix_') if c.find('_Suffix')!=-1 ]
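The same multiple-occurrence search could also be written with re.finditer, which additionally gives the position of each hit. This is a sketch that assumes the markers contain no regex metacharacters (otherwise wrap them in re.escape). The names `found` and `values` are just for illustration.

```python
import re

content = "Prefix_helloworld_Suffix_stuff_Prefix_42_Suffix_andsoon"

# Each item is (captured text, index in content where the capture starts).
found = [(m.group(1), m.start(1))
         for m in re.finditer(r"Prefix_(.*?)_Suffix", content)]
values = [value for value, _ in found]

print(values)  # ['helloworld', '42']
```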

Answer 21

One-liners that return another string if there was no match. Edit: the improved version uses the next function; replace "not-found" with something else if needed:

import re
res = next( (m.group(1) for m in [re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk" ),] if m), "not-found" )

My other method, which is less optimal, runs a regex a second time. I still haven't found a shorter way:

import re
res = ( ( re.search("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk") or re.search("()","") ).group(1) )
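On Python 3.8 or later, an assignment expression might be a shorter way to get the same fallback behaviour with a single regex call. This is a sketch, not part of the answer above. The helper name `first_group` is hypothetical.

```python
import re

def first_group(pattern, text, default="not-found"):
    """Return group 1 of the first match, or default if nothing matches."""
    # The walrus operator binds the match object inside the conditional
    # expression, so re.search runs only once.
    return m.group(1) if (m := re.search(pattern, text)) else default

print(first_group("AAA(.*?)ZZZ", "gfgfdAAA1234ZZZuijjk"))  # 1234
print(first_group("AAA(.*?)ZZZ", "no markers"))            # not-found
```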

21 answers

Answer 1: 775, accepted, 2011-01-12 09:18:56
Answer 2: 140, 2011-01-12 09:17:23
Answer 3: 102, 2011-02-06 23:43:17
Answer 4: 27, 2019-02-09 16:57:58
Answer 5: 17, 2018-01-11 11:39:55
Answer 6: 15, 2011-01-12 09:18:00
Answer 7: 8, 2011-01-12 09:19:21
Answer 8: 6, 2018-03-14 09:11:23
Answer 9: 5, 2015-01-31 08:29:21
Answer 10: 5, 2019-03-04 01:31:31
Answer 11: 4, 2014-02-08 00:12:43
Answer 12: 4, 2017-10-14 09:22:26
Answer 13: 3, 2014-01-19 19:29:00
Answer 14: 2, 2020-01-08 23:03:56
Answer 15: 2, 2021-06-18 19:20:35
Answer 16: 0, 2019-02-23 18:26:39
Answer 17: 0, 2019-10-12 00:30:49
Answer 18: 0, 2020-09-04 11:16:46
Answer 19: 0, 2021-10-05 19:02:30
Answer 20: 0, 2022-08-02 13:28:35
Answer 21: -1, 2017-12-07 00:55:20
