Extract all substrings between two markers

Question

I have a string:

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"

What I want is a list of the substrings between the markers start="&marker1" and end="/\n". So the expected result is:

whatIwant = ["The String that I want", "Another string that I want"]

I read the answers here:

and tried this, without success:

>>> import re
>>> mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
>>> whatIwant = re.search("&marker1(.*)/\n", mystr)
>>> whatIwant.group(1)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'group'

What can I do to fix this? Also, my actual string is very long:

>>> len(myactualstring)
7792818

Answer 1

"What can I do to fix this?" I would do:

import re
mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
found = re.findall(r"\&marker1\n(.*?)/\n", mystr)
print(found)

Output:

['The String that I want ', 'Another string that I want ']

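Notes:

& has no special meaning in a Python re pattern, so escaping it ( \& ) is harmless but not required; a plain & matches a literal &
. matches any character except a newline, which is why the pattern from the question finds nothing: the marker is immediately followed by \n
findall is a better fit than search when you just want a list of the matched substrings
*? is non-greedy; in this case .* would also work because . does not match a newline, but in other cases a greedy match may run past the end marker you intended
I used a so-called raw string (r prefix) to make escaping easier

See the re module documentation for a discussion of raw strings and the list of characters that have special meaning.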
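
The behaviour of . and *? can be checked directly. This is a minimal sketch that reuses the mystr from the question; the re.DOTALL variants are added here purely for illustration and are not part of the answer above.

```python
import re

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"

# The pattern from the question: "." cannot cross the newline that
# follows the marker, so no match is found and search returns None.
failed = re.search("&marker1(.*)/\n", mystr)
print(failed)

# With re.DOTALL, "." also matches newlines. A greedy ".*" then runs
# on to the LAST "/\n" and merges both records into one match.
greedy = re.findall(r"&marker1\n(.*)/\n", mystr, flags=re.DOTALL)
print(len(greedy))

# The non-greedy ".*?" stops at the FIRST "/\n" after each marker.
lazy = re.findall(r"&marker1\n(.*?)/\n", mystr, flags=re.DOTALL)
print(lazy)
```

The non-greedy form is the safer default whenever the wanted text might itself span several lines.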

Answer 2

Consider this option using re.findall:

import re
mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
matches = re.findall(r'&marker1\n(.*?)\s*/\n', mystr)
print(matches)

This prints:

['The String that I want', 'Another string that I want']

Here is an explanation of the regex pattern:

&marker1      match a marker
\n            newline
(.*?)         match AND capture all content until reaching the first
\s*           optional whitespace, followed by
/\n           / and newline

Note that re.findall returns only what is matched inside the (...) capture group, which is the content you want to extract.
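
For a very long input such as the roughly 7.8 million character string mentioned in the question, it may be preferable to iterate over matches rather than build the whole list at once. The sketch below assumes the same pattern as above; the helper name iter_between is hypothetical and exists only for this example.

```python
import re

# Same pattern as in the answer above, compiled once for reuse.
PATTERN = re.compile(r"&marker1\n(.*?)\s*/\n")

def iter_between(text):
    """Yield each captured substring lazily (hypothetical helper)."""
    for match in PATTERN.finditer(text):
        # group(1) is the content of the (...) capture group
        yield match.group(1)

mystr = "&marker1\nThe String that I want /\n&marker1\nAnother string that I want /\n"
results = list(iter_between(mystr))
print(results)
```

re.finditer yields match objects one at a time, so each substring can be processed as it is found instead of holding every result in memory.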
