How to match paragraphs containing a specific pattern with regex?
I have the following paragraphs:

This is paragraph #1
New-York, London, Paris, Berlin
Some other text
End of paragraph

This is paragraph #2
London, Paris
End of paragraph

This is paragraph #3
New-York, Paris, Berlin
Some other text
End of paragraph

This is paragraph #4
End of paragraph

This is paragraph #5
Paris, Berlin
Some other text
End of paragraph

How can I, with a regex, match the paragraphs containing e.g. New-York (#1 and #3) or London (#1, #2), or even New-York AND Berlin (#1, #3)?
I have found an answer on SO, "How match a paragraph using regex", which allows me to match the paragraphs (all the text between two blank lines).
But I cannot figure out (my regex skills are… limited) how to match the paragraphs containing a specific pattern, and only those paragraphs.
Thanks in advance for your help.
NB: the idea is to use the answer in the Editorial iOS app to fold the answers NOT containing the pattern.
I see you might have no access to the Python code itself if you plan to use the pattern in the Editorial iOS app.
In that case, all I can suggest is a pattern like
(?m)^(?=.*(?:\r?\n(?!\r?\n).*)*?\bNew-York\b)(?=.*(?:\r?\n(?!\r?\n).*)*?\bBerlin\b).*(?:\r?\n(?!\r?\n).*)*
See the regex demo. Basically, we only match from the start of a line (^ with the (?m) modifier), we check whether New-York and Berlin occur as whole words (due to the \b word boundaries) anywhere on the lines before the first double line break and, if both are present, we match these lines.
Details
(?m)^ - start of a line
(?=.*(?:\r?\n(?!\r?\n).*)*?\bNew-York\b) - a positive lookahead that makes sure the whole word New-York occurs anywhere after 0+ chars other than line break chars (.*), optionally followed with 0+ consecutive sequences of a CRLF/LF line break that is not followed with another CRLF/LF line break, followed with the rest of the line
(?=.*(?:\r?\n(?!\r?\n).*)*?\bBerlin\b) - the same check for the whole word Berlin
.* - match the line
(?:\r?\n(?!\r?\n).*)* - match 0+ consecutive occurrences of:
    \r?\n(?!\r?\n) - a line break (CRLF or LF) not followed with another CRLF or LF
    .* - the rest of the line.
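For readers who do have Python at hand, here is a minimal sketch of the pattern above applied to the sample text. The helper name paragraph_pattern is hypothetical and only illustrates how one lookahead per required word can be generated; the sample text assumes paragraphs are separated by a single blank line. Because ^ also matches on the blank separator line, a match may begin with a line break, which is why the result is stripped.

```python
import re

# Sample text from the question; paragraphs are assumed to be separated by one blank line.
text = "\n\n".join([
    "This is paragraph #1\nNew-York, London, Paris, Berlin\nSome other text\nEnd of paragraph",
    "This is paragraph #2\nLondon, Paris\nEnd of paragraph",
    "This is paragraph #3\nNew-York, Paris, Berlin\nSome other text\nEnd of paragraph",
    "This is paragraph #4\nEnd of paragraph",
    "This is paragraph #5\nParis, Berlin\nSome other text\nEnd of paragraph",
])

# "Following lines of the same paragraph": a line break not followed by
# another line break, then the rest of that line, repeated 0+ times.
LINES = r"(?:\r?\n(?!\r?\n).*)*"

def paragraph_pattern(*words):
    """Hypothetical helper: one lookahead per required whole word."""
    lookaheads = "".join(
        r"(?=.*" + LINES + r"?\b" + re.escape(word) + r"\b)" for word in words
    )
    return re.compile(r"(?m)^" + lookaheads + r".*" + LINES)

# Paragraphs containing both New-York AND Berlin.
both = [m.group().strip()
        for m in paragraph_pattern("New-York", "Berlin").finditer(text)]

# Paragraphs containing London.
london = [m.group().strip()
          for m in paragraph_pattern("London").finditer(text)]

print([p.splitlines()[0] for p in both])
print([p.splitlines()[0] for p in london])
```

Here both should hold paragraphs #1 and #3, and london paragraphs #1 and #2, in line with the expectations stated in the question.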
Using the newer regex module, which supports empty splits:
import regex as re

# Paragraphs are separated by blank lines.
string = """
This is paragraph #1
New-York, London, Paris, Berlin
Some other text
End of paragraph

This is paragraph #2
London, Paris
End of paragraph

This is paragraph #3
New-York, Paris, Berlin
Some other text
End of paragraph

This is paragraph #4
End of paragraph

This is paragraph #5
Paris, Berlin
Some other text
End of paragraph
"""

# ^$ matches every empty line, so splitting there yields the paragraphs.
rx = re.compile(r'^$', flags=re.MULTILINE | re.VERSION1)

needle = 'New-York'

# Keep only the paragraphs that contain the needle.
interesting = [part
               for part in rx.split(string)
               if needle in part]

print(interesting)
# ['\nThis is paragraph #1\nNew-York, London, Paris, Berlin\nSome other text\nEnd of paragraph\n', '\nThis is paragraph #3\nNew-York, Paris, Berlin\nSome other text\nEnd of paragraph\n']
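As a side note, the same split may also work with the standard library, since re.split accepts patterns that can match an empty string from Python 3.7 onward. The sketch below assumes Python 3.7 or later and paragraphs separated by a single blank line; it is a variation on the approach above, not part of the original answer.

```python
import re

# Sample text from the question; paragraphs are assumed to be separated by one blank line.
text = "\n\n".join([
    "This is paragraph #1\nNew-York, London, Paris, Berlin\nSome other text\nEnd of paragraph",
    "This is paragraph #2\nLondon, Paris\nEnd of paragraph",
    "This is paragraph #3\nNew-York, Paris, Berlin\nSome other text\nEnd of paragraph",
    "This is paragraph #4\nEnd of paragraph",
    "This is paragraph #5\nParis, Berlin\nSome other text\nEnd of paragraph",
])

needle = "New-York"

# Split at every empty line (a zero-width match; requires Python 3.7+).
parts = re.split(r"^$", text, flags=re.MULTILINE)

# Keep the paragraphs containing the needle, without the surrounding line breaks.
interesting = [part.strip() for part in parts if needle in part]

print([p.splitlines()[0] for p in interesting])
```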
I think your specific case requires no regex at all:
[i for i,p in enumerate(mystr.split('\n\n')) if 'New-York' in p or 'London' in p]
In your case, this results in:
[0, 1, 2]
Obviously an and condition is just as easy, as is negating the if. enumerate is used only if you want the paragraph index. You don't need it if you want the paragraph itself. In any case, there is no need to force the use of a regex.
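To make those variations concrete, here is a short sketch under the same assumption as the one-liner above, namely that mystr holds the sample text with paragraphs separated by exactly one blank line ('\n\n'). The variable names both and neither are only illustrative.

```python
# Sample text from the question, joined with one blank line between paragraphs.
mystr = "\n\n".join([
    "This is paragraph #1\nNew-York, London, Paris, Berlin\nSome other text\nEnd of paragraph",
    "This is paragraph #2\nLondon, Paris\nEnd of paragraph",
    "This is paragraph #3\nNew-York, Paris, Berlin\nSome other text\nEnd of paragraph",
    "This is paragraph #4\nEnd of paragraph",
    "This is paragraph #5\nParis, Berlin\nSome other text\nEnd of paragraph",
])

paragraphs = mystr.split("\n\n")

# AND condition, returning the paragraphs themselves (no enumerate needed).
both = [p for p in paragraphs if "New-York" in p and "Berlin" in p]

# Negated condition, returning the indexes of the paragraphs that contain
# neither word, e.g. the ones to fold.
neither = [i for i, p in enumerate(paragraphs)
           if not ("New-York" in p or "London" in p)]

print([p.splitlines()[0] for p in both])
print(neither)
```

Note that a plain substring test such as 'New-York' in p does not check word boundaries, unlike the \b-based regex.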