How to match paragraphs containing a specific pattern with a regex?

Question

I have the following paragraphs:

This is paragraph #1
New-York, London, Paris, Berlin
Some other text
End of paragraph

This is paragraph #2
London, Paris
End of paragraph

This is paragraph #3
New-York, Paris, Berlin
Some other text
End of paragraph

This is paragraph #4
End of paragraph

This is paragraph #5
Paris, Berlin
Some other text
End of paragraph

How can I, with a regex, match the paragraphs containing e.g. New-York (#1 and #3) or London (#1, #2), or even New-York AND Berlin (#1, #3)?

I have found an answer on SO,

How match a paragraph using regex

which allows me to match the paragraphs (all the text between two blank lines).

But I cannot figure out (my regex skills are... limited) how to match the paragraphs containing a specific pattern, and only those paragraphs.

Thanks in advance for your help.

NB: the idea is to use the answer in the Editorial iOS app to fold the answers NOT containing the pattern.

Answer 1

I see you might have no access to the Python code itself if you plan to use the pattern in the Editorial iOS app.

Then, all I can suggest is a pattern like

(?m)^(?=.*(?:\r?\n(?!\r?\n).*)*?\bNew-York\b)(?=.*(?:\r?\n(?!\r?\n).*)*?\bBerlin\b).*(?:\r?\n(?!\r?\n).*)*

See the regex demo. Basically, we only match from the start of a line (^ with the (?m) modifier), check that New-York and Berlin both occur as whole words (due to the \b word boundaries) anywhere on the lines before the first double line break and, if they do, match those lines.

Details

(?m)^ - start of a line
(?=.*(?:\r?\n(?!\r?\n).*)*?\bNew-York\b) - a positive lookahead that makes sure the whole word New-York appears somewhere after 0+ chars other than line break chars (.*), optionally followed by 0+ consecutive sequences of a CRLF/LF line break that is not followed by another CRLF/LF line break, plus the rest of that line
(?=.*(?:\r?\n(?!\r?\n).*)*?\bBerlin\b) - the same check for the whole word Berlin
.* - match the line
(?:\r?\n(?!\r?\n).*)* - match 0+ consecutive occurrences of:
- \r?\n(?!\r?\n) - a line break (CRLF or LF) not followed by another CRLF or LF
- .* - the rest of the line.
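For readers who can run Python, here is a minimal sketch of the pattern above applied to the question's sample text. The helper names (paragraph_pattern, first_lines, REST) are made up for illustration and are not part of the original answer. One detail worth noting: for every paragraph after the first, the match can start on the preceding blank line, so the matched text may begin with a line break; the sketch strips it.

```python
import re

TEXT = """This is paragraph #1
New-York, London, Paris, Berlin
Some other text
End of paragraph

This is paragraph #2
London, Paris
End of paragraph

This is paragraph #3
New-York, Paris, Berlin
Some other text
End of paragraph

This is paragraph #4
End of paragraph

This is paragraph #5
Paris, Berlin
Some other text
End of paragraph"""

# Following lines of the same paragraph: line breaks that are not
# immediately followed by another line break, each with the rest of its line.
REST = r"(?:\r?\n(?!\r?\n).*)*"

def paragraph_pattern(*words):
    """Build the pattern from the answer: one lookahead per required word."""
    lookaheads = "".join(
        rf"(?=.*{REST}?\b{re.escape(w)}\b)" for w in words
    )
    return re.compile(rf"(?m)^{lookaheads}.*{REST}")

def first_lines(pattern, text):
    """Return the first line of every matched paragraph."""
    # strip() drops the leading line break a match may carry over
    # from the blank line in front of the paragraph.
    return [m.group().strip().splitlines()[0] for m in pattern.finditer(text)]

print(first_lines(paragraph_pattern("New-York", "Berlin"), TEXT))
# ['This is paragraph #1', 'This is paragraph #3']
print(first_lines(paragraph_pattern("London"), TEXT))
# ['This is paragraph #1', 'This is paragraph #2']
```

Each extra required word adds one more lookahead, which is how the AND condition from the question is expressed.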

Answer 2

Using the newer regex module, which supports empty splits:

import regex as re

string = """
This is paragraph #1
New-York, London, Paris, Berlin
Some other text
End of paragraph

This is paragraph #2
London, Paris
End of paragraph

This is paragraph #3
New-York, Paris, Berlin
Some other text
End of paragraph

This is paragraph #4
End of paragraph

This is paragraph #5
Paris, Berlin
Some other text
End of paragraph
"""

rx = re.compile(r'^$', flags = re.MULTILINE | re.VERSION1)

needle = 'New-York'

interesting = [part 
    for part in rx.split(string)
    if needle in part]

print(interesting)
# ['\nThis is paragraph #1\nNew-York, London, Paris, Berlin\nSome other text\nEnd of paragraph\n', '\nThis is paragraph #3\nNew-York, Paris, Berlin\nSome other text\nEnd of paragraph\n']
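If installing the third-party regex module is not an option, a similar split can be sketched with the standard library alone. This swaps the technique: instead of splitting on empty matches of ^$, it splits on the blank-line separator itself, and it also tolerates CRLF line endings and whitespace-only lines. The names BLANK_LINES and split_paragraphs, and the sample string, are illustrative assumptions.

```python
import re

# A paragraph separator: a line break followed by one or more
# empty (or whitespace-only) lines.
BLANK_LINES = re.compile(r"\r?\n(?:[ \t]*\r?\n)+")

def split_paragraphs(text):
    """Split text into paragraphs, dropping empty pieces."""
    return [p for p in BLANK_LINES.split(text.strip()) if p]

# Mixed line endings and a whitespace-only separator line, on purpose.
sample = "one\r\nNew-York\r\n\r\ntwo\nLondon\n \t\nthree\nNew-York, Berlin"

hits = [p for p in split_paragraphs(sample) if "New-York" in p]
print(hits)
# ['one\r\nNew-York', 'three\nNew-York, Berlin']
```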

Answer 3

I think your specific case requires no regex at all:

[i for i,p in enumerate(mystr.split('\n\n')) if 'New-York' in p or 'London' in p]

In your case resulting in:

[0, 1, 2]

Obviously an and condition is just as easy, as is negating the if. enumerate is used only if you want the paragraph index; you don't need it if you want the paragraph itself. There is no need to force a regex, in any case.
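The and, or, and negated variants mentioned above could look like the following sketch, which returns the paragraphs themselves rather than their indexes. The function name paragraphs_with and its require_all parameter are hypothetical; it assumes paragraphs are separated by exactly one blank line with LF line endings, as in the question.

```python
TEXT = """This is paragraph #1
New-York, London, Paris, Berlin
Some other text
End of paragraph

This is paragraph #2
London, Paris
End of paragraph

This is paragraph #3
New-York, Paris, Berlin
Some other text
End of paragraph

This is paragraph #4
End of paragraph

This is paragraph #5
Paris, Berlin
Some other text
End of paragraph"""

def paragraphs_with(text, needles, require_all=True):
    """Return the paragraphs containing all (or any) of the needles."""
    combine = all if require_all else any
    return [p for p in text.split("\n\n")
            if combine(n in p for n in needles)]

both = paragraphs_with(TEXT, ["New-York", "Berlin"])                       # AND
either = paragraphs_with(TEXT, ["New-York", "London"], require_all=False)  # OR
neither = [p for p in TEXT.split("\n\n") if "New-York" not in p]           # negated

print([p.splitlines()[0] for p in both])
# ['This is paragraph #1', 'This is paragraph #3']
print([p.splitlines()[0] for p in either])
# ['This is paragraph #1', 'This is paragraph #2', 'This is paragraph #3']
print([p.splitlines()[0] for p in neither])
# ['This is paragraph #2', 'This is paragraph #4', 'This is paragraph #5']
```

Plain substring tests do not respect word boundaries, so 'York' would also match inside 'New-York'; that is fine for the sample data but worth keeping in mind.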
