
Python: develop a non-greedy regex to match a specific pattern several times

I am developing a regex for a pattern that occurs in a file I want to process.

The file contains several articles, which all follow a similar pattern:

  1. starts with a line break, i.e. a newline
  2. then has some non-word characters on a line, followed by "Dokument xx von xx" and a newline
  3. that is followed by a body of characters
  4. ends with two newlines, followed by a line with non-word characters followed by "Copyright", more characters, and a newline
  5. one optional line containing non-word characters followed by more characters and a newline
  6. finally one line containing non-word characters followed by either "All Rights Reserved" or "Alle Rechte vorbehalten" and a newline

I am trying to come up with a non-greedy regex that accurately matches the start, body, and end of each article.

For 1-4 I have ^\\n\\W+Dokument.+?[\\r\\n][\\r\\n]\\W+Copyright[^\\n]+\\n

What is necessary for 5-6?

Do I actually need a dotall flag if I want to use this regex to match the pattern several times in a file?

I have been on this all day. Can someone with a fresh mind show me the missing bits?

Cheers, Andrew

You can use the following:

  5. one optional line containing non-word characters followed by more characters and a newline
(\W+?(?:(?!All|Alle).)+?\n)?
  6. one line containing non-word characters followed by either "All Rights Reserved" or "Alle Rechte vorbehalten" and a newline
\W+(All Rights Reserved|Alle Rechte vorbehalten)\n
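To see how the optional line behaves, here is a minimal sketch that joins the two fragments above and tries them on two made-up endings, one with the optional line and one without. The sample strings and the names TAIL, with_optional and without_optional are assumptions for illustration, not part of the original file.

```python
import re

# The two fragments from the answer, concatenated.
# (?!All|Alle) stops the optional line from swallowing the rights line.
TAIL = re.compile(
    r"(\W+?(?:(?!All|Alle).)+?\n)?"
    r"\W+(All Rights Reserved|Alle Rechte vorbehalten)\n"
)

# Hypothetical article endings, invented for this sketch.
with_optional = "*** Some optional line\n*** All Rights Reserved\n"
without_optional = "--- Alle Rechte vorbehalten\n"

m1 = TAIL.match(with_optional)
m2 = TAIL.match(without_optional)

# Group 1 holds the optional line, or None when it is absent.
print(repr(m1.group(1)), "|", m1.group(2))
print(repr(m2.group(1)), "|", m2.group(2))
```

Because the lookahead rejects any position where "All" or "Alle" begins, the optional group cannot consume the rights line, so the regex falls back to skipping the group when that line is missing.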

Combining 1-6:

^\W+Dokument.+?[\r\n][\r\n]\W+Copyright[^\n]+\n(\W+?(?:(?!All|Alle).)+?\n)?\W+?(?:All Rights Reserved|Alle Rechte vorbehalten)\n

See DEMO
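On the dotall question: in Python, "." does not match a newline unless re.DOTALL is set, and "^" only matches at the very start of the string unless re.MULTILINE is set. So for a multi-line body and several articles per file, the combined pattern appears to need both flags. The following is a rough sketch under that assumption; the sample text and the names PATTERN and sample are invented for illustration.

```python
import re

# Combined pattern from the answer, split over lines for readability.
PATTERN = (
    r"^\W+Dokument.+?[\r\n][\r\n]\W+Copyright[^\n]+\n"
    r"(\W+?(?:(?!All|Alle).)+?\n)?"
    r"\W+?(?:All Rights Reserved|Alle Rechte vorbehalten)\n"
)

# Hypothetical file content with two articles (assumed layout).
sample = (
    "\n*** Dokument 1 von 2\n"
    "Body line one.\nBody line two.\n"
    "\n*** Copyright 2015 Example Publisher\n"
    "*** Some optional line\n"
    "*** All Rights Reserved\n"
    "\n--- Dokument 2 von 2\n"
    "Zweiter Text.\n"
    "\n--- Copyright 2015 Beispiel Verlag\n"
    "--- Alle Rechte vorbehalten\n"
)

# MULTILINE lets ^ match after every newline; DOTALL lets the lazy
# .+? run across the line breaks inside the article body.
with_flags = [
    m.group(0)
    for m in re.finditer(PATTERN, sample, re.MULTILINE | re.DOTALL)
]

# Without DOTALL the body cannot span more than one line.
without_dotall = [
    m.group(0) for m in re.finditer(PATTERN, sample, re.MULTILINE)
]

print(len(with_flags), len(without_dotall))
```

With real data, the text would come from reading the file into one string first, since the pattern has to see several lines at once.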
