
Regular expression: match anything between two specific patterns

I am trying to come up with a regular expression that matches the pattern by which the articles in a text file are arranged. (Note: "|" indicates a paragraph mark/line break, whereas "." indicates a non-word character.) Here is the pattern:

| 
...........................Dokument.1.von.55|
| 
|
|
..........................Some newspaper| 
| 
..........................Freitag 08. Mai 2015 
|
sometextsometextsometextsometextsometextsometextsometextsometextsometextsometext
sometextsometextsometextsometextsometextsometextsometextsometextsometextsometext
(etc..)
|
METAINFO1: IWOIOWIEOWEIWOEIWEO
| 
(etc... possibly more metainfo all capitalized) 
|
| 
.........................Copyright 2015 some publisher notes 
.........................at most one more single line containing copyright information
.........................Alle Rechte vorbehalten| 
# note: last line alternatively: All Rights Reserved 


|
(next pattern i.e. article) 

(I had to anonymize it for copyright purposes)

I have created the following regular expression for extracting single articles:

  1. match beginning of the line followed by a line break ^[\\r\\n]
  2. match the line containing "Dokument...." preceded by non-word characters [\\W]+Dokument \\d{1,} von \\d{1,}
  3. match any number of line breaks [\\r\\n]+
  4. match any word and non-word characters (i.e. the article's text) [\\w\\W]+
  5. match a final newline character (last line before the next pattern starts) [\\r\\n]
  6. match any non-word characters and the string "Alle Rechte vorbehalten" or "All Rights Reserved" [\\W]+(Alle Rechte vorbehalten|All Rights Reserved)
  7. match end of the line (final line) $

Hence, the whole RE is ^[\\r\\n][\\W]+Dokument \\d{1,} von \\d{1,}[\\r\\n]+[\\w\\W]+[\\r\\n][\\W]+(Alle Rechte vorbehalten|All Rights Reserved)$

I have tested it with TextPad. When I do a backward search with the RE, it matches each single article (as needed). But when I do a forward search, it matches the whole document.

At first I thought it matched every article, which then looked as if it matched everything. But then I tried the replace option, with the result that my test term was replaced only once.

So the RE does not do its job. I have been working on this for some time now but cannot find my mistake.

What am I doing wrong? Is there an error in my RE?

I intend to match the articles, turn the working RE into a capturing group and then replace it with some XML. But I am stuck here.

Cheers, Andrew

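The trick is making the part that matches the body of the article non-greedy and having very clearly defined start and end matches for articles. In your RE, [\w\W]+ is greedy: on a forward search it runs from the first "Dokument" line to the last "Alle Rechte vorbehalten" in the file, which is why the whole document comes back as one match.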
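To see the difference in isolation, here is a minimal sketch on toy input. The START and END markers are made up for illustration and stand in for the "Dokument" and "Alle Rechte vorbehalten" lines.

```python
import re

# Two toy "articles", each delimited by hypothetical START ... END markers.
text = "START a END\nSTART b END\n"

# Greedy: [\w\W]+ consumes to the end of the input, then backtracks
# only as far as the LAST "END", so both articles end up in one match.
greedy = re.findall(r"START[\w\W]+END", text)

# Lazy: [\w\W]+? expands one character at a time and stops at the
# FIRST "END" that completes the match, giving one match per article.
lazy = re.findall(r"START[\w\W]+?END", text)

print(len(greedy))  # 1
print(len(lazy))    # 2
```

The only change between the two patterns is the trailing ? after the quantifier.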

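re.compile(r'^\n\W+Dokument.+?\n\W+Copyright[^\n]+\n(?:[^\n]+\n)?', flags=re.S | re.M)

re.S lets . match newlines, so the body can span several lines. re.M makes ^ match at the start of every line; without it, ^ only matches at the very start of the file and at most the first article can be found.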

Just to re-iterate the assumptions:

  • Starts with a newline, followed by a line with non-word characters followed by "Dokument"
  • Contains a body full of any characters.
  • Ends with a newline, followed by a line with non-word characters followed by "Copyright" followed by more characters and a newline.
  • Can optionally contain one more line of characters followed by a newline.
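These assumptions can be tried out on a small made-up sample. The sketch below is a variant of the pattern, not the exact one above: it assumes each article ends on the "Alle Rechte vorbehalten" / "All Rights Reserved" line shown in the question, uses re.M so that ^ matches at every line start, and wraps each match in a hypothetical <article> element. The sample text and the element name are invented for illustration.

```python
import re
from xml.sax.saxutils import escape

# Assumed variant: start on a blank line followed by a "Dokument N von M"
# line, lazily match the body, end on the rights line from the question.
ARTICLE = re.compile(
    r"^\n\W+Dokument \d+ von \d+.+?\n\W+"
    r"(?:Alle Rechte vorbehalten|All Rights Reserved)",
    flags=re.S | re.M,
)

# Invented sample with two articles; the second has an extra copyright line.
sample = (
    "\n"
    "     Dokument 1 von 2\n"
    "\n"
    "     Some newspaper\n"
    "\n"
    "     Freitag 08. Mai 2015\n"
    "\n"
    "first article text\n"
    "\n"
    "METAINFO1: ABC\n"
    "\n"
    "     Copyright 2015 some publisher\n"
    "     Alle Rechte vorbehalten\n"
    "\n"
    "     Dokument 2 von 2\n"
    "\n"
    "second article text\n"
    "\n"
    "     Copyright 2015 some publisher\n"
    "     one more copyright line\n"
    "     All Rights Reserved\n"
)

# One entry per article, with surrounding whitespace trimmed.
articles = [m.group(0).strip() for m in ARTICLE.finditer(sample)]

# Wrap each article in a hypothetical <article> element, escaping &, <, >.
xml = "\n".join(f"<article>{escape(a)}</article>" for a in articles)

print(len(articles))  # 2
```

Ending on the rights line rather than on the Copyright line keeps the match the same whether or not the extra copyright line is present.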
