Regular expression: match anything between specific patterns

Question

I am trying to come up with a regular expression that matches the pattern in which the articles in a text file of mine are arranged. (Note: "|" indicates a paragraph mark/line break, whereas "." indicates some non-word character.) Here is the pattern:

| 
...........................Dokument.1.von.55|
| 
|
|
..........................Some newspaper| 
| 
..........................Freitag 08. Mai 2015 
|
sometextsometextsometextsometextsometextsometextsometextsometextsometextsometext
sometextsometextsometextsometextsometextsometextsometextsometextsometextsometext
(etc..)
|
METAINFO1: IWOIOWIEOWEIWOEIWEO
| 
(etc... possibly more metainfo all capitalized) 
|
| 
.........................Copyright 2015 some publisher notes 
.........................at most one more single line containing copyright information
.........................Alle Rechte vorbehalten| 
# note: last line alternatively: All Rights Reserved 


|
(next pattern i.e. article)

(I had to anonymize it for copyright purposes)

I have created the following regular expression for extracting single articles:

Match the beginning of a line followed by a line break: ^[\\r\\n]
Match the line containing "Dokument....", preceded by non-word characters: [\\W]+Dokument \\d{1,} von \\d{1,}
Match any number of line breaks: [\\r\\n]+
Match any word and non-word characters (i.e. the article's text): [\\w\\W]+
Match a final newline character (last line before the next pattern starts): [\\r\\n]
Match any non-word characters and the string "Alle Rechte vorbehalten" or "All Rights Reserved": [\\W]+(Alle Rechte vorbehalten|All Rights Reserved)
Match the end of the line (final line): $

I have tested it in TextPad. When I do a backward search with the RE, it matches a single article at a time (as needed). But when I do a forward search, it matches the whole document.

At first I thought it matched each article separately, which merely looked as if it matched everything. But then I tried the replace option, and the replacement was made only once.

So the RE does not do its job. I have been working on this for some time now but cannot find my mistake.

What am I doing wrong? Is there an error in my RE?

I intend to match the articles, turn the working RE into a capturing group and then replace it with some XML. But I am stuck here.

Cheers, Andrew

Answer 1

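The trick is to make the part that matches the body of the article non-greedy and to have clearly defined start and end matches for each article. A greedy [\w\W]+ runs as far as it can, so in a forward search it only stops at the last "Alle Rechte vorbehalten" in the file. In Python, for example (re.S lets "." match newlines, and re.M lets "^" match at the start of every line rather than only at the start of the text):

import re
article_re = re.compile(r'^\n\W+Dokument.+?\n\W+Copyright[^\n]+\n(?:[^\n]+\n)?', flags=re.M | re.S)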
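
To see the effect of the non-greedy body, here is a minimal sketch that runs a greedy and a non-greedy version of the pattern over a made-up two-article sample. The sample text and the names SAMPLE, START, END, greedy and lazy are illustrative assumptions, and re.M is added so that "^" can match before the second article as well.

```python
import re

# Made-up sample with two articles (illustrative, not the real data).
SAMPLE = (
    "\n"
    "   Dokument 1 von 2\n"
    "\n"
    "first article text\n"
    "\n"
    "   Copyright 2015 publisher A\n"
    "   Alle Rechte vorbehalten\n"
    "\n"
    "   Dokument 2 von 2\n"
    "\n"
    "second article text\n"
    "\n"
    "   Copyright 2015 publisher B\n"
    "   All Rights Reserved\n"
)

# Shared start and end anchors; only the body quantifier differs.
START = r"^\n\W+Dokument"
END = r"\n\W+Copyright[^\n]+\n(?:[^\n]+\n)?"
FLAGS = re.M | re.S

greedy = re.compile(START + r".+" + END, FLAGS)   # body grabs as much as possible
lazy = re.compile(START + r".+?" + END, FLAGS)    # body stops at the first valid end

greedy_matches = greedy.findall(SAMPLE)
lazy_matches = lazy.findall(SAMPLE)

print("greedy:", len(greedy_matches), "match(es)")
print("lazy:  ", len(lazy_matches), "match(es)")
```

On this sample the greedy version should swallow both articles in a single match, while the non-greedy version should return one match per article.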

To reiterate the assumptions the pattern makes about an article:

It starts with an empty line, followed by a line of non-word characters followed by "Dokument".
It contains a body made up of any characters.
It ends with a newline, followed by a line of non-word characters followed by "Copyright", more characters and a newline.
It can optionally be followed by one more non-empty line ending in a newline.
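
Since the goal in the question is to replace each article with some XML, the following is a rough sketch of how the pattern could be used with re.sub. The <article> element, the wrap_articles helper and the sample text are hypothetical, and re.M is an addition here so that "^" matches before every article, not just the first one.

```python
import re
from xml.sax.saxutils import escape

# The pattern from this answer; re.M is added so "^" can match before every article.
ARTICLE_RE = re.compile(
    r"^\n\W+Dokument.+?\n\W+Copyright[^\n]+\n(?:[^\n]+\n)?",
    flags=re.M | re.S,
)


def wrap_articles(text):
    """Wrap every matched article in a hypothetical <article> element."""
    def to_xml(match):
        # Escape &, < and > so the article text cannot break the markup.
        body = escape(match.group(0).strip())
        return "\n<article>\n" + body + "\n</article>\n"
    return ARTICLE_RE.sub(to_xml, text)


# Made-up two-article sample (illustrative only).
SAMPLE = (
    "\n"
    "   Dokument 1 von 2\n"
    "\n"
    "   Some newspaper\n"
    "\n"
    "text about Meier & Sohn\n"
    "\n"
    "METAINFO1: ABC\n"
    "\n"
    "   Copyright 2015 publisher A\n"
    "   Alle Rechte vorbehalten\n"
    "\n"
    "   Dokument 2 von 2\n"
    "\n"
    "second article text\n"
    "\n"
    "   Copyright 2015 publisher B\n"
    "   All Rights Reserved\n"
)

wrapped = wrap_articles(SAMPLE)
print(wrapped)
```

If the copyright block can run to three lines, as the sample in the question suggests, the optional group could probably be widened to (?:[^\n]+\n){0,2}; that variant has not been checked against the real data.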

solution1
1 2015-06-05 10:16:02
