I am trying to write a regular expression that matches the pattern in which the articles in a text file of mine are arranged. (Note: "|" indicates a paragraph mark/line break, whereas "." indicates some non-word character.) Here is the pattern:
|
...........................Dokument.1.von.55|
|
|
|
..........................Some newspaper|
|
..........................Freitag 08. Mai 2015
|
sometextsometextsometextsometextsometextsometextsometextsometextsometextsometext
sometextsometextsometextsometextsometextsometextsometextsometextsometextsometext
(etc..)
|
METAINFO1: IWOIOWIEOWEIWOEIWEO
|
(etc... possibly more metainfo all capitalized)
|
|
.........................Copyright 2015 some publisher notes
.........................at most one more single line containing copyright information
.........................Alle Rechte vorbehalten|
# note: last line alternatively: All Rights Reserved
|
(next pattern i.e. article)
(I had to anonymize it for copyright purposes)
I have created the following regular expression for extracting single articles:
^[\\r\\n]
[\\W]+Dokument \\d{1,} von \\d{1,}
[\\r\\n]+
[\\w\\W]+
[\\r\\n]
[\\W]+(Alle Rechte vorbehalten|All Rights Reserved)
$
Hence, the whole RE is ^[\\r\\n][\\W]+Dokument \\d{1,} von \\d{1,}[\\r\\n]+[\\w\\W]+[\\r\\n][\\W]+(Alle Rechte vorbehalten|All Rights Reserved)$
I have tested it with TextPad. When I search backwards, the RE matches each single article (as needed). But when I search forwards, it matches the whole document.
At first I thought it matched every article separately, which just looked as if it matched everything. But then I tried the replace option, and my test term was inserted only once.
So the RE does not do its job. I have been working on this for some time now but cannot find my mistake.
What am I doing wrong? Is there an error in my RE?
I intend to match the articles, turn the working RE into a capturing group and then replace each match with some XML. But I am stuck here.
Cheers, Andrew
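The trick is making the part that matches the body of the article non-greedy and having very clearly defined start and end matches for articles. In your expression, [\w\W]+ is greedy, so a forward search runs from the first "Dokument" header all the way to the last "Alle Rechte vorbehalten" in the file. In Python, also pass re.M so that ^ matches at the start of every line and not only at the start of the file:
import re
re.compile(r'^\n\W+Dokument.+?\n\W+Copyright[^\n]+\n(?:[^\n]+\n)?', flags=re.S | re.M)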
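To make the difference concrete, here is a minimal sketch that runs the pattern from the question in its greedy and non-greedy forms over a made-up two-article sample. The sample text, the names HEAD, TAIL, GREEDY, LAZY and to_xml, and the <article n="..."> element are all assumptions for illustration and are not taken from the real file. The sketch also assumes the padding before each header consists of spaces and that every article ends with one of the two "rights reserved" lines.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical two-article sample following the layout in the question
# (leading spaces stand in for the non-word padding).
SAMPLE = (
    "\n"
    "          Dokument 1 von 2\n"
    "\n"
    "          Some newspaper\n"
    "\n"
    "          Freitag 08. Mai 2015\n"
    "\n"
    "first article body\n"
    "\n"
    "METAINFO1: ABC\n"
    "\n"
    "          Copyright 2015 some publisher\n"
    "          Alle Rechte vorbehalten\n"
    "\n"
    "          Dokument 2 von 2\n"
    "\n"
    "          Other newspaper\n"
    "\n"
    "          Samstag 09. Mai 2015\n"
    "\n"
    "second article body\n"
    "\n"
    "METAINFO1: XYZ\n"
    "\n"
    "          Copyright 2015 another publisher\n"
    "          All Rights Reserved\n"
)

# Start and end markers, taken from the expression in the question.
# Group 1 captures the running document number.
HEAD = r'^\n\W+Dokument (\d+) von \d+\n'
TAIL = r'\n\W+(?:Alle Rechte vorbehalten|All Rights Reserved)$'

# Greedy body: runs on to the LAST end marker in the text.
GREEDY = re.compile(HEAD + r'[\w\W]+' + TAIL, re.M)
# Non-greedy body: stops at the FIRST end marker after each header.
LAZY = re.compile(HEAD + r'[\w\W]+?' + TAIL, re.M)


def to_xml(text):
    """Wrap every article in a (hypothetical) <article> element."""
    def wrap(match):
        body = escape(match.group(0).strip())
        return '<article n="%s">%s</article>' % (match.group(1), body)
    return LAZY.sub(wrap, text)


if __name__ == "__main__":
    print(len(list(GREEDY.finditer(SAMPLE))), "greedy match(es)")
    print(len(list(LAZY.finditer(SAMPLE))), "non-greedy match(es)")
    print(to_xml(SAMPLE))
```

The only difference between the two compiled patterns is the ? after [\w\W]+, which is what stops each match at the first end marker. re.M is needed here because the pattern relies on ^ and $ matching at line boundaries inside the text.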
Just to re-iterate the assumptions: