
Extracting a section of a string using regex with repeating ending words

I am attempting to extract some raw strings using the re module in Python. The end of a to-be-extracted section is identified by a word that is repeated multiple times in the text. My current attempts always capture up to the last occurrence of the repeating word. How can I modify this behavior?

A text file has been extracted from a PDF. The entire PDF is stored as one string. The general format of the string is as follows:

"***Start of notes: Collection of alphanumeric words and characters EndofsectionTopic A: string of words Endofsection"

The intended string to be captured is: "Collection of alphanumeric words and characters"

The attempted solution used in this situation was: re.compile(r"*{3}Start of notes:(.+)\sEndofsection")

This attempt tends to match the whole string rather than just "Collection of alphanumeric words and characters" as intended.

One possible approach is to split on Endofsection and then extract the string from the first section only. This works, but I was hoping to find a more elegant solution using re.compile.
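For reference, the split-based workaround described above might look roughly like the sketch below. It assumes the start marker is exactly "***Start of notes:" and uses str.partition rather than str.split so that only the first Endofsection matters; the variable names are illustrative only.

```python
s = ("***Start of notes: Collection of alphanumeric words and characters "
     "EndofsectionTopic A: string of words Endofsection")

# Keep only the text before the first "Endofsection".
head, sep, _ = s.partition("Endofsection")

# Drop everything up to and including the start marker, then trim whitespace.
marker = "***Start of notes:"  # assumed to be the exact start marker
notes = head.split(marker, 1)[1].strip() if sep and marker in head else None
print(notes)
```

This avoids regex entirely, at the cost of handling the start marker and the end marker in two separate steps.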

There are two problems in your regex:

  • You need to escape * as \*, because it is a metacharacter.
  • You are using (.+), which is a greedy quantifier and matches as much as possible. Since you want the shortest match, change it to the lazy form (.+?).

Fixing these two issues gives you the intended match.
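To see both problems side by side, here is a small comparison using the sample string from the question. In Python, a pattern that starts with an unescaped * should fail to compile with re.error ("nothing to repeat"), and the greedy and lazy versions differ only in where the capture stops. The variable names are just for illustration.

```python
import re

s = "***Start of notes: Collection of alphanumeric words and characters EndofsectionTopic A: string of words Endofsection"

# An unescaped leading * has nothing to repeat, so compilation fails.
try:
    re.compile(r"*{3}Start of notes:(.+)\sEndofsection")
    unescaped_ok = True
except re.error:
    unescaped_ok = False

# Greedy (.+) runs to the last "Endofsection"; lazy (.+?) stops at the first.
greedy = re.search(r"\*{3}Start of notes:(.+)\sEndofsection", s).group(1)
lazy = re.search(r"\*{3}Start of notes:(.+?)\sEndofsection", s).group(1)

print(unescaped_ok)
print(repr(greedy))
print(repr(lazy))
```

The greedy capture swallows the "Topic A" section as well, which is the behavior described in the question.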


Python code:

import re

s = "***Start of notes: Collection of alphanumeric words and characters EndofsectionTopic A: string of words Endofsection"
# \* escapes the literal asterisks; (.+?) is lazy, so it stops at the first "Endofsection".
m = re.search(r'\*{3}Start of notes:(.+?)\sEndofsection', s)
if m:
    print(m.group(1))

Prints:

 Collection of alphanumeric words and characters
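Note that the captured text keeps the space that follows the colon. If that is unwanted, or if the text extracted from the PDF contains line breaks (an assumption, since the question only shows a single-line sample), a variant along these lines may help. NOTES_RE is a hypothetical name, and the newline in the sample string is added here purely to illustrate re.DOTALL.

```python
import re

# Hypothetical name; compiled once so it can be reused across documents.
# \s* after the colon skips leading whitespace, and re.DOTALL lets "."
# match newlines in case the section spans several lines.
NOTES_RE = re.compile(r"\*{3}Start of notes:\s*(.+?)\s+Endofsection", re.DOTALL)

# Sample string with an artificial newline inside the notes section.
s = "***Start of notes: Collection of alphanumeric\nwords and characters EndofsectionTopic A: string of words Endofsection"

m = NOTES_RE.search(s)
notes = m.group(1) if m else None
print(repr(notes))
```

Without re.DOTALL, "." does not match a newline, so a section that spans several lines would not match at all.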
