I'm pretty new to regex, so I'm sure I'm missing something obvious, but I need a hand with the following problem.
I want to extract the text that follows a specific substring. I am working from a list of scanned documents; given the example string below, I want to extract everything after "FORENAME".
This is what I have done so far:
import re

regex = r"(?<=(FORE))[A-Z]+"
test_str = 'UNIQUE NUMBER 12345 678910 11 FROM THIS DOCUMENT | . ISSUED ON 2011-04-04 FORENAME GUIDO \\ SURNAME VAN ROSSUM. '
matches = re.finditer(regex, test_str)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    for groupNum in range(0, len(match.groups())):
        groupNum = groupNum + 1
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
Which returns the following:
Match 1 was found at 78-82: NAME
Group 1 found at 74-78: FORE
What I want it to return is:
GUIDO \ SURNAME VAN ROSSUM.
Thanks!
Based on the above, you can use:
import re
test_str = 'UNIQUE NUMBER 12345 678910 11 FROM THIS DOCUMENT | . ISSUED ON 2011-04-04 FORENAME GUIDO \\ SURNAME VAN ROSSUM.'
result = re.sub(r"^.*FORENAME(.*?)$", r"\1", test_str)
print(result)
# GUIDO \ SURNAME VAN ROSSUM.
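For what it's worth, the lookbehind in the question, (?<=(FORE))[A-Z]+, only matches the run of uppercase letters sitting directly after FORE, which is why it returns NAME. A minimal sketch of another option is to match the whole marker and capture the rest with re.search. The .strip() call is my own assumption that the surrounding whitespace is unwanted (the re.sub version above keeps the space after FORENAME), and returning None when the marker is missing is likewise just one possible choice:

```python
import re

test_str = 'UNIQUE NUMBER 12345 678910 11 FROM THIS DOCUMENT | . ISSUED ON 2011-04-04 FORENAME GUIDO \\ SURNAME VAN ROSSUM. '

# Match the full marker, skip the whitespace after it, capture the remainder.
match = re.search(r"FORENAME\s+(.*)", test_str)

# Assumption: trim surrounding whitespace; fall back to None if there is no match.
result = match.group(1).strip() if match else None
print(result)
```

Unlike re.sub, which returns the input unchanged when the pattern does not match, re.search makes the "marker not found" case explicit.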
You don't need regex for such a simple problem:
test_str = 'UNIQUE NUMBER 12345 678910 11 FROM THIS DOCUMENT | . ISSUED ON 2011-04-04 FORENAME GUIDO \\ SURNAME VAN ROSSUM. '
pos = test_str.find("FORENAME") + len("FORENAME")
print(test_str[pos:])
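One caveat with find: it returns -1 when the substring is absent, so the slice above would silently start at index 7 instead of failing. A hedged sketch using str.partition handles that case. The helper name text_after is hypothetical, and stripping whitespace and returning None for a missing marker are assumptions about what you want:

```python
def text_after(text, marker):
    """Return the text following the first occurrence of marker, or None."""
    # partition returns (head, separator, tail); separator is '' if marker is absent.
    _, sep, tail = text.partition(marker)
    # Assumption: surrounding whitespace is not wanted in the result.
    return tail.strip() if sep else None


test_str = 'UNIQUE NUMBER 12345 678910 11 FROM THIS DOCUMENT | . ISSUED ON 2011-04-04 FORENAME GUIDO \\ SURNAME VAN ROSSUM. '
print(text_after(test_str, "FORENAME"))
```

Calling text_after on a string without the marker returns None rather than an arbitrary slice, which is easier to check for when looping over many scanned documents.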