I have to parse a PDF document, and I'm using PyPDF2 together with the re (regex) module.
The file includes several lines like the one below:
18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40
I need to extract from this line the text between the time and the amount:
PEDMILANO OVEST- BINASCOA
The following code works, but sometimes it doesn't find anything, because there can be a digit among these characters, for example:
18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40
regex = re.compile(r'\d\d-\d\d-\d\d\d\d\d\d:\d\d:\d\d\D+\d+,\d\d')
Is there a way to include a number in this regular expression?
The following should simplify the current regex:
import re
s = '18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40'
re.search(r'\:\d+([A-Z].*?)(?=\d+\,\d+$)', s).group(1)
# 'PEDMILANO OVE3ST- BINASCOA'
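To see why the question's pattern misses the second line while the pattern above does not, here is a small check against both sample lines from the question. It is only a sketch; the variable names and the extract helper are made up for illustration.

```python
import re

samples = [
    '18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40',
    '18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40',  # digit inside the text
]

# Pattern from the question: \D+ cannot cross the digit in "OVE3ST".
question_re = re.compile(r'\d\d-\d\d-\d\d\d\d\d\d:\d\d:\d\d\D+\d+,\d\d')

# Pattern from this answer: lazy .*? plus a lookahead for the trailing amount.
answer_re = re.compile(r'\:\d+([A-Z].*?)(?=\d+\,\d+$)')


def extract(line):
    """Return the captured text, or None when the line does not match."""
    m = answer_re.search(line)
    return m.group(1) if m else None


question_hits = [question_re.search(s) is not None for s in samples]
answer_texts = [extract(s) for s in samples]

print(question_hits)
print(answer_texts)
```

The lazy .*? is allowed to run over digits, and the lookahead only stops it once the rest of the line is exactly the amount.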
Regex breakdown for \:\d+([A-Z].*?)(?=\d+\,\d+$)
\: - matches the character : literally
\d+ - matches a digit (equal to [0-9]) one or more times, as many times as possible, giving back as needed (greedy)
([A-Z].*?) - Group 1:
[A-Z] - a single character in the range between A (index 65) and Z (index 90) (case sensitive)
.*? - matches any character (except for line terminators) zero or more times, as few times as possible, expanding as needed (lazy)
(?=\d+\,\d+$) - positive lookahead: the following must come next, but is not part of the match:
\d+ - matches a digit (equal to [0-9]) one or more times (greedy)
\, - matches the character , literally
\d+ - matches a digit (equal to [0-9]) one or more times (greedy)
$ - asserts position at the end of the string (or of a line in multiline mode)
I suggest using
import re
text = "18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40"
print( re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', r'\1', text) )
It can also be written as
re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}|\d+(?:,\d+)?$', '', text)
Or, if you prefer matching and capturing:
m = re.search(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', text)
if m:
    print( m.group(1) )
With this solution, your data may start with any character and may contain any characters (excluding line break characters, since your data is on single lines).
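Since the question says the file contains several such lines, the same pattern can be run over a whole block of text with re.MULTILINE, so that ^ and $ anchor at every line. A minimal sketch, assuming the text has already been extracted from the PDF into a string (page_text is a made-up sample, not real PyPDF2 output):

```python
import re

# Hypothetical sample standing in for text already extracted from the PDF.
page_text = (
    "18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40\n"
    "18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40\n"
    "some unrelated header line\n"
)

# Same pattern as above; re.MULTILINE makes ^ and $ match at each line.
pattern = re.compile(
    r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$',
    re.MULTILINE,
)

# Collect group 1 from every line that matches; other lines are skipped.
descriptions = [m.group(1) for m in pattern.finditer(page_text)]
print(descriptions)
```

Lines that do not start with the date and time are simply ignored, so headers and other noise in the extracted text do not need to be filtered out first.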
Regex details
^ - start of string
\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2} - datetime string: two digits, -, two digits, -, five or six digits, :, two digits, :, two digits
(.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
\d+(?:,\d+)? - an int/float value pattern: 1+ digits followed with an optional sequence of , and 1+ digits
$ - end of string.
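If the date, time and amount are needed as well, the parts described above can be given named groups. This is only a sketch: splitting the five or six digits into a four-digit year followed by a one- or two-digit hour is my assumption about the data, and the group names are made up.

```python
import re

line_re = re.compile(
    r'^(?P<date>\d{2}-\d{2}-\d{4})'   # assumed layout: day-month-year
    r'(?P<time>\d{1,2}:\d{2}:\d{2})'  # assumed: the hour has 1 or 2 digits
    r'(?P<text>.*?)'                  # as few characters as possible
    r'(?P<amount>\d+(?:,\d+)?)$'      # trailing int/float with a comma
)

m = line_re.search('18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40')
fields = m.groupdict() if m else {}
print(fields)
```

Note that a text part ending in a digit cannot be told apart from the start of the amount, because the lazy group gives every trailing digit to the amount.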