
Regex include only one digit between chars

I have to parse a PDF document, and I'm using PyPDF2 together with the re (regex) module.

The file includes several lines like the one below:

18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40

I need to extract the text between the time and the amount from this line:

PEDMILANO OVEST- BINASCOA

The following code works, but sometimes it doesn't find anything because a digit can appear inside that text, for example 18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40.

regex = re.compile(r'\d\d-\d\d-\d\d\d\d\d\d:\d\d:\d\d\D+\d+,\d\d')

Is there a way to include a number in this regular expression?

The following should simplify the current regex:

import re

s = '18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40'

re.search(r'\:\d+([A-Z].*?)(?=\d+\,\d+$)', s).group(1)
# 'PEDMILANO OVE3ST- BINASCOA'

See demo

  • \:\d+([A-Z].*?)(?=\d+\,\d+$)

    • \: matches the character : literally
    • \d+ matches one or more digits ( \d is equal to [0-9] ; the + quantifier is greedy, matching as many times as possible and giving back as needed)
    • 1st Capturing Group ([A-Z].*?)
      • [A-Z] matches a single character in the range between A (index 65) and Z (index 90) (case sensitive)
      • .*? matches any character (except for line terminators) between zero and unlimited times, as few times as possible, expanding as needed (lazy)
    • Positive Lookahead (?=\d+\,\d+$) asserts that the pattern below matches at the current position, without consuming it
      • \d+ matches one or more digits (greedy)
      • \, matches the character , literally
      • \d+ matches one or more digits (greedy)
      • $ asserts position at the end of the string (or just before a trailing newline)
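
To see why the lazy group bounded by a lookahead tolerates a stray digit where \D+ does not, the two patterns can be run side by side. This is only a minimal sketch: the names `original`, `relaxed`, `clean` and `noisy` are illustrative and do not come from the question.

```python
import re

# Pattern from the question: \D+ cannot cross a digit inside the text.
original = re.compile(r'\d\d-\d\d-\d\d\d\d\d\d:\d\d:\d\d\D+\d+,\d\d')
# Pattern from this answer: a lazy group bounded by a lookahead for the amount.
relaxed = re.compile(r':\d+([A-Z].*?)(?=\d+,\d+$)')

clean = '18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40'
noisy = '18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40'

for line in (clean, noisy):
    old = original.search(line)
    new = relaxed.search(line)
    # Show whether the old pattern matched, and what the new one captured.
    print(bool(old), new.group(1) if new else None)
```

The question's pattern matches only the line without the extra digit, while the relaxed pattern captures the text from both lines.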

I suggest using

import re
text = "18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40"
print( re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', r'\1', text) )

It can also be written as

re.sub(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}|\d+(?:,\d+)?$', '', text)

Or, if you prefer matching and capturing:

m = re.search(r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$', text)
if m:
    print( m.group(1) )

See an online Python demo. With this solution, the extracted text may start with any character and may contain any characters except line breaks, which is fine since each record sits on a single line.
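
Since the file contains several such lines, the same pattern could probably be applied to a whole page of text at once with the re.MULTILINE flag. The sketch below assumes the extracted text arrives as one string with a record per line; `page_text`, `line_re` and `descriptions` are hypothetical names, and the sample string merely stands in for whatever PyPDF2 returns.

```python
import re

# Hypothetical page text; in practice this would come from the PDF extraction step.
page_text = (
    "18-02-202010:44:48PEDMILANO OVEST- BINASCOA1,40\n"
    "18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40\n"
)

line_re = re.compile(
    r'^\d{2}-\d{2}-\d{5,6}:\d{2}:\d{2}(.*?)\d+(?:,\d+)?$',
    re.MULTILINE,  # let ^ and $ anchor at every line, not only at the whole string
)

# Collect the captured text of every line that matches.
descriptions = [m.group(1) for m in line_re.finditer(page_text)]
print(descriptions)
```

Because . does not match line breaks by default, the lazy group cannot run over into the next record.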

Regex details

  • ^ - start of string
  • \d{2}-\d{2}-\d{5,6}:\d{2}:\d{2} - datetime string: two digits, - , two digits, - , five or six digits, : , two digits, : , two digits
  • (.*?) - Group 1: any zero or more chars other than line break chars, as few as possible
  • \d+(?:,\d+)? - an int/float value pattern: 1+ digits followed by an optional sequence of , and 1+ digits
  • $ - end of string.

See the regex demo.
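
If the date, time and amount are also needed, the same pattern might be written with named groups in re.VERBOSE mode. This is a sketch under an assumption the answer does not state: that the five or six digits are a 4-digit year immediately followed by a 1- or 2-digit hour. The group names `date`, `time`, `text` and `amount` are illustrative.

```python
import re

line_re = re.compile(r'''
    ^(?P<date>\d{2}-\d{2}-\d{4})     # assumed: day-month-year with a 4-digit year
    (?P<time>\d{1,2}:\d{2}:\d{2})    # assumed: 1- or 2-digit hour, then minutes and seconds
    (?P<text>.*?)                    # as few characters as possible
    (?P<amount>\d+(?:,\d+)?)$        # digits with an optional comma part, at end of string
''', re.VERBOSE)

m = line_re.search('18-02-202010:44:48PEDMILANO OVE3ST- BINASCOA1,40')
if m:
    print(m.groupdict())
```

The verbose layout keeps each part of the record next to its comment, which may help if the line format changes later.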
