
Python | Regex | get numbers from the text

I have text of the form

Refer to Annex 1.1, 1.2 and 2.0 containing information etc,

or

Refer to Annex 1.0.1, 1.1.1 containing information etc,

I need to extract the numbers that the Annex refers to. I tried a lookbehind regex, as shown below.

m = re.search("(?<=Annex)\s*[\d+.\d+,]+", text)

print(m)
>>> <re.Match object; span=(11, 15), match=' 1.1'>

The output is just 1.1, without the remaining numbers. How do I get all the numbers that follow the keyword Annex?

You can use the following two-step solution:

import re

texts = [
    'Refer to Annex 1.1, 1.2 and 2.0 containing information etc,',
    'Refer to Annex 1.0.1, 1.1.1 containing information etc,',
]
# Step 1: capture the whole run of numbers that follows "Annex"
rx = re.compile(r'Annex\s*(\d+(?:(?:\W|and)+\d)*)')
for text in texts:
    match = rx.search(text)
    if match:
        # Step 2: pull each dot-separated number out of the captured run
        print(re.findall(r'\d+(?:\.\d+)*', match.group(1)))

See the Python demo and the regex demo. The output is:

['1.1', '1.2', '2.0']
['1.0.1', '1.1.1']

The Annex\s*(\d+(?:(?:\W|and)+\d)*) regex matches

  • Annex - the string Annex
  • \s* - zero or more whitespace characters
  • (\d+(?:(?:\W|and)+\d)*) - Group 1: one or more digits, followed by zero or more repetitions of a separator (one or more non-word chars or "and" strings) and then a single digit.
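
To illustrate the breakdown above, here is a small sketch that prints what Group 1 holds for the two sample sentences, before any further splitting. The pattern is the one from the answer, unchanged; the names samples and captured are only for illustration.

```python
import re

# Same first-step pattern as in the answer.
rx = re.compile(r'Annex\s*(\d+(?:(?:\W|and)+\d)*)')

samples = [
    'Refer to Annex 1.1, 1.2 and 2.0 containing information etc,',
    'Refer to Annex 1.0.1, 1.1.1 containing information etc,',
]

# Group 1 is the whole run of numbers, separators still included.
captured = [rx.search(s).group(1) for s in samples]
for c in captured:
    print(repr(c))
# '1.1, 1.2 and 2.0'
# '1.0.1, 1.1.1'
```

The capture stops before " containing" because the trailing space is not followed by a digit, so the last repetition cannot complete.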

Then, when the match is found, all dot-separated digit sequences are extracted with \d+(?:\.\d+)* .
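
One caveat worth checking against your data: the repeated part of the first pattern ends in a single \d, so a number whose later component has several digits (for example 1.10) is cut short, since the second digit is not preceded by a separator. Below is a minimal sketch of a variant that uses \d+ in that position. The helper name extract_annex_numbers and the multi-digit test sentence are assumptions made for illustration and are not part of the original answer.

```python
import re

# Pattern from the answer (single \d inside the repeated group).
ORIGINAL_RX = re.compile(r'Annex\s*(\d+(?:(?:\W|and)+\d)*)')
# Variant: \d+ inside the repeated group, so multi-digit parts survive.
ANNEX_RX = re.compile(r'Annex\s*(\d+(?:(?:\W|and)+\d+)*)')
NUMBER_RX = re.compile(r'\d+(?:\.\d+)*')


def extract_annex_numbers(text, rx=ANNEX_RX):
    """Return the dot-separated numbers that follow 'Annex', or [] if none."""
    match = rx.search(text)
    if not match:
        return []
    return NUMBER_RX.findall(match.group(1))


# Hypothetical sentence with multi-digit components.
sample = 'Refer to Annex 1.10, 2.12 and 10.0 containing information etc,'
print(extract_annex_numbers(sample, ORIGINAL_RX))  # ['1.1']
print(extract_annex_numbers(sample))               # ['1.10', '2.12', '10.0']
```

For the two sentences in the question, both patterns give the same result, so the variant only matters if multi-digit components can occur.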
