I'm looking for a way to extract job numbers from text using regular expressions in Python.

Question

If the text is 'Job 45, job 32 and then job 15', I'd like to get a result of either ['job 45', 'job 32', 'job 15'] or ['45', '32', '15'].

I tried r'[job]\d+', which returns an empty list.

re.findall(r'[job]\d+', 'Job 45, job 32 and then job 15'.lower())
[]

I experimented with splitting on job.

re.split(r'job','Job 45, job 32 and then job 15'.lower())
['', ' 45, ', ' 32 and then ', ' 15']

I tried splitting on words.

re.findall(r'\w+','Job 45, job 32 and then job 15'.lower())
['job', '45', 'job', '32', 'and', 'then', 'job', '15']

This is workable: I can check whether an element is 'job' and whether the following element can be converted to a number.

What regular expression would return either ['job 45', 'job 32', 'job 15'] or ['45', '32', '15'] from 'Job 45, job 32 and then job 15'?
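For reference, a minimal sketch of that token-based workaround. The helper name job_numbers_from_tokens is made up for illustration, and str.isdigit() stands in for "can be converted to a number":

```python
import re

def job_numbers_from_tokens(text):
    """Return the digit tokens that directly follow the word 'job' (hypothetical helper)."""
    tokens = re.findall(r'\w+', text.lower())
    numbers = []
    # Pair each token with its successor and keep digit tokens that follow 'job'.
    for current, following in zip(tokens, tokens[1:]):
        if current == 'job' and following.isdigit():
            numbers.append(following)
    return numbers

print(job_numbers_from_tokens('Job 45, job 32 and then job 15'))  # ['45', '32', '15']
```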

Answer 1

Your regex [job]\d+ has a couple of problems:

First, [job] is a character set, so it matches only a single character: j, o or b.

Second, the regex makes no provision for the space between job and the number.

Third, the input text contains Job as well as job, so you need the (?i) flag to make the match case-insensitive (lowercasing the input first, as you did, also works but changes the text that is returned).
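Each of these points can be checked in isolation. A small sketch, using the sample text from the question plus one made-up string ('b45 job 32') to show what the character set actually matches:

```python
import re

text = 'Job 45, job 32 and then job 15'

# 1. [job] matches ONE character, so a digit must directly follow j, o or b.
single_char = re.findall(r'[job]\d+', 'b45 job 32')
print(single_char)   # ['b45']

# 2. Without \s+ the space between "job" and the number blocks the match.
no_space = re.findall(r'job\d+', text.lower())
print(no_space)      # []

# 3. Without (?i) the capitalised "Job 45" is skipped.
case_sensitive = re.findall(r'job\s+\d+', text)
print(case_sensitive)  # ['job 32', 'job 15']
```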

So the corrected regex becomes:

(?i)job\s+\d+

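Sample Python code:

import re
s = 'Job 45, job 32 and then job 15'
# Raw string avoids invalid-escape warnings; "matches" avoids shadowing the built-in str
matches = re.findall(r'(?i)job\s+\d+', s)
print(matches)

This gives the following output: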

['Job 45', 'job 32', 'job 15']

Answer 2

Or, more simply, use the expression 'job (\d+)':

>>> re.findall(r'job (\d+)', s.lower())
['45', '32', '15']
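This works because re.findall returns only the group contents when the pattern has exactly one capturing group. A small sketch of the same idea that skips the lower() call by passing re.IGNORECASE and converts the matches to integers (the conversion is an addition for illustration, not part of the question):

```python
import re

s = 'Job 45, job 32 and then job 15'

# With a single capturing group, findall returns just the captured digits.
numbers = [int(n) for n in re.findall(r'job (\d+)', s, flags=re.IGNORECASE)]
print(numbers)  # [45, 32, 15]
```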

Answer 3

One approach would be to use the following pattern, which uses a positive lookbehind:

(?<=\bjob )\d+

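This matches any run of digits immediately preceded by the word job and a single space. The lookbehind is not part of the match, so only the digits are returned, and the re.I flag passed to findall makes the match case-insensitive.

import re
text = "Job 45, job 32 and then job 15"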
res = re.findall(r'(?<=\bjob )\d+', text, re.I)
print(res)

['45', '32', '15']
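One caveat: Python's built-in re module only accepts fixed-width lookbehinds, so this pattern cannot be loosened to (?<=\bjob\s+) to allow several spaces or a tab. A rough sketch of that limitation, with a capturing group as one possible alternative (the irregularly spaced sample text is made up for illustration):

```python
import re

text = "Job 45,  job   32 and then job\t15"

# The standard re module rejects variable-width lookbehinds at compile time.
try:
    re.compile(r'(?<=\bjob\s+)\d+')
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# A capturing group puts no width restriction on the "job" prefix.
flexible = re.findall(r'\bjob\s+(\d+)', text, re.I)
print(lookbehind_ok, flexible)
```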