Use regular expression to extract numbers before specific words

Question

Goal

Extract the number that appears before the word hours, hour, day, or days.

How do I use | to match any of these words?

s = '2 Approximately 5.1 hours 100 ays 1 s'
re.findall(r"([\d.+-/]+)\s*[days|hours]", s) # note: I do not know in advance whether s contains hours or days

returns

['5.1', '100', '1']

Since 100 and 1 do not come directly before the exact word hours, they should not appear. Expected:

5.1
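
For reference, a minimal sketch of why the attempt above over-matches: square brackets define a character class, so `[days|hours]` matches a single character rather than a whole word. The simplified number pattern and the trailing `\b` here are assumptions made for illustration, not part of the original attempt.

```python
import re

s = '2 Approximately 5.1 hours 100 ays 1 s'

# [days|hours] is a character class: it matches ONE character out of
# d, a, y, s, |, h, o, u, r, so "ays" and "s" qualify as well.
char_class = re.findall(r"(\d+(?:\.\d+)?)\s*[days|hours]", s)

# A non-capturing group with | matches one of the whole words instead.
alternation = re.findall(r"(\d+(?:\.\d+)?)\s*(?:days?|hours?)\b", s)

print(char_class)   # ['5.1', '100', '1']
print(alternation)  # ['5.1']
```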

How do I extract only the first number from the matched result?

s1 = '2 Approximately 10.2 +/- 30hours'
re.findall(r"([\d. +-/]+)\s*hours|\s*hours", s1)

returns

[' 10.2 +/- 30']

Expected:

10.2

Note that the special characters +, /, - and . are optional. When a . appears inside a number, such as 1.3, the full value 1.3 should be returned, including the dot. But when the text is 1 +/- 0.5, only 1 should be extracted, and nothing from the +/- part.

I know I could probably do a split and then take the first number:

str(re.findall(r"([\d. +-/]+)\s*hours", s1)[0]).split(" ")[1]

which gives

'10.2'

But some of the matches contain only one number, so the split would raise an error. Should I handle this in a separate step, or can it be done in one step?

Please note that strings such as s and s1 are values in a dataframe, so the solution will have to be applied to each value with something like apply and a lambda.
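
One way this could look in a single step is sketched below. The helper name `first_number_before_unit`, the unit list, and the set of characters allowed between the number and the unit are all assumptions made for illustration. `re.search` returns `None` when nothing matches, which avoids the indexing error that the split approach can raise.

```python
import re

# Hypothetical pattern: first number, then any run of whitespace, digits,
# '.', '+', '/', '-', then one of the unit words.
_FIRST_NUMBER = re.compile(r"(\d+(?:\.\d+)?)[\s\d.+/-]*(?:hours?|days?)\b")

def first_number_before_unit(text):
    """Return the first number that precedes a unit, or None if nothing matches."""
    match = _FIRST_NUMBER.search(text)
    return match.group(1) if match else None

values = [
    '2 Approximately 5.1 hours 100 ays 1 s',
    '2 Approximately 10.2 +/- 30hours',
    '1 +/- 0.5 hours',
    'no duration given',
]
results = [first_number_before_unit(v) for v in values]
print(results)  # ['5.1', '10.2', '1', None]
```

With pandas, the same function could presumably be passed to a column, for example `df['col'].apply(first_number_before_unit)`, where `'col'` is a placeholder column name.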

Answer 1

In fact, I would use re.findall here:

import re

units = ["hours", "hour", "days", "day"]   # the order matters here: put plurals first
regex = r'(?:' + '|'.join(units) + r')'    # non-capturing alternation of the units
s = '2 Approximately 5.1 hours 100 ays 1 s'
values = re.findall(r'\b(\d+(?:\.\d+)?)\s+' + regex, s)
print(values)  # prints ['5.1']

If you also want to capture the units being used, make the units alternation a capturing group, i.e. use:

regex = r'(' + '|'.join(units) + r')'

Then the output would be:

[('5.1', 'hours')]
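
As a small sketch of how the tuples from the capturing version might be consumed, the example below converts each value to a float and keeps its unit. The sample string and the trailing `\b` are assumptions added for illustration, not part of the answer above.

```python
import re

units = ["hours", "hour", "days", "day"]  # plurals first, as in the answer
# Two capturing groups, so findall returns (value, unit) tuples.
pattern = re.compile(r'\b(\d+(?:\.\d+)?)\s+(' + '|'.join(units) + r')\b')

s = '12 hours, then 1.5 days, then 3 ays'
pairs = [(float(value), unit) for value, unit in pattern.findall(s)]
print(pairs)  # [(12.0, 'hours'), (1.5, 'days')]
```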

Answer 2

Code

import re
units = '|'.join(["hours", "hour", "hrs", "days", "day", "minutes", "minute", "min"])  # possible units
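number = r'\d+[.,]?\d*'                             # pattern for number (raw string avoids invalid-escape warnings)
plus_minus = r'\+\/\-'                              # plus minus (defined for reference; not used in the pattern below)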

cases = fr'({number})(?:[\s\d\-\+\/]*)(?:{units})'

pattern = re.compile(cases)

Tests

print(pattern.findall('2 Approximately 5.1 hours 100 ays 1 s'))   
# Output: ['5.1']

print(pattern.findall('2 Approximately 10.2 +/- 30hours'))        
# Output: ['10.2']

print(pattern.findall('The mean half-life for Cetuximab is 114 hours (range 75-188 hours).'))        
# Output: ['114', '75']

print(pattern.findall('102 +/- 30 hours in individuals with rheumatoid arthritis and 68 hours in healthy adults.'))        
# Output: ['102', '68']

print(pattern.findall("102 +/- 30 hrs"))                          
# Output: ['102']

print(pattern.findall("102-130 hrs"))                             
# Output: ['102']

print(pattern.findall("102hrs"))                                  
# Output: ['102']

print(pattern.findall("102 hours"))                               
# Output: ['102']

Explanation

The code above relies on the fact that raw strings (r'...') and f-string interpolation (f'...') can be combined into:

fr'...'

per PEP 498
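
A short sketch of what the combined prefix does in practice: names inside single braces are interpolated while backslashes stay literal. One consequence worth keeping in mind is that literal regex braces, such as a `{1,3}` quantifier, have to be doubled inside an f-string. The `bounded` pattern is a made-up example and not part of the answer.

```python
import re

units = '|'.join(["hours", "hour"])

# {units} is interpolated; \d and \s are left untouched by the raw prefix.
pattern = fr'(\d+)\s*(?:{units})'
print(pattern)  # (\d+)\s*(?:hours|hour)

# Literal braces must be written as {{ and }} inside an f-string.
bounded = fr'(\d{{1,3}})\s*(?:{units})'
print(bounded)  # (\d{1,3})\s*(?:hours|hour)

found = re.findall(bounded, '72 hours')
print(found)  # ['72']
```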

The cases string:

fr'({number})(?:[\s\d\-\+\/]*)(?:{units})'

Its parts, in sequence:

fr'({number})' - capturing group, expands to '(\d+[.,]?\d*)', for integers or floats
r'(?:[\s\d\-\+\/]*)' - non-capturing group for the characters allowed between the number and the units (i.e. whitespace, digits, -, +, /)
fr'(?:{units})' - non-capturing group for the units
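
Two edge cases may be worth checking against your data. Since `.` is not among the characters allowed between the number and the units, an input like `1 +/- 0.5 hours` yields `0.5` rather than `1`, and without a word boundary a unit such as `min` also matches the start of `minimum`. The `variant` pattern below is a possible adjustment, offered as an assumption rather than as part of the answer.

```python
import re

units = '|'.join(["hours", "hour", "hrs", "days", "day", "minutes", "minute", "min"])
number = r'\d+[.,]?\d*'

# Pattern as given in the answer.
original = re.compile(fr'({number})(?:[\s\d\-\+\/]*)(?:{units})')
# Variation: also allow '.' between number and units, and require a word
# boundary after the unit.
variant = re.compile(fr'({number})(?:[\s\d.\-\+\/]*)(?:{units})\b')

print(original.findall('1 +/- 0.5 hours'))  # ['0.5']
print(variant.findall('1 +/- 0.5 hours'))   # ['1']
print(original.findall('take 3 minimum'))   # ['3']
print(variant.findall('take 3 minimum'))    # []
```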

solution1
2 2020-11-12 09:23:11

solution2
1 ACCEPTED 2020-11-12 10:47:37
