Extract the number before a specific word using a regular expression

Question

Goal

Extract the number that comes before the words hours, hour, day, or days.

How do I use | for matching?

s = '2 Approximately 5.1 hours 100 ays 1 s'
re.findall(r"([\d.+-/]+)\s*[days|hours]", s) # note I do not know whether string s contains hours or days

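returns

['5.1', '100', '1']

Since 100 and 1 do not come right before the exact word hours, they should not appear. Expected:

5.1

How to extract the first number from the matched result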

s1 = '2 Approximately 10.2 +/- 30hours'
re.findall(r"([\d. +-/]+)\s*hours|\s*hours", s1)

returns

[' 10.2 +/- 30']

Expected:

10.2

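Note that the special characters +/- and . are optional. When . appears, as in 1.3, the whole 1.3 must be extracted, including the dot. But when 1 +/- 0.5 occurs, only 1 should be extracted, and none of the +/- part.

I know I could probably split and then take the first number:

str(re.findall(r"([\d. +-/]+)\s*hours", s1)[0]).split(" ")[1]

gives

'10.2'

But some results return only a single number, so the split would raise an error. Should I do this in a separate step, or can it be done in one step?

Note that these strings s1, s2 are values in a DataFrame, so the solution will need to be applied row by row with something like apply and a lambda.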

Answer 1

I would actually use re.findall here:

import re
units = ["hours", "hour", "days", "day"]   # the order matters here: put plurals first
regex = r'(?:' + '|'.join(units) + r')'
s = '2 Approximately 5.1 hours 100 ays 1 s'
values = re.findall(r'\b(\d+(?:\.\d+)?)\s+' + regex, s)
print(values)  # prints ['5.1']
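
As a side-by-side sketch of why the pattern in the question over-matches: square brackets define a character class, so `[days|hours]` matches a single character rather than either word. The simplified number pattern and the trailing `\b` below are illustrative choices, not part of the original answer.

```python
import re

s = '2 Approximately 5.1 hours 100 ays 1 s'

# Square brackets form a character class: [days|hours] matches ONE character
# out of d, a, y, s, |, h, o, u, r, so "100 ays" and "1 s" also match.
char_class = re.findall(r'(\d+(?:\.\d+)?)\s*[days|hours]', s)

# A group with | tries whole words; "hours" is listed before "hour"
# so the longer word is tried first.
alternation = re.findall(r'(\d+(?:\.\d+)?)\s*(?:hours|hour|days|day)\b', s)

print(char_class)   # ['5.1', '100', '1']
print(alternation)  # ['5.1']
```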

If you also want to capture the unit being used, make the units alternation a capturing group, i.e. use:

regex = r'(' + '|'.join(units) + r')'

Then the output will be:

[('5.1', 'hours')]
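
The comment about ordering in the units list can be illustrated with a small sketch. Alternation takes the first alternative that lets the match succeed, so listing the singular first captures a truncated unit. The variable names here are illustrative only.

```python
import re

s = '2 Approximately 5.1 hours'
number = r'\b(\d+(?:\.\d+)?)\s+'

# Same alternatives, different order.
plural_first = number + r'(hours|hour|days|day)'
singular_first = number + r'(hour|hours|day|days)'

print(re.findall(plural_first, s))    # [('5.1', 'hours')]
print(re.findall(singular_first, s))  # [('5.1', 'hour')]
```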

Answer 2

Code

import re
units = '|'.join(["hours", "hour", "hrs", "days", "day", "minutes", "minute", "min"])  # possible units
number = r'\d+[.,]?\d*'                             # pattern for a number (raw string avoids invalid-escape warnings)
plus_minus = r'\+\/\-'                              # plus/minus sign (not used in the pattern below)

cases = fr'({number})(?:[\s\d\-\+\/]*)(?:{units})'

pattern = re.compile(cases)

Tests

print(pattern.findall('2 Approximately 5.1 hours 100 ays 1 s'))   
# Output: ['5.1']

print(pattern.findall('2 Approximately 10.2 +/- 30hours'))        
# Output: ['10.2']

print(pattern.findall('The mean half-life for Cetuximab is 114 hours (range 75-188 hours).'))        
# Output: ['114', '75']

print(pattern.findall('102 +/- 30 hours in individuals with rheumatoid arthritis and 68 hours in healthy adults.'))        
# Output: ['102', '68']

print(pattern.findall("102 +/- 30 hrs"))                          
# Output: ['102']

print(pattern.findall("102-130 hrs"))                             
# Output: ['102']

print(pattern.findall("102hrs"))                                  
# Output: ['102']

print(pattern.findall("102 hours"))                               
# Output: ['102']
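
Since the question asks for a single value per string and mentions that splitting fails when only one number is present, one possible wrapper is sketched below. The helper name `first_number` and the sample `rows` list are hypothetical, introduced only for illustration.

```python
import re
from typing import Optional

UNITS = '|'.join(["hours", "hour", "hrs", "days", "day", "minutes", "minute", "min"])
PATTERN = re.compile(fr'(\d+[.,]?\d*)(?:[\s\d\-\+\/]*)(?:{UNITS})')

def first_number(text: str) -> Optional[str]:
    """Return the first number found before a unit, or None if nothing matches."""
    match = PATTERN.search(text)
    return match.group(1) if match else None

# Hypothetical sample values standing in for the DataFrame column.
rows = ['2 Approximately 10.2 +/- 30hours', '102 hours', 'no duration given']
results = [first_number(row) for row in rows]
print(results)  # ['10.2', '102', None]
```

Assuming the DataFrame column holds strings, a function like this could presumably be passed to `apply` on that column; the column name would depend on your data.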

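Explanation

The above relies on the convenience that raw strings (r'...') and string interpolation (f'...') can be combined as:

fr'...'

per PEP 498.

The cases string:

fr'({number})(?:[\s\d\-\+\/]*)(?:{units})'

Parts, in order:

fr'({number})' - capturing group '(\d+[.,]?\d*)' for an integer or floating-point number
r'(?:[\s\d\-\+\/]*)' - non-capturing group for the characters allowed between the number and the unit (i.e. whitespace, +, -, digits, /)
fr'(?:{units})' - non-capturing group for the units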
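
A minimal sketch of what the fr prefix does, using shortened stand-ins for `units` and `number`: the placeholders are interpolated, while backslashes such as `\s` are left untouched.

```python
import re

units = 'hours|hour'
number = r'\d+'

# fr'...' interpolates {number} and {units} but leaves backslashes such as \s alone.
combined = fr'({number})\s*(?:{units})'
concatenated = '(' + number + r')\s*(?:' + units + ')'

print(combined == concatenated)           # True
print(re.findall(combined, '102 hours'))  # ['102']
```

One caveat with this approach: inside an f-string, literal braces must be doubled, so a regex quantifier such as `{2,3}` has to be written as `{{2,3}}`.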
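
The same three parts can also be written with `re.VERBOSE`, which lets the comments live inside the pattern. This is an alternative layout sketched for readability, not the form used in the answer; it should behave the same on the inputs shown.

```python
import re

units = '|'.join(["hours", "hour", "hrs", "days", "day", "minutes", "minute", "min"])
number = r'\d+[.,]?\d*'

# With re.VERBOSE, whitespace outside character classes and text after # are ignored.
verbose = re.compile(fr'''
    ({number})            # capturing group: integer or decimal number
    (?:[\s\d\-\+\/]*)     # allowed filler: whitespace, digits, -, +, /
    (?:{units})           # one of the units, not captured
''', re.VERBOSE)

print(verbose.findall('102 +/- 30 hrs'))      # ['102']
print(verbose.findall('range 75-188 hours'))  # ['75']
```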

Solution 1
2 2020-11-12 09:23:11

Solution 2
1 Accepted 2020-11-12 10:47:37
