
Is there a way to check for two different patterns in the same string with a Python regex?

I want to extract certain digits from a string. The problem is that the digits can appear in two different patterns. How can I write a regex for re.search so that both patterns are tried against a single string?

For example:

## extract 65.45 from this string
string = '1112 (65.45%)'

If I do the following, it works:

re.search(r'.*?\((.*)%\)', string).group(1)

and I get the expected result, 65.45.

Now, there is another kind of string in the same text that I need to handle.

## from this string, extract 4.00 which appears before [
string = '4.00 [3.00 - 4.50]'

re.search(r'^(\S+)\s\[.*', string).group(1)

gives me the desired result: 4.00

But if I combine them like this, it only extracts the value for the pattern that comes first:

re.search(r'^(\S+)\s\[.*|.*?\((.*)%\)', string).group(1)

In this case, only strings that contain the square bracket yield a value; strings with a % sign do not. How can I fix this?

For example, suppose I have a list of strings like the following:

['73 (1.40%)', '38 (1.55%)', '27 (2.17%)', '32 (1.46%)', '10 (1.46%)', '11 (1.04%)', '11 (1.41%)', '7 (1.34%)', '4 (1.24%)', '28 (1.27%)', '750 (14.41%)', '381 (15.54%)', '182 (14.60%)', '313 (14.27%)', '4.10 [3.73 - 4.45]', '4.08 [3.70 - 4.42]', '4.13 [3.77 - 4.47]', '4.13 [3.78 - 4.47]', '4.07 [3.70 - 4.42]', '4.07 [3.70 - 4.43]', '4.07 [3.70 - 4.40]', '4.09 [3.73 - 4.42]', '4.03 [3.63 - 4.40]', '4.10 [3.70 - 4.47]']

I want to process each extracted value and compare it with a specific threshold.

Using a for loop, I did something like this:

for val in string:
    match = re.search(r'^(\S+)\s\[.*|.*?\((.*)%\)', val)
    print(match)

which results in the following:

<re.Match object; span=(0, 10), match='73 (1.40%)'>
<re.Match object; span=(0, 10), match='38 (1.55%)'>
<re.Match object; span=(0, 10), match='27 (2.17%)'>
<re.Match object; span=(0, 10), match='32 (1.46%)'>
<re.Match object; span=(0, 10), match='10 (1.46%)'>
<re.Match object; span=(0, 10), match='11 (1.04%)'>
<re.Match object; span=(0, 10), match='11 (1.41%)'>
<re.Match object; span=(0, 9), match='7 (1.34%)'>
<re.Match object; span=(0, 9), match='4 (1.24%)'>
<re.Match object; span=(0, 10), match='28 (1.27%)'>
<re.Match object; span=(0, 12), match='750 (14.41%)'>
<re.Match object; span=(0, 12), match='381 (15.54%)'>
<re.Match object; span=(0, 12), match='182 (14.60%)'>
<re.Match object; span=(0, 12), match='313 (14.27%)'>
<re.Match object; span=(0, 18), match='4.10 [3.73 - 4.45]'>
<re.Match object; span=(0, 18), match='4.08 [3.70 - 4.42]'>
<re.Match object; span=(0, 18), match='4.13 [3.77 - 4.47]'>
<re.Match object; span=(0, 18), match='4.13 [3.78 - 4.47]'>
<re.Match object; span=(0, 18), match='4.07 [3.70 - 4.42]'>
<re.Match object; span=(0, 18), match='4.07 [3.70 - 4.43]'>
<re.Match object; span=(0, 18), match='4.07 [3.70 - 4.40]'>
<re.Match object; span=(0, 18), match='4.09 [3.73 - 4.42]'>
<re.Match object; span=(0, 18), match='4.03 [3.63 - 4.40]'>
<re.Match object; span=(0, 18), match='4.10 [3.70 - 4.47]'>

But I am not sure how to extract the exact value.

I have to call .group() to extract the value, but that requires me to know which group number to ask for, and I'm struggling to figure out how to do that.

If I use match.group(2), I get the following result:

1.40
1.55
2.17
1.46
1.46
1.04
1.41
1.34
1.24
1.27
14.41
15.54
14.60
14.27
None
None
None
None
None
None
None
None
None
None

Here is an approach that works with your exact input, assuming each list entry always matches one of the two patterns:

import re

inp = ['73 (1.40%)', '38 (1.55%)', '27 (2.17%)', '32 (1.46%)', '10 (1.46%)', '11 (1.04%)', '11 (1.41%)', '7 (1.34%)', '4 (1.24%)', '28 (1.27%)', '750 (14.41%)', '381 (15.54%)', '182 (14.60%)', '313 (14.27%)', '4.10 [3.73 - 4.45]', '4.08 [3.70 - 4.42]', '4.13 [3.77 - 4.47]', '4.13 [3.78 - 4.47]', '4.07 [3.70 - 4.42]', '4.07 [3.70 - 4.43]', '4.07 [3.70 - 4.40]', '4.09 [3.73 - 4.42]', '4.03 [3.63 - 4.40]', '4.10 [3.70 - 4.47]']
# Group 1: the percentage (with its % sign); group 2: the number before the square brackets.
matches = [re.findall(r'\b\d+ \((\d+(?:\.\d+)?%)\)|(\d+(?:\.\d+)?) \[\d+(?:\.\d+)? - \d+(?:\.\d+)?\]', x) for x in inp]
matches = [x[0][0] + x[0][1] for x in matches]
print(matches)

This prints:

['1.40%', '1.55%', '2.17%', '1.46%', '1.46%', '1.04%', '1.41%', '1.34%',
 '1.24%', '1.27%', '14.41%', '15.54%', '14.60%', '14.27%', '4.10', '4.08',
 '4.13', '4.13', '4.07', '4.07', '4.07', '4.09', '4.03', '4.10']

The strategy above is to capture, in two separate groups, either the percentage inside the parentheses (including its % sign) or the number before the square brackets. Then, in a list comprehension, we concatenate the two capture groups. Because re.findall returns an empty string for a group that did not participate in the match, one of the two groups is always empty, so the concatenation always yields the desired value.
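
If you would rather not rely on concatenating an empty group, a variation on the same idea is to give each alternative a named group and ask the match object which one participated. The following is a minimal sketch, not part of the original answer: the group names percent and value, the helper extract, and the shortened bracket pattern are all assumptions made for illustration, and the % sign is moved outside the capture group so the result converts directly to a float.

```python
import re

# Hypothetical pattern: one named group per alternative.
PATTERN = re.compile(
    r"\((?P<percent>\d+(?:\.\d+)?)%\)"      # e.g. "(1.40%)" -> percent = "1.40"
    r"|(?P<value>\d+(?:\.\d+)?)(?=\s*\[)"   # e.g. "4.10 ["  -> value   = "4.10"
)


def extract(text):
    """Return (kind, number) for the first match in text, or None if nothing matches."""
    match = PATTERN.search(text)
    if match is None:
        return None
    kind = match.lastgroup  # name of the group that participated in the match
    return kind, float(match.group(kind))


samples = ["73 (1.40%)", "750 (14.41%)", "4.10 [3.73 - 4.45]", "no numbers here"]
results = [extract(s) for s in samples]
for sample, result in zip(samples, results):
    print(sample, "->", result)
```

Knowing which alternative matched can be useful if percentages and plain values need different thresholds later on.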

I would just use a list of simple regexes and iterate through them for each string I want to test. The first regex that matches is used. I would also compile the regexes up front to save CPU cycles. This is easier to read and makes it easy to add new patterns:

import re

regexs = [
    re.compile(r".*?\((.*)%\)"), 
    re.compile(r"^(\S+)\s\[.*"),
]

data = [
    "73 (1.40%)",
    "38 (1.55%)",
    "27 (2.17%)",
    "750 (14.41%)",
    "381 (15.54%)",
    "4.10 [3.73 - 4.45]",
    "4.08 [3.70 - 4.42]",
    "4.13 [3.77 - 4.47]",
    "this shouldn't match"
]


for val in data:
    for regex in regexs:
        if match := regex.search(val):
            print("Matched: " + match.group(1))
            break
    else:
        print("No match: " + val)

Output:

Matched: 1.40
Matched: 1.55
Matched: 2.17
Matched: 14.41
Matched: 15.54
Matched: 4.10
Matched: 4.08
Matched: 4.13
No match: this shouldn't match
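
The same loop can be folded into a small helper that returns a number ready for comparison. This is a sketch under assumptions: the function name first_match_as_float is hypothetical, and it assumes the captured text is a plain decimal number, returning None otherwise.

```python
import re

# Same two patterns as above, tried in order.
REGEXES = [
    re.compile(r".*?\((.*)%\)"),
    re.compile(r"^(\S+)\s\[.*"),
]


def first_match_as_float(text):
    """Return group 1 of the first pattern that matches, as a float, else None."""
    for regex in REGEXES:
        match = regex.search(text)
        if match:
            try:
                return float(match.group(1))
            except ValueError:
                return None  # matched, but the capture is not a plain number
    return None


values = [
    first_match_as_float(s)
    for s in ("73 (1.40%)", "4.10 [3.73 - 4.45]", "this shouldn't match")
]
print(values)
```

Returning None for both "no pattern matched" and "capture was not numeric" keeps the caller simple; split the two cases if you need to tell them apart.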

.group() returns capture groups by number, so .group(1) always returns the first capture group, even when that group did not participate in the match (in which case it is None).

To get the other capture group, use .group(2).
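
Because only one of the two groups participates in any given match, you can pick whichever is not None instead of hard-coding the group number. A minimal sketch using the combined pattern from the question (the helper name extract_either is hypothetical):

```python
import re

# Combined pattern from the question: group 1 is the value before "[",
# group 2 is the percentage inside the parentheses.
COMBINED = re.compile(r"^(\S+)\s\[.*|.*?\((.*)%\)")


def extract_either(text):
    """Return the text of whichever capture group participated, or None."""
    match = COMBINED.search(text)
    if match is None:
        return None
    # groups() yields None for groups that did not take part in the match.
    return next((g for g in match.groups() if g is not None), None)


print(extract_either("1112 (65.45%)"))
print(extract_either("4.00 [3.00 - 4.50]"))
```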

Another option is to use lookarounds so that the match itself is the value, with no capture groups:

(?<=\()\d+(?:\.\d+)?(?=%\))|\d+(?:\.\d+)?(?=\s*\[[^][]*])

The pattern matches:

  • (?<=\() Positive lookbehind, assert ( to the left
  • \d+(?:\.\d+)? Match 1+ digits with an optional decimal part
  • (?=%\)) Positive lookahead, assert %) to the right
  • | Or
  • \d+(?:\.\d+)? Match 1+ digits with an optional decimal part
  • (?=\s*\[[^][]*]) Positive lookahead, assert optional whitespace followed by an opening through closing square bracket to the right (you could make it more specific by specifying the exact format between the square brackets)


import re

pattern = r"(?<=\()\d+(?:\.\d+)?(?=%\))|\d+(?:\.\d+)?\b(?=\s*\[[^][]*\])"
strings = ['73 (1.40%)', '38 (1.55%)', '27 (2.17%)', '32 (1.46%)', '10 (1.46%)', '11 (1.04%)', '11 (1.41%)', '7 (1.34%)', '4 (1.24%)', '28 (1.27%)', '750 (14.41%)', '381 (15.54%)', '182 (14.60%)', '313 (14.27%)', '4.10 [3.73 - 4.45]', '4.08 [3.70 - 4.42]', '4.13 [3.77 - 4.47]', '4.13 [3.78 - 4.47]', '4.07 [3.70 - 4.42]', '4.07 [3.70 - 4.43]', '4.07 [3.70 - 4.40]', '4.09 [3.73 - 4.42]', '4.03 [3.63 - 4.40]', '4.10 [3.70 - 4.47]']
for val in strings:
    match = re.search(pattern, val)
    print(match)

Output:

<re.Match object; span=(4, 8), match='1.40'>
<re.Match object; span=(4, 8), match='1.55'>
<re.Match object; span=(4, 8), match='2.17'>
<re.Match object; span=(4, 8), match='1.46'>
<re.Match object; span=(4, 8), match='1.46'>
<re.Match object; span=(4, 8), match='1.04'>
<re.Match object; span=(4, 8), match='1.41'>
<re.Match object; span=(3, 7), match='1.34'>
<re.Match object; span=(3, 7), match='1.24'>
<re.Match object; span=(4, 8), match='1.27'>
<re.Match object; span=(5, 10), match='14.41'>
<re.Match object; span=(5, 10), match='15.54'>
<re.Match object; span=(5, 10), match='14.60'>
<re.Match object; span=(5, 10), match='14.27'>
<re.Match object; span=(0, 4), match='4.10'>
<re.Match object; span=(0, 4), match='4.08'>
<re.Match object; span=(0, 4), match='4.13'>
<re.Match object; span=(0, 4), match='4.13'>
<re.Match object; span=(0, 4), match='4.07'>
<re.Match object; span=(0, 4), match='4.07'>
<re.Match object; span=(0, 4), match='4.07'>
<re.Match object; span=(0, 4), match='4.09'>
<re.Match object; span=(0, 4), match='4.03'>
<re.Match object; span=(0, 4), match='4.10'>
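
Since the lookaround pattern has no capture groups, the value is the whole match, available as match.group() (equivalently match.group(0)). Here is a short sketch of the threshold comparison mentioned in the question; the threshold of 4.05 and the shortened sample list are made up for illustration.

```python
import re

pattern = re.compile(r"(?<=\()\d+(?:\.\d+)?(?=%\))|\d+(?:\.\d+)?\b(?=\s*\[[^][]*\])")

THRESHOLD = 4.05  # hypothetical cut-off, not from the question
strings = ["73 (1.40%)", "750 (14.41%)", "4.10 [3.73 - 4.45]", "4.03 [3.63 - 4.40]"]

above = []
for val in strings:
    match = pattern.search(val)
    # match.group() with no argument returns the entire matched text.
    if match and float(match.group()) > THRESHOLD:
        above.append(float(match.group()))

print(above)
```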
