Matching multiple substrings with findall from the re library

Question

I have a large array in Python that contains strings in the following format:

some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
              'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
              'ART_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE']

I just need to extract the substrings that start with MATH, SCIENCE and ART. This is what I'm currently using:

import re

# re.findall expects a string, so each element is scanned separately
for line in some_array:
    my_str = re.findall('MATH_.*? ', line)

    if len(my_str) > 0:
        print(my_str)

    my_str = re.findall('SCIENCE_.*? ', line)

    if len(my_str) != 0:
        print(my_str)

    my_str = re.findall('ART_.*? ', line)

    if len(my_str) > 0:
        print(my_str)

It seems to work, but I was wondering whether findall can look for more than one substring in the same line, or whether there is a cleaner way of doing it with another function. Thanks.

Answer 1

You can use | to match any of several alternative strings in a regular expression.

re.findall('(?:MATH|SCIENCE|ART)_.*? ', ...)
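
As a minimal sketch of the alternation applied to the question's data: this assumes each list element is scanned on its own, since re.findall takes a single string, and the helper name extract_prefixed is made up for illustration.

```python
import re

some_array = [
    'MATH_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
    'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
    'ART_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
]

# One pattern covers all three prefixes. The group is non-capturing (?:...),
# so findall returns the whole match rather than only the group's text.
PATTERN = re.compile(r'(?:MATH|SCIENCE|ART)_.*? ')


def extract_prefixed(strings):
    """Hypothetical helper: collect the matches from every string into one list."""
    matches = []
    for s in strings:
        matches.extend(PATTERN.findall(s))
    return matches


print(extract_prefixed(some_array))
# ['MATH_SOME_TEXT_AND_NUMBER ', 'SCIENCE_SOME_TEXT_AND_NUMBER ', 'ART_SOME_TEXT_AND_NUMBER ']
```

Each match keeps its trailing space because the pattern ends with a literal space; call .strip() on the results if that is unwanted.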

You could also use str.startswith along with a list comprehension.

res = [x for x in some_array if any(x.startswith(prefix) 
          for prefix in ('MATH', 'SCIENCE', 'ART'))]
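
Note that the comprehension above keeps the whole matching string rather than just the leading substring. A possible variation, sketched under the assumption that the wanted substring is everything before the first whitespace: str.startswith also accepts a tuple of prefixes, and split can cut off the rest of the line. The HISTORY entry and the name first_fields are made up for illustration.

```python
some_array = [
    'MATH_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
    'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
    'ART_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
    'HISTORY_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',  # made-up entry that should be skipped
]

PREFIXES = ('MATH_', 'SCIENCE_', 'ART_')

# str.startswith accepts a tuple of prefixes, so any() is not required.
# split(maxsplit=1)[0] keeps only the text before the first run of whitespace.
first_fields = [s.split(maxsplit=1)[0] for s in some_array if s.startswith(PREFIXES)]

print(first_fields)
# ['MATH_SOME_TEXT_AND_NUMBER', 'SCIENCE_SOME_TEXT_AND_NUMBER', 'ART_SOME_TEXT_AND_NUMBER']
```

This only finds a prefix at the very start of each string, whereas the regular expression finds it anywhere in the line.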

Answer 2

You could also start the pattern with a word boundary to prevent a partial-word match, then match optional non-whitespace characters after one of the alternatives. Append a single space to the pattern, as the example further down does, if the trailing space should be part of the match:

\b(?:MATH|SCIENCE|ART)_\S*


Or, to allow only word characters (\w) after the prefix:

\b(?:MATH|SCIENCE|ART)_\w*
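
To see where the two patterns differ, here is a small sketch on made-up inputs (the sample strings are assumptions, not taken from the question): \S* runs up to the next whitespace, \w* stops at the first character that is not a letter, digit or underscore, and \b rejects a prefix that begins in the middle of a longer word.

```python
import re

# Hypothetical inputs chosen to highlight the differences.
samples = ['MATH_101-A notes', 'SMART_200 notes', 'ART_3.5 notes']

non_space = re.compile(r'\b(?:MATH|SCIENCE|ART)_\S*')
word_only = re.compile(r'\b(?:MATH|SCIENCE|ART)_\w*')

for s in samples:
    print(repr(s), non_space.findall(s), word_only.findall(s))
# 'MATH_101-A notes' ['MATH_101-A'] ['MATH_101']
# 'SMART_200 notes' [] []
# 'ART_3.5 notes' ['ART_3.5'] ['ART_3']
```

In 'SMART_200' there is no word boundary between M and A, so neither pattern matches ART_ there.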

Example

import re

some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
              'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
              'ART_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE']

pattern = re.compile(r"\b(?:MATH|SCIENCE|ART)_\S* ")
for s in some_array:
    print(pattern.findall(s))

Output

['MATH_SOME_TEXT_AND_NUMBER ']
['SCIENCE_SOME_TEXT_AND_NUMBER ']
['ART_SOME_TEXT_AND_NUMBER ']


Answer 1: 1, 2023-01-24 16:21:36

Answer 2: 0, accepted, 2023-01-24 16:25:34
