How do I extract specific matches using regular expressions in Python?

Question

I am trying to extract some matches from a list using a regular expression in Python.

Here is an example of the list I have:

x = ['PF13833.6', 'EF-hand_8', 'EF-hand domain pair', '34-72', 'E:1.6e-05`PF00036.32', 'EF-hand_1', 'EF hand', '48-73', 'E:1.6e-06`PF13202.6', 'EF-hand_5', 'EF hand', '49-71', 'E:0.004`PF13499.6', 'EF-hand_7', 'EF-hand domain pair', '86-148', 'E:9.6e-16`PF13405.6', 'EF-hand_6', 'EF-hand domain', '87-115', 'E:1.9e-06`PF13833.6', 'EF-hand_8', 'EF-hand domain pair', '100-148', 'E:5.2e-11`PF00036.32', 'EF-hand_1', 'EF hand', '123-149', 'E:5.5e-08`PF13202.6', 'EF-hand_5', 'EF hand', '129-148', 'E:0.00047']

And here is the regular expression I tried, which successfully extracts the PF ids:

re.findall(r'PF\d+\.\d+', str(x), re.MULTILINE|re.IGNORECASE)
['PF13833.6', 'PF00036.32', 'PF13202.6', 'PF13499.6', 'PF13405.6', 'PF13833.6', 'PF00036.32', 'PF13202.6']
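
As a side note, the same ids can probably be collected without converting the list to a string first. The sketch below assumes every PF id is written with an uppercase `PF` prefix, so no flags are passed; the names `PF_RE` and `pf_ids` are illustrative only, and the sample is just the first nine elements of the list above.

```python
import re

# First nine elements of the list from the question
x = ['PF13833.6', 'EF-hand_8', 'EF-hand domain pair', '34-72',
     'E:1.6e-05`PF00036.32', 'EF-hand_1', 'EF hand', '48-73',
     'E:1.6e-06`PF13202.6']

# Compile once; the pattern is the one from the question
PF_RE = re.compile(r'PF\d+\.\d+')

# Search each element on its own and keep only the matched id text
pf_ids = [m.group(0) for m in map(PF_RE.search, x) if m]
print(pf_ids)  # ['PF13833.6', 'PF00036.32', 'PF13202.6']
```

Searching element by element keeps the list structure intact, which matters once the element after each match is needed.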

But I also want to extract the next word after each match. For example:

['PF13833.6', 'EF-hand_8', 'PF00036.32', 'EF-hand_1', ...]

How can I modify my pattern to get the required output?

Answer 1

You can use regular expressions and Boolean indexing with Pandas:

import pandas as pd

# put your data in a Pandas Series
x = pd.Series(['PF13833.6', 'EF-hand_8', 'EF-hand domain pair', '34-72', 'E:1.6e-05`PF00036.32', 
               'EF-hand_1', 'EF hand', '48-73', 'E:1.6e-06`PF13202.6', 'EF-hand_5', 'EF hand', 
               '49-71', 'E:0.004`PF13499.6', 'EF-hand_7', 'EF-hand domain pair', '86-148', 
               'E:9.6e-16`PF13405.6', 'EF-hand_6', 'EF-hand domain', '87-115', 
               'E:1.9e-06`PF13833.6', 'EF-hand_8', 'EF-hand domain pair', '100-148', 
               'E:5.2e-11`PF00036.32', 'EF-hand_1', 'EF hand', '123-149', 'E:5.5e-08`PF13202.6', 
               'EF-hand_5', 'EF hand', '129-148', 'E:0.00047'])

# your regular expression for the PF ids 
PF_re = r'PF\d+\.\d+'

# find the PF ids
PF_ids = x.str.findall(PF_re)
# get rid of the lists in the result
PF_ids = PF_ids.str[0]

# create a Boolean Series to use as an index for those elements of x that contain a PF id
PF_index = x.str.contains(PF_re)
# shift this index by one position to mark the element after each PF id;
# fill_value=False keeps the Series Boolean instead of introducing a missing value
next_index = PF_index.shift(fill_value=False)

# put the results in a DataFrame and show them
results = pd.DataFrame({'PF id': list(PF_ids[PF_index]), 
                        'next word': list(x[next_index])})
# display() is only available in IPython/Jupyter; print() works everywhere
print(results)

Output:

    PF id       next word
0   PF13833.6   EF-hand_8
1   PF00036.32  EF-hand_1
2   PF13202.6   EF-hand_5
3   PF13499.6   EF-hand_7
4   PF13405.6   EF-hand_6
5   PF13833.6   EF-hand_8
6   PF00036.32  EF-hand_1
7   PF13202.6   EF-hand_5
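
If Pandas is not available, the flat list asked for in the question can likely be built with the standard library alone. This is a minimal sketch, not part of the answer above: it assumes the "next word" is always the list element that directly follows the one containing the PF id, and the helper name `pf_with_next` is hypothetical. The sample is the first ten elements of the list from the question.

```python
import re

# First ten elements of the list from the question
x = ['PF13833.6', 'EF-hand_8', 'EF-hand domain pair', '34-72',
     'E:1.6e-05`PF00036.32', 'EF-hand_1', 'EF hand', '48-73',
     'E:1.6e-06`PF13202.6', 'EF-hand_5']

PF_RE = re.compile(r'PF\d+\.\d+')


def pf_with_next(items):
    """Return [id, next_item, id, next_item, ...] for each item that
    contains a PF id and has a following item."""
    out = []
    # Pair each element with its successor; the last element has none
    for current, following in zip(items, items[1:]):
        match = PF_RE.search(current)
        if match:
            out.extend([match.group(0), following])
    return out


print(pf_with_next(x))
# ['PF13833.6', 'EF-hand_8', 'PF00036.32', 'EF-hand_1', 'PF13202.6', 'EF-hand_5']
```

Note that a PF id in the very last element is skipped, because there is no following element to pair it with.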

Answer 1, accepted 2020-03-20 01:07:41
