How do I extract specified matches using a regular expression in Python?

I am trying to extract some matches using a regular expression in Python.

Here is an example of a list I have:

x = ['PF13833.6', 'EF-hand_8', 'EF-hand domain pair', '34-72', 'E:1.6e-05`PF00036.32', 'EF-hand_1', 'EF hand', '48-73', 'E:1.6e-06`PF13202.6', 'EF-hand_5', 'EF hand', '49-71', 'E:0.004`PF13499.6', 'EF-hand_7', 'EF-hand domain pair', '86-148', 'E:9.6e-16`PF13405.6', 'EF-hand_6', 'EF-hand domain', '87-115', 'E:1.9e-06`PF13833.6', 'EF-hand_8', 'EF-hand domain pair', '100-148', 'E:5.2e-11`PF00036.32', 'EF-hand_1', 'EF hand', '123-149', 'E:5.5e-08`PF13202.6', 'EF-hand_5', 'EF hand', '129-148', 'E:0.00047']

And here is the regular expression I tried, which worked to extract the PF ids:

re.findall(r'PF\d+\.\d+', str(x), re.MULTILINE|re.IGNORECASE)
['PF13833.6', 'PF00036.32', 'PF13202.6', 'PF13499.6', 'PF13405.6', 'PF13833.6', 'PF00036.32', 'PF13202.6']

But I also want to extract the next word after each match. For example:

['PF13833.6', 'EF-hand_8', 'PF00036.32', 'EF-hand_1', ...] and so on.

How can I modify my pattern to achieve the required output?

You can use regular expressions and Boolean indexing with Pandas:

import pandas as pd

# put your data in a Pandas Series
x = pd.Series(['PF13833.6', 'EF-hand_8', 'EF-hand domain pair', '34-72', 'E:1.6e-05`PF00036.32', 
               'EF-hand_1', 'EF hand', '48-73', 'E:1.6e-06`PF13202.6', 'EF-hand_5', 'EF hand', 
               '49-71', 'E:0.004`PF13499.6', 'EF-hand_7', 'EF-hand domain pair', '86-148', 
               'E:9.6e-16`PF13405.6', 'EF-hand_6', 'EF-hand domain', '87-115', 
               'E:1.9e-06`PF13833.6', 'EF-hand_8', 'EF-hand domain pair', '100-148', 
               'E:5.2e-11`PF00036.32', 'EF-hand_1', 'EF hand', '123-149', 'E:5.5e-08`PF13202.6', 
               'EF-hand_5', 'EF hand', '129-148', 'E:0.00047'])

# your regular expression for the PF ids 
PF_re = r'PF\d+\.\d+'

# find the PF ids
PF_ids = x.str.findall(PF_re)
# get rid of the lists in the result
PF_ids = PF_ids.str[0]

# create a Boolean Series to use as an index for those elements of x that contain a PF id
PF_index = x.str.contains(PF_re)
# shift this index by one position to get an index for the next words;
# fill_value=False keeps the Series Boolean instead of leaving a missing value in the first entry
next_index = PF_index.shift(fill_value=False)

# put the results in a DataFrame and show them
results = pd.DataFrame({'PF id': list(PF_ids[PF_index]), 
                        'next word': list(x[next_index])})
display(results)  # display() is available in Jupyter/IPython; use print(results) in a plain script

Output:

    PF id       next word
0   PF13833.6   EF-hand_8
1   PF00036.32  EF-hand_1
2   PF13202.6   EF-hand_5
3   PF13499.6   EF-hand_7
4   PF13405.6   EF-hand_6
5   PF13833.6   EF-hand_8
6   PF00036.32  EF-hand_1
7   PF13202.6   EF-hand_5
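
If Pandas is not needed, the same idea (find the elements that contain a PF id, then take the element that follows each one) can be sketched with only the re module. This is a minimal sketch, not part of the answer above: the helper name pf_ids_with_next is made up for illustration, the list is shortened from the question's data, and it assumes the "next word" is always the next list element. A PF id in the very last element has nothing after it and is skipped.

```python
import re

# same pattern as in the question, compiled once
PF_RE = re.compile(r'PF\d+\.\d+')

def pf_ids_with_next(items):
    """Return a flat list: each PF id followed by the list element after it."""
    out = []
    # pair every element with its successor; the last element has no successor
    for current, following in zip(items, items[1:]):
        match = PF_RE.search(current)
        if match:
            out.extend([match.group(), following])
    return out

# shortened version of the list from the question
x = ['PF13833.6', 'EF-hand_8', 'EF-hand domain pair', '34-72',
     'E:1.6e-05`PF00036.32', 'EF-hand_1', 'EF hand', '48-73', 'E:0.00047']

print(pf_ids_with_next(x))
# ['PF13833.6', 'EF-hand_8', 'PF00036.32', 'EF-hand_1']
```

This produces the flat list layout asked for in the question rather than a two-column table; collecting (match.group(), following) tuples instead would give pairs.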
