Python regex extract based on a specific substring

I have a dataframe containing sentences like the following, but with more rows:

data= {"text":["see you in five minutes.", "she is my friend.", "she goes to school in five minutes."]}

I would like to extract the sentences containing 'five minutes', split into the text before the phrase and the phrase itself, as shown below:

desired output:

     first part              desired part     
0    see you in              five minutes.
1    NaN                     NaN
2    she goes to school in   five minutes.

I am using the following code, but it returns NaN:

data.text.str.extract(r"(?i)(?P<before>.*)\s(?P<minutes>(?=five minutes\s))\w+ \w+")    

Your lookahead requires a whitespace character after "minutes", but in your sentences a period follows instead:

(?i)(?P<before>.*)\s(?P<minutes>(?=five minutes\s))\w+ \w+
#                                              ^^^
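One way to confirm this is to test the lookahead on its own against a sample sentence. The snippet below is only a minimal sketch: the variable names `sentence`, `strict` and `relaxed` are illustrative, and the `\s*` variant is there for comparison, not as a complete fix.

```python
import re

sentence = "see you in five minutes."

# Lookahead as written in the question: it demands a whitespace
# character immediately after "minutes".
strict = re.search(r"(?i)(?=five minutes\s)", sentence)

# Same lookahead with \s* (zero or more whitespace characters).
relaxed = re.search(r"(?i)(?=five minutes\s*)", sentence)

print(strict)           # None: "minutes" is followed by "."
print(relaxed.start())  # index where "five minutes" begins
```

Because the strict lookahead never succeeds on these sentences, the whole pattern fails and `str.extract` returns NaN for every row.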

Either use the star quantifier (zero or more times) or rethink your expression. The following works:

import pandas as pd

data = {"text": ["see you in five minutes.", "she is my friend.", "she goes to school in five minutes."]}

df = pd.DataFrame(data)
# "before": everything up to the phrase (lazy match);
# "after": the phrase and the rest of the sentence.
df2 = df.text.str.extract(r"(?i)(?P<before>.*?)(?=five minutes)(?P<after>.*)")
print(df2)

This yields:

                   before          after
0             see you in   five minutes.
1                     NaN            NaN
2  she goes to school in   five minutes.
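Note that the `before` group above keeps the space in front of "five minutes". If that matters, a possible variant (a sketch under my own assumptions, not the only way to write it) moves the separating whitespace outside the named groups and captures the phrase directly instead of using a lookahead. The group name `minutes` and the `\S*` suffix for trailing punctuation are choices made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({"text": ["see you in five minutes.",
                            "she is my friend.",
                            "she goes to school in five minutes."]})

# before : lazy match of everything up to the phrase
# \s*    : separating whitespace, kept outside both groups
# minutes: the phrase plus any non-whitespace that follows (e.g. ".")
pattern = r"(?i)(?P<before>.*?)\s*(?P<minutes>five minutes\S*)"

result = df["text"].str.extract(pattern)
print(result)
```

Rows without the phrase still come back as NaN in both columns, as in the desired output.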
