Python regex extraction based on a specific substring

Question

I have a dataframe containing sentences like the following, but with more rows:

data= {"text":["see you in five minutes.", "she is my friend.", "she goes to school in five minutes."]}

I would like to extract the sentences containing 'five minutes' and split them as shown below:

desired output:

     first part              desired part     
0    see you in              five minutes.
1    NaN                     NaN
2    she goes to school in   five minutes.

I am using the following code, but it returns NaN:

data.text.str.extract(r"(?i)(?P<before>.*)\s(?P<minutes>(?=five minutes\s))\w+ \w+")

Answer 1

Your lookahead requires a whitespace character right after "five minutes", but in your data the phrase is followed by a period, so the pattern never matches:

(?i)(?P<before>.*)\s(?P<minutes>(?=five minutes\s))\w+ \w+
#                                              ^^^
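To see the failure in isolation, here is a minimal sketch using the standard `re` module on one of the sample sentences. The variable names (`strict`, `relaxed`, `patched`) are only illustrative. It also shows that even with the whitespace made optional, the `minutes` group would capture an empty string, because that group contains nothing but a zero-width lookahead:

```python
import re

sentence = "see you in five minutes."

# Original lookahead: demands whitespace right after "five minutes".
strict = re.search(r"(?i)(?=five minutes\s)", sentence)

# Relaxed lookahead: the whitespace is optional (zero or more).
relaxed = re.search(r"(?i)(?=five minutes\s*)", sentence)

print(strict)           # None: a "." follows the phrase, not whitespace
print(relaxed.start())  # 11: the index where "five minutes" begins

# The full original pattern with only the \s changed to \s*.
patched = re.search(
    r"(?i)(?P<before>.*)\s(?P<minutes>(?=five minutes\s*))\w+ \w+", sentence
)
print(repr(patched.group("before")))   # 'see you in'
print(repr(patched.group("minutes")))  # '': a lookahead consumes no text
```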

Either make that whitespace optional with the star quantifier (`\s*`, zero or more times) or rethink your expression. The following works:

import pandas as pd

data= {"text":["see you in five minutes.", "she is my friend.", "she goes to school in five minutes."]}

df = pd.DataFrame(data)
df2 = df.text.str.extract(r"(?i)(?P<before>.*?)(?=five minutes)(?P<after>.*)")
print(df2)

And yields:

                   before          after
0             see you in   five minutes.
1                     NaN            NaN
2  she goes to school in   five minutes.
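Note that with the pattern above, the `before` column keeps the space that precedes the phrase. If that matters, one possible variation (an assumption on my part, not part of the accepted code) is to capture the phrase directly instead of using a lookahead and to let `\s*` absorb the separating whitespace. The sketch below checks the idea with the standard `re` module; `PATTERN` and `split_on_phrase` are hypothetical names introduced for illustration:

```python
import re

# Hypothetical variant: the phrase is captured directly rather than through
# a lookahead, and \s* keeps the separating space out of "before".
PATTERN = re.compile(r"(?i)(?P<before>.*?)\s*(?P<after>five minutes.*)")


def split_on_phrase(text):
    """Return (before, after), or (None, None) when the phrase is absent."""
    match = PATTERN.match(text)
    if match is None:
        return (None, None)
    return (match.group("before"), match.group("after"))


sentences = [
    "see you in five minutes.",
    "she is my friend.",
    "she goes to school in five minutes.",
]

results = [split_on_phrase(s) for s in sentences]
for row in results:
    print(row)
# ('see you in', 'five minutes.')
# (None, None)
# ('she goes to school in', 'five minutes.')
```

Because the pattern uses named groups, the same pattern string should also be usable with `df.text.str.extract(...)`, which names the resulting columns after the groups.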

Answer 1 (accepted, 2020-06-25 07:54:27)
