Python regex: extract based on a specific substring
I have a dataframe containing sentences like the following, but with more rows:
data= {"text":["see you in five minutes.", "she is my friend.", "she goes to school in five minutes."]}
I would like to extract the sentences containing 'five minutes' in the manner presented below:
desired output:
   first part             desired part
0  see you in             five minutes.
1  NaN                    NaN
2  she goes to school in  five minutes.
I am using the following code, but it returns NaN:
data.text.str.extract(r"(?i)(?P<before>.*)\s(?P<minutes>(?=five minutes\s))\w+ \w+")
Your lookahead requires a whitespace character after "five minutes", but in your sentences a period follows instead:
(?i)(?P<before>.*)\s(?P<minutes>(?=five minutes\s))\w+ \w+
#                                              ^^
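A quick way to confirm this, sketched with the standard re module: run the question's pattern against one of the original sentences and against a made-up sentence in which a space does follow "five minutes". The second sentence is an assumption for illustration only and does not come from the question.

```python
import re

# The pattern from the question, unchanged.
pattern = re.compile(
    r"(?i)(?P<before>.*)\s(?P<minutes>(?=five minutes\s))\w+ \w+"
)

# Original sentence: "five minutes" is followed by a period, so the
# \s inside the lookahead can never succeed and there is no match.
no_match = pattern.search("see you in five minutes.")

# Hypothetical sentence (not from the question) where whitespace
# does follow "five minutes", so the lookahead succeeds.
match = pattern.search("see you in five minutes later")

print(no_match)
print(repr(match.group("before")))
print(repr(match.group("minutes")))
```

Note that even when the pattern matches, the minutes group holds only a lookahead, which consumes no characters, so it captures an empty string rather than the text "five minutes".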
Either use the star quantifier (zero or more times, i.e. \s*) or rethink your expression. The following works:
import pandas as pd
data= {"text":["see you in five minutes.", "she is my friend.", "she goes to school in five minutes."]}
df = pd.DataFrame(data)
df2 = df.text.str.extract(r"(?i)(?P<before>.*?)(?=five minutes)(?P<after>.*)")
print(df2)
And yields
   before                 after
0  see you in             five minutes.
1  NaN                    NaN
2  she goes to school in  five minutes.
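If the trailing space left in the before column is unwanted, one possible variation is to let \s* consume the whitespace ahead of the phrase and to capture "five minutes" directly instead of using a lookahead. This is only a sketch; the variable names parts and matched are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({"text": [
    "see you in five minutes.",
    "she is my friend.",
    "she goes to school in five minutes.",
]})

# Lazy .*? stops as early as possible; \s* swallows the whitespace
# before the phrase so it ends up in neither capture group.
pattern = r"(?i)(?P<before>.*?)\s*(?P<after>five minutes.*)"
parts = df["text"].str.extract(pattern)
print(parts)

# Rows that did not match are all-NaN and can be dropped if desired.
matched = parts.dropna()
print(matched)
```

Rows without the phrase still come back as NaN, which matches the desired output; dropna() is only needed if those rows should be discarded.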