How to extract a sentence with start marker?
I have a list of strings. I want to extract from each the first sentence starting with "I am". For example:
"I am OK. How about you?" -> "I am OK."
"I think I am better than that." -> "I am better than that."
I'm using a very simple function that works most of the time:
def extract_sentence(text):
    start_idx = text.find("I am")
    dot_idx = text[start_idx:].find(".")
    return text[start_idx:start_idx + dot_idx + 1]
But it won't work in some instances. For example:
"I am enjoying https://stackoverflow.com/. People are very helpful there." -> "I am enjoying https://stackoverflow." (should have been "I am enjoying https://stackoverflow.com/.")
"I am worried about Josh" -> "" (no ending dot; should have been a copy of the input, i.e. "I am worried about Josh")
I believe some regex would help me here, but I don't know how to start. Any ideas?
You could use a positive lookahead to end the match at a period that is followed by whitespace, or at the end of the line. Consider the following regex:
I am .*?(?:\.(?=\s)|$)
Explanation:
I am : Literally "I am "
.*? : Zero or more of any character (except a newline), lazy match
(?: ) : Non-capturing group
    \. : A literal period
    (?= ) : Positive lookahead
        \s : A whitespace character
    |$ : Or the end of the string / line
Try it on regex101.
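For readers who prefer the explanation next to the pattern itself, here is a minimal sketch of the same regex written with Python's re.VERBOSE flag. The name SENTENCE is made up for this illustration; the pattern is meant to be equivalent to the one above, not a different technique.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each piece can carry a comment.
# In verbose mode unescaped whitespace is ignored, so literal spaces are written as [ ].
SENTENCE = re.compile(
    r"""
    I[ ]am[ ]     # literally "I am "
    .*?           # zero or more characters (no newlines), as few as possible
    (?:           # non-capturing group: how the sentence may end
        \.(?=\s)  #   a period that is followed by whitespace
      | $         #   or the end of the string
    )
    """,
    re.VERBOSE,
)

match = SENTENCE.search(
    "I am enjoying https://stackoverflow.com/. People are very helpful there."
)
print(match.group(0))
```

The periods inside the URL are skipped because neither is followed by whitespace, so the lazy match keeps growing until it reaches the period before " People".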
If you want to allow several characters to end a sentence, modify the regex to match those instead of only a period, for example:
I am .*?(?:[.?!](?=\s)|$)
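As a rough illustration of the multi-terminator idea, the sketch below assumes the sentence terminators are a period, a question mark, and an exclamation mark. That set and the sample strings are assumptions made for this example; adjust the character class to whatever your data needs.

```python
import re

# Assumed set of sentence terminators: period, question mark, exclamation mark.
pattern = re.compile(r"I am .*?(?:[.?!](?=\s)|$)")

samples = [
    "I am not sure, am I? Let me check.",
    "I am thrilled! Thanks a lot.",
    "I am OK. How about you?",
]

# Take the first match from each sample string.
results = [pattern.search(s).group(0) for s in samples]
for r in results:
    print(r)
```

Inside a character class the period and question mark are literal, so no escaping is needed there.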
In Python:
import re

strings = """I am OK. How about you?
I think I am better than that.
I am enjoying https://stackoverflow.com/. People are very helpful there.
I am worried about Josh""".split('\n')

regex = re.compile(r"I am .*?(?:\.(?=\s)|$)")

for s in strings:
    print("Input string:\n" + repr(s))
    m = regex.findall(s)
    if m:
        # Only the first "I am" sentence is wanted
        print("Matched string:\n" + repr(m[0]))
    else:
        print("No match")
    print("")

which gives the expected output:

Input string:
'I am OK. How about you?'
Matched string:
'I am OK.'

Input string:
'I think I am better than that.'
Matched string:
'I am better than that.'

Input string:
'I am enjoying https://stackoverflow.com/. People are very helpful there.'
Matched string:
'I am enjoying https://stackoverflow.com/.'

Input string:
'I am worried about Josh'
Matched string:
'I am worried about Josh'
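Since the question starts from a list of strings and an extract_sentence function, the regex could be dropped into that function roughly as follows. This is a sketch, not the answerer's code: returning "" when there is no match is an assumption that mirrors the original function, and the last sample string is invented to show that case.

```python
import re

# Compiled once and reused for every string in the list.
_SENTENCE_RE = re.compile(r"I am .*?(?:\.(?=\s)|$)")


def extract_sentence(text):
    """Return the first sentence starting with "I am ", or "" if there is none."""
    m = _SENTENCE_RE.search(text)
    return m.group(0) if m else ""


strings = [
    "I am OK. How about you?",
    "I think I am better than that.",
    "I am worried about Josh",
    "Nothing to see here.",  # made-up input with no "I am" sentence
]
extracted = [extract_sentence(s) for s in strings]
print(extracted)
```

re.search stops at the first match, which is all that is needed here. The sketch assumes each string is a single line: without re.DOTALL the dot does not cross newlines, and without re.MULTILINE the $ only matches at the very end of the string.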
A very simple solution is to match everything starting at "I am" up to, but not including, any period that is followed by whitespace (since you want to keep the periods in a URL):
import re

# Raw string, so the backslashes reach the regex engine as written
pattern = r'I am(?:[^\.]|\.(?!\s))*'

text = "I think I am enjoying https://stackoverflow.com/. People are very helpful there."
print(re.search(pattern, text).group(0))

text = "I am happy. I am satisfied."
print(re.search(pattern, text).group(0))
Output:
I am enjoying https://stackoverflow.com/
I am happy
The regex works because it matches anything starting with I am, followed by zero or more repetitions (*) of either a character that's not a period ([^\.]) or (|) a period that's not followed by whitespace (\.(?!\s)). All of that is wrapped in parentheses around the |, with ?: at the start to mark the group as non-capturing, so it does not need to be stored separately (for efficiency).
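Note that this pattern stops just before the closing period, whereas the question's expected results keep it. One possible tweak, offered as an assumption rather than as part of the answer above, is to append an optional \.? so the terminating period is included when there is one:

```python
import re

# Variant (an assumption, not part of the answer above): the trailing \.?
# also consumes the period that ends the sentence, if there is one.
pattern = re.compile(r"I am(?:[^.]|\.(?!\s))*\.?")

texts = [
    "I think I am enjoying https://stackoverflow.com/. People are very helpful there.",
    "I am happy. I am satisfied.",
    "I am worried about Josh",
]
matches = [pattern.search(t).group(0) for t in texts]
for m in matches:
    print(m)
```

A period at the very end of the string is already consumed by the repeated group, because the negative lookahead succeeds when nothing follows it; the extra \.? only matters for a period that is followed by whitespace.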