How to extract a sentence that begins with a start token?

Question

I have a list of strings. I want to extract from each the first sentence starting with "I am". For example:

"I am OK. How about you?" -> "I am OK."

"I think I am better than that." -> "I am better than that."

I'm using a very simple function that works most of the time:

def extract_sentence(text):
    start_idx = text.find("I am")
    dot_idx = text[start_idx:].find(".")

    return text[start_idx:start_idx + dot_idx + 1]

But it won't work in some instances. For example:

"I am enjoying https://stackoverflow.com/. People are very helpful there." -> "I am enjoying https://stackoverflow." (should have been "I am enjoying https://stackoverflow.com/.")

"I am worried about Josh" -> "" (no ending dot; should have been a copy of the input, i.e. "I am worried about Josh").

I believe some regex would help me here, but I don't know how to start. Any ideas?

Answer 1

You could use a positive lookahead to end the match at a period that is followed by whitespace, or at the end of the line. Consider the following regex:

I am .*?(?:\.(?=\s)|$)

Explanation:
I am                   : Literally "I am "
     .*?               : Zero or more of any character, lazy match
        (?:          ) : Non-capturing group
           \.          : A literal period
             (?=  )    : Positive lookahead
                \s     : Whitespace
                   |$  : Or the end of the string / line
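The same breakdown can be written directly into the pattern with Python's re.VERBOSE flag, which ignores unescaped whitespace and allows comments. This is only a sketch: the name SENTENCE_RE is made up for illustration, and the spaces in "I am " are escaped because verbose mode would otherwise discard them.

```python
import re

# SENTENCE_RE is a hypothetical name; the pattern is the one explained above.
SENTENCE_RE = re.compile(
    r"""
    I\ am\         # literally "I am " (spaces escaped because of re.VERBOSE)
    .*?            # zero or more of any character, as few as possible
    (?:            # non-capturing group: either ...
        \.(?=\s)   #   a literal period followed by whitespace
        |          #   or
        $          #   the end of the string
    )
    """,
    re.VERBOSE,
)

match = SENTENCE_RE.search("I think I am better than that.")
print(match.group(0) if match else "No match")
```

Compiling the pattern once and reusing it keeps the commented form in a single place.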

Try it on regex101

If you want to allow multiple characters to end a sentence, modify the regex to match those instead of only a period: I am .*?(?:[.?!](?=\s)|$)

In Python:
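A quick way to check the multi-terminator variant, assuming the sentence-ending characters of interest are ".", "?" and "!". The name SENTENCE_END_RE and the sample strings are illustrative and not taken from the question.

```python
import re

# Assumption: a sentence may end with ".", "?" or "!" followed by whitespace.
SENTENCE_END_RE = re.compile(r"I am .*?(?:[.?!](?=\s)|$)")

samples = [
    "I am late! Sorry about that.",
    "I am right? I hope so.",
    "I am worried about Josh",
]

results = []
for sample in samples:
    match = SENTENCE_END_RE.search(sample)
    # Fall back to a marker string when "I am " does not occur at all.
    results.append(match.group(0) if match else "No match")

print(results)
```

Inside a character class the period and question mark are literal, so they need no escaping there.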

import re

strings = """I am OK. How about you?
I think I am better than that.
I am enjoying https://stackoverflow.com/. People are very helpful there.
I am worried about Josh""".split('\n')

# Lazy match from "I am " up to a period followed by whitespace, or the end of the string
regex = re.compile(r"I am .*?(?:\.(?=\s)|$)")

for s in strings:
    print("Input string:\n" + repr(s))
    # findall returns every match; the first one is the sentence we want
    m = regex.findall(s)
    if m:
        print("Matched string:\n" + repr(m[0]))
    else:
        print("No match")
    print("")

which gives the expected output:

Input string:
'I am OK. How about you?'
Matched string:
'I am OK.'

Input string:
'I think I am better than that.'
Matched string:
'I am better than that.'

Input string:
'I am enjoying https://stackoverflow.com/. People are very helpful there.'
Matched string:
'I am enjoying https://stackoverflow.com/.'

Input string:
'I am worried about Josh'
Matched string:
'I am worried about Josh'
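
To plug this back into the question's setup, the pattern can be wrapped in a function with the same name and signature as the original extract_sentence. This is a sketch under one assumption the question does not state: when "I am " does not occur at all, the function returns an empty string. The module-level name _I_AM_SENTENCE is hypothetical.

```python
import re

# Hypothetical module-level name; the pattern is the one from this answer.
_I_AM_SENTENCE = re.compile(r"I am .*?(?:\.(?=\s)|$)")


def extract_sentence(text):
    """Return the first sentence starting with "I am ", or "" if there is none."""
    match = _I_AM_SENTENCE.search(text)
    # Assumption: an empty string signals "no match"; the question does not specify this.
    return match.group(0) if match else ""


texts = [
    "I am OK. How about you?",
    "I am enjoying https://stackoverflow.com/. People are very helpful there.",
    "Nothing to see here.",
]
extracted = [extract_sentence(t) for t in texts]
print(extracted)
```

Using search instead of findall stops at the first match, which is all the question asks for.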

Answer 2

A very simple solution is to take everything starting at "I am" and stop before any "." that is followed by a space (since you want to keep the periods inside a URL):

import re

# Raw strings keep the backslashes in the pattern from being treated as string escapes
text = "I think I am enjoying https://stackoverflow.com/. People are very helpful there."
print(re.search(r'I am(?:[^\.]|\.(?!\s))*', text).group(0))

text = "I am happy. I am satisfied."
print(re.search(r'I am(?:[^\.]|\.(?!\s))*', text).group(0))

Output:

I am enjoying https://stackoverflow.com/
I am happy

The regex works because it matches anything starting with I am , followed by zero or more repetitions ( * ) of either a character that is not a period ( [^\.] ) or ( | ) a period that is not followed by whitespace ( \.(?!\s) ). The alternation is wrapped in parentheses that begin with ?: , which makes the group non-capturing, so it is not stored as a separate group (for efficiency).
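
Note that this pattern stops before the closing period, whereas the examples in the question keep it. One possible adjustment, offered as a sketch and not as part of the original answer, is to append an optional \.? so the terminating period is included when there is one. The name WITH_PERIOD_RE is made up for illustration.

```python
import re

# Same pattern as in the answer, plus an optional trailing period
# (the trailing \.? is an addition, not part of the original answer).
WITH_PERIOD_RE = re.compile(r"I am(?:[^.]|\.(?!\s))*\.?")

texts = (
    "I think I am enjoying https://stackoverflow.com/. People are very helpful there.",
    "I am happy. I am satisfied.",
    "I am worried about Josh",
)

sentences = [WITH_PERIOD_RE.search(text).group(0) for text in texts]
for sentence in sentences:
    print(sentence)
```

When the input has no period at all, the optional \.? simply matches nothing and the whole remainder of the string is returned.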
