
Python extract sentence containing word

I am trying to extract all the sentences containing a specified word from a text.

txt="I like to eat apple. Me too. Let's go buy some apples."
txt = "." + txt
re.findall(r"\."+".+"+"apple"+".+"+"\.", txt)

but it returns:

[".I like to eat apple. Me too. Let's go buy some apples."]

instead of:

[".I like to eat apple.", "Let's go buy some apples."]

Any help, please?

No need for regex:

>>> txt = "I like to eat apple. Me too. Let's go buy some apples."
>>> [sentence + '.' for sentence in txt.split('.') if 'apple' in sentence]
['I like to eat apple.', " Let's go buy some apples."]

In [3]: re.findall(r"([^.]*?apple[^.]*\.)",txt)
Out[3]: ['I like to eat apple.', " Let's go buy some apples."]

In [7]: import re

In [8]: txt=".I like to eat apple. Me too. Let's go buy some apples."

In [9]: re.findall(r'([^.]*apple[^.]*)', txt)
Out[9]: ['I like to eat apple', " Let's go buy some apples"]

But note that @jamylak's split-based solution is faster:

In [10]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
1000000 loops, best of 3: 1.96 us per loop

In [11]: %timeit [s+ '.' for s in txt.split('.') if 'apple' in s]
1000000 loops, best of 3: 819 ns per loop

The speed difference is smaller, but still significant, for larger strings:

In [24]: txt = txt*10000

In [25]: %timeit re.findall(r'([^.]*apple[^.]*)', txt)
100 loops, best of 3: 8.49 ms per loop

In [26]: %timeit [s+'.' for s in txt.split('.') if 'apple' in s]
100 loops, best of 3: 6.35 ms per loop
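
Outside IPython, the same comparison can be approximated with the standard timeit module. This is only a sketch: the helper names with_regex and with_split are made up here for illustration, and the absolute numbers will depend on the machine and Python version.

```python
import re
import timeit

# Sample text from the answer above (with the leading dot).
txt = ".I like to eat apple. Me too. Let's go buy some apples."

def with_regex(text):
    # Regex approach: runs of non-dot characters that contain "apple".
    return re.findall(r"([^.]*apple[^.]*)", text)

def with_split(text):
    # Split approach: cut on dots, keep the pieces that contain "apple".
    return [s + "." for s in text.split(".") if "apple" in s]

for func in (with_regex, with_split):
    # Best of 3 repeats of 10,000 calls each, reported per call.
    best = min(timeit.repeat(lambda: func(txt), number=10_000, repeat=3))
    print(f"{func.__name__}: {best / 10_000 * 1e6:.2f} us per call")
```

Taking the minimum of several repeats mirrors the "best of 3" figure that %timeit reports.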

You can use str.split:

>>> txt="I like to eat apple. Me too. Let's go buy some apples."
>>> txt.split('. ')
['I like to eat apple', 'Me too', "Let's go buy some apples."]

>>> [ t for t in txt.split('. ') if 'apple' in t]
['I like to eat apple', "Let's go buy some apples."]
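
Splitting on '. ' drops the period from every sentence except the last one. If the punctuation should be kept, one possible variation (not part of the answer above) is to split with re.split on the whitespace that follows a sentence-ending mark. The pattern below is an assumption chosen for this sample text and will misfire on abbreviations such as "e.g.".

```python
import re

txt = "I like to eat apple. Me too. Let's go buy some apples."

# Split on whitespace that is preceded by '.', '!' or '?'.
# The lookbehind keeps the punctuation attached to each sentence.
sentences = re.split(r"(?<=[.!?])\s+", txt)

matches = [s for s in sentences if "apple" in s]
print(matches)
# ['I like to eat apple.', "Let's go buy some apples."]
```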
r"\."+".+"+"apple"+".+"+"\."

This line is a bit odd; why concatenate so many separate strings? You could just use r'\..+apple.+\.'.

Anyway, the problem with your regular expression is its greediness. By default, x+ will match x as often as it possibly can. So your .+ will match as many characters (any characters) as possible, including dots and apples.

What you want to use instead is a non-greedy expression; you can usually do this by adding a ? at the end: .+?

This gives the following result:

['.I like to eat apple. Me too.']

As you can see, you no longer get both apple sentences, but you still get the "Me too." That is because .+? has to match at least one character after the apple, and that character is the sentence's own dot, so the match cannot end there and runs on to capture the following sentence as well.

A working regular expression would be this: r'\.[^.]*?apple[^.]*?\.'

Here you don't match just any character, but only characters that are not dots themselves. We also allow matching no characters at all (because after the apple in the first sentence there are no non-dot characters). Using that expression results in this:

['.I like to eat apple.', ". Let's go buy some apples."]
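
The three patterns discussed in this answer can be compared side by side. This is a minimal sketch that uses the question's sample text with the leading dot prepended; the variable names greedy, lazy and bounded are chosen here for illustration only.

```python
import re

# Sample text from the question, with the leading dot the asker prepends.
txt = ".I like to eat apple. Me too. Let's go buy some apples."

# Greedy: .+ runs as far as it can, so the whole text is one match.
greedy = re.findall(r"\..+apple.+\.", txt)

# Lazy: .+? stops early, but must still consume the dot right after "apple".
lazy = re.findall(r"\..+?apple.+?\.", txt)

# Negated class: [^.] can never cross a sentence boundary.
bounded = re.findall(r"\.[^.]*?apple[^.]*?\.", txt)

print(greedy)   # one match covering the entire text
print(lazy)     # ['.I like to eat apple. Me too.']
print(bounded)  # ['.I like to eat apple.', ". Let's go buy some apples."]
```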

Obviously, the sample in the question is "extract sentence containing substring" rather than "extract sentence containing word". The "extract sentence containing word" problem can be solved in Python as follows:

A word can be at the beginning, in the middle, or at the end of the sentence. Not limited to the example in the question, I would provide a general function for searching for a word in a sentence:

import re

def searchWordinSentence(word,sentence):
    # Match the word in the middle, at the start, or at the end of the sentence.
    pattern = re.compile(' '+word+' |^'+word+' | '+word+'$')
    return bool(re.search(pattern,sentence))

Limited to the example in the question, we can solve it like this:

txt="I like to eat apple. Me too. Let's go buy some apples."
word = "apple"
print([t for t in txt.split('. ') if searchWordinSentence(word, t)])

The corresponding output is:

['I like to eat apple']
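
A different way to express the same whole-word test is the regex word-boundary anchor \b, which also handles words followed by punctuation. The sketch below is an alternative to the function above, not a rewrite of it; the name sentences_with_word and the naive split on periods are assumptions made for illustration.

```python
import re

def sentences_with_word(text, word):
    """Return the sentences of `text` that contain `word` as a whole word."""
    # \b matches at a word boundary, so "apple" does not match inside "apples".
    # re.escape guards against regex metacharacters in the search word.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b")
    # Naive sentence split on periods; abbreviations like "e.g." would break it.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    return [s for s in sentences if pattern.search(s)]

txt = "I like to eat apple. Me too. Let's go buy some apples."
print(sentences_with_word(txt, "apple"))
# ['I like to eat apple']
```

The match is case-sensitive as written; passing re.IGNORECASE to re.compile would relax that.
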
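
import nltk

# sent_tokenize needs NLTK's sentence tokenizer data to be downloaded first
# (the exact resource name depends on the NLTK version).
search = "test"
text = "This is a test text! Best text ever. Cool"
contains = [s for s in nltk.sent_tokenize(text) if search in s]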
