
Python extract sentence after a keyword is found

I have a string built from some text I have extracted, and a list of keywords. I would like to run through the string and extract only the sentence that follows each sentence in which a keyword is found, removing the full stop as well.

String:

'Test string. removing data. keyword extraction. data number. 11123. final answer.'

Here is my list of key phrases:

lst = ['Test string', 'data number']

Desired output:

['removing data', '11123']

Could someone please help me out or point me in the right direction? Thanks.

Here is my suggestion:

s='Test string. removing data. keyword extraction. data number. 11123. final answer.'

temp = [i.strip() for i in s.split('.')]

res = [temp[temp.index(i)+1] for i in lst]

print(res)

Output:

['removing data', '11123']

What it does:

temp = [i.strip() for i in s.split('.')]

s.split('.') converts your string into a list of strings, split on the dot. So you get each sentence separately:

['Test string', ' removing data', ' keyword extraction', ' data number', ' 11123', ' final answer', '']

This is done inside a list comprehension, which creates a new list from the one above with stripped values (i.strip() removes leading and trailing whitespace). So you end up with:

['Test string', 'removing data', 'keyword extraction', 'data number', '11123', 'final answer', '']

On the last step there are two interesting things:

  1. We use the list.index() method, which gives us the index of the searched item. Then it is easy to get the next element.
  2. This is fast when you have a big string and few search items, but you should be careful, because list.index() raises a ValueError if you search for a non-existing item.
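To make the failure mode in point 2 concrete, here is a minimal sketch of guarding the list.index() lookup. The helper name next_sentence is hypothetical (it is not part of the answer above), and returning None for a missing keyword is just one possible design choice.

```python
s = 'Test string. removing data. keyword extraction. data number. 11123. final answer.'
temp = [part.strip() for part in s.split('.')]


def next_sentence(sentences, keyword):
    """Return the sentence after `keyword`, or None if it is absent or last.

    Hypothetical helper for illustration only.
    """
    try:
        idx = sentences.index(keyword)  # raises ValueError when keyword is missing
    except ValueError:
        return None
    # Guard against a keyword that happens to be the final element.
    return sentences[idx + 1] if idx + 1 < len(sentences) else None


found = next_sentence(temp, 'data number')
missing = next_sentence(temp, 'no such phrase')
print(found, missing)
```

Note that list.index() only reports the first occurrence, so a key phrase that appears twice in the text would be matched once.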

It is safer to do it in a straightforward way:

res = [temp[idx+1] for idx, val in enumerate(temp) if val in lst]

For more information on enumerate, check the documentation.
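As a rough sketch, the enumerate-based approach could be wrapped in a small function. The name sentences_after is an assumption made for illustration. This variant also drops the empty string that split('.') leaves after the final dot, and skips the last sentence so that idx + 1 is always a valid index.

```python
def sentences_after(text, keywords):
    """Return the sentence following each sentence that equals a keyword.

    Hypothetical helper; assumes sentences are separated by '.'.
    """
    # Split on the dot, strip whitespace, and discard empty fragments.
    sentences = [part.strip() for part in text.split('.') if part.strip()]
    wanted = set(keywords)  # set membership tests are fast
    # Iterate over all but the last sentence so sentences[idx + 1] exists.
    return [sentences[idx + 1]
            for idx, val in enumerate(sentences[:-1])
            if val in wanted]


s = 'Test string. removing data. keyword extraction. data number. 11123. final answer.'
lst = ['Test string', 'data number']
res = sentences_after(s, lst)
print(res)
```

A keyword that is missing from the text, or that matches only the last sentence, simply contributes nothing to the result instead of raising an error.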

Here's one solution. Essentially, you split the input on the dot and space to make a list. Then you iterate over the list and check whether each element is one of the key phrases. If it is, you add the next element to your output list.

Code:

# Named `text` rather than `input` to avoid shadowing the built-in input().
text = 'Test string. removing data. keyword extraction. data number. 11123. final answer.'
sentences = text.split('. ')
lst = ['Test string', 'data number']
result = []
# Stop one short of the end so that sentences[i + 1] always exists.
for i in range(len(sentences) - 1):
    if sentences[i] in lst:
        result.append(sentences[i + 1])
print(result)

Result:

['removing data', '11123']

Use a list comprehension, re.split and enumerate:

import re
my_str = 'Test string. removing data. keyword extraction. data number. 11123. final answer.'
key_phrases = ['Test string', 'data number']
my_str_phrases = re.split(r'[.]\s*', my_str)
print([my_str_phrases[idx + 1] for idx, item in enumerate(my_str_phrases) if item in key_phrases])
# ['removing data', '11123']

Note:
[.]\s* : a literal dot (it needs to be either part of a character class [] or escaped like this: \.), followed by 0 or more occurrences of whitespace.
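The two ways of writing a literal dot can be compared directly. This is a small sketch; the variable names with_class, with_escape and unescaped are chosen here for illustration.

```python
import re

my_str = 'Test string. removing data. keyword extraction. data number. 11123. final answer.'

# Dot inside a character class versus an escaped dot: both match a literal '.'.
with_class = re.split(r'[.]\s*', my_str)
with_escape = re.split(r'\.\s*', my_str)
print(with_class)
print(with_class == with_escape)

# An unescaped dot matches any character, so every character becomes a
# separator and only empty strings remain.
unescaped = re.split(r'.\s*', 'a. b')
print(unescaped)
```

Because the string ends with a dot, both literal-dot patterns leave an empty string as the last element of the result.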

