Extract words from a sentence that contain a substring

I want to extract the full phrase (one or more words) that contains a specific substring. The substring can consist of one or more words, and it can begin or end in the middle of a word in test_string, but the desired output is the full phrase/word from test_string. For example:

test_string = 'this is an example of the text that I have, and I want to by amplifier and lamp'
substring1 = 'he text th'
substring2 = 'amp'

if substring1 in test_string:
    print("substring1 found")
    
if substring2 in test_string:
    print("substring2 found")

My desired output is:

[the text that]
[example, amplifier, lamp]

FYI

The substring can be at the beginning, middle or end of a word; it does not matter.

This is a job for regex. You could do:

import re
substring2 = 'amp'
test_string = 'this is an example of the text that I have'

print("matches for substring 1:",re.findall(r"(\w+he text th\w+)", test_string))
print("matches for substring 2:",re.findall(r"(\w+amp\w+)",test_string))

Output:

matches for substring 1: ['the text that']
matches for substring 2: ['example']

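The patterns above use \w+, which requires at least one word character on each side of the substring, so they would miss words such as amplifier or lamp. If you want something more robust, I would do something like this: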

re.findall(r"((?:\w+)?" + re.escape(substring2) + r"(?:\w+)?)", test_string)

This way the substring can contain any characters, because re.escape makes regex metacharacters in it match literally.
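As a minimal sketch of why the escaping matters, the snippet below compares the same pattern with and without re.escape. The text and the substring '1.5' are made up for illustration and are not from the question; the unescaped dot matches any character, so it also picks up '125'.

```python
import re

# Hypothetical sample data, chosen only to show the effect of the "." metacharacter.
text = "version 1.5 and build 125"
substring = "1.5"

# Unescaped: "." matches any character, so "125" is matched as well.
unescaped = re.findall(r"(?:\w+)?" + substring + r"(?:\w+)?", text)

# Escaped: "." is treated as a literal dot.
escaped = re.findall(r"(?:\w+)?" + re.escape(substring) + r"(?:\w+)?", text)

print(unescaped)  # ['1.5', '125']
print(escaped)    # ['1.5']
```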

Explanation of the regex:

'(?:\w+)'   non-capturing group matching one or more word characters
'?'         zero or one occurrence of the preceding group

I have done this at the beginning and at the end of your substring, because the substring can start or end in the middle of a word, and the optional groups pick up whichever part of the word is missing.
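Putting the pieces together, here is a small sketch that wraps the pattern in a helper and runs it on the full test_string from the question. The function name find_phrases is hypothetical, introduced only for this example.

```python
import re

def find_phrases(substring, text):
    """Return each occurrence of `substring` in `text`, widened to whole words.

    `find_phrases` is an illustrative name, not part of any library.
    """
    # Optional run of word characters before and after the escaped substring.
    pattern = r"(?:\w+)?" + re.escape(substring) + r"(?:\w+)?"
    # The pattern has no capturing groups, so findall returns the full matches.
    return re.findall(pattern, text)

test_string = 'this is an example of the text that I have, and I want to by amplifier and lamp'

print(find_phrases('he text th', test_string))  # ['the text that']
print(find_phrases('amp', test_string))         # ['example', 'amplifier', 'lamp']
```

Note that \w matches only letters, digits and underscores, so a word containing a hyphen or an apostrophe would be cut off at that character.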

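import re

test_string = 'this is an example of the text that I have, and I want to by amplifier and lamp'
substrings = ['he text th', 'amp']

for substring in substrings:
    # \w* extends the match to the rest of the word on either side.
    # No surrounding \s is required, so a word at the very end of the
    # string (such as 'lamp') is matched too.
    # re.escape keeps any regex metacharacters in the substring literal.
    print(re.findall(rf'\w*{re.escape(substring)}\w*', test_string))

OUTPUT:

['the text that']
['example', 'amplifier', 'lamp']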
