
Extract space separated words from a sentence in Python

I have a list of strings, say x1 = ['esk','wild man','eskimo', 'sta','(+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa']. I need to extract the items of x1 that are present in a few sentences.

My sentence is "eskimo lives as a wild man in wild jungle and he stands as a guard". In the sentence, I need to extract the first word, eskimo, and the fifth and sixth words, wild man, because they appear as separate words, exactly as in x1. I should not extract anything for "stands", even though sta is present in it.

def get_name(input_str):
    prod_name = []
    for row in x1:
        if (row.strip().lower() in input_str.lower().strip()) or (len([x for x in input_str.split() if "\b"+x in row])>0):
            prod_name.append(row)
    return list(set(prod_name))

The call get_name("eskimo lives as a wild man in wild jungle and he stands as a guard") returns

[esk, eskimo,wild man,sta]

But the expected output is

[eskimo,wild man]

May I know what has to be changed in the code?
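For context, a minimal sketch of why the function above over-matches. It only uses the sentence from the question; the variable name `sentence` is introduced here for illustration. The `in` operator on strings tests for a substring anywhere, and `"\b"` in a normal (non-raw) string literal is the backspace character rather than a regex word boundary.

```python
sentence = "eskimo lives as a wild man in wild jungle and he stands as a guard"

# Substring membership ignores word boundaries.
print("sta" in sentence)  # True, because "stands" contains "sta"
print("esk" in sentence)  # True, because "eskimo" contains "esk"

# Without the r prefix, "\b" is a single backspace character (0x08).
print("\b" == "\x08")     # True
print(len(r"\b"))         # 2: a backslash followed by "b"
```

So the first condition accepts esk and sta, and the second condition never behaves like a word-boundary test.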

You could simply use str.split(" ") to get a list of all the words in the sentence, and then do the following:

s = "eskimo lives as a wild man in wild jungle and he stands as a guard"

l = s.split(" ")

x1 = ['esk','wild man','eskimo', 'sta','(+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa']
new_x1 = [word.split(" ") for word in x1 if " " in word] + [word for word in x1 if " " not in word]

ans = []

for x in new_x1:
    if type(x) == str:
        if x in l:
            ans.append(x)
    else:
        temp = ""
        for i in x:
            temp += i + " "
        temp = temp[:-1]
        if all(sub_x in l for sub_x in x) and temp in s:
            ans.append(temp)

print(ans)

I have a slightly different approach. First, split the input sentence into words, and split each of the phrases you want to check for into its constituent words. Then check whether all the words of a phrase are present in the sentence.

x1 = ['esk','wild man','eskimo', 'sta','(+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa']
input_sentence = "eskimo lives as a wild man in wild jungle and he stands as a guard"
# Remove common punctuation marks (! . ? ,) from the sentence
input_sentence = input_sentence.replace('!', '').replace('.', '').replace('?', '').replace(',', '')
# Split the input sentence into its component words to check individually
input_words = input_sentence.split()

for ele in x1:
    # Split each element in x1 into words
    ele_words = ele.split()
    # Check if all words are part of the input words
    if all(word in input_words for word in ele_words) and ele in input_sentence:
        print(ele)
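One caveat with the check above: `ele in input_sentence` is still a substring test, so a phrase such as 'wild man' would be reported for a sentence like "a wild mango and a man". A possible variation, sketched below, compares whole runs of consecutive words instead. The function name `find_phrases` is hypothetical, and the sketch assumes punctuation has already been stripped and that matching should be case-insensitive.

```python
def find_phrases(phrases, sentence):
    """Return the phrases that occur in `sentence` as runs of whole words."""
    tokens = sentence.lower().split()
    found = []
    for phrase in phrases:
        parts = phrase.lower().split()
        n = len(parts)
        if n == 0:
            continue
        # Slide a window of n tokens across the sentence and compare lists.
        if any(tokens[i:i + n] == parts for i in range(len(tokens) - n + 1)):
            found.append(phrase)
    return found


x1 = ['esk', 'wild man', 'eskimo', 'sta']
s = "eskimo lives as a wild man in wild jungle and he stands as a guard"
print(find_phrases(x1, s))                                     # ['wild man', 'eskimo']
print(find_phrases(['wild man'], "a wild mango and a man"))    # []
```

Results come back in the order of the phrase list, not in the order they appear in the sentence.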

You can use regular expressions:

import re

x1 = ['esk','wild man','eskimo', 'sta']

my_str = "eskimo lives as a wild man in wild jungle and he stands as a guard"
my_list = []

for words in x1:
    if re.search(r'\b' + words + r'\b', my_str):
        my_list.append(words)
print(my_list)

With the updated list, the string (+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa raises an error when it is used as a regular expression, because it contains unescaped metacharacters such as ( and +. You can wrap the search in a try/except block:

for words in x1:
    try:
        if re.search(r'\b' + words + r'\b', my_str):
            my_list.append(words)
    except re.error:
        # Skip items that are not valid regular expressions
        pass
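As an alternative to catching the error, the items can be passed through re.escape so that they are treated as literal text. This is a sketch of that variation, not the answer's original code; the loop variable `item` is renamed here for clarity.

```python
import re

x1 = ['esk', 'wild man', 'eskimo', 'sta',
      '(+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa']
my_str = "eskimo lives as a wild man in wild jungle and he stands as a guard"

my_list = []
for item in x1:
    # re.escape turns characters such as "(", "+" and "[" into literals,
    # so no pattern compilation error is raised.
    if re.search(r'\b' + re.escape(item) + r'\b', my_str):
        my_list.append(item)

print(my_list)  # ['wild man', 'eskimo']
```

Note that \b only matches between a word character and a non-word character, so an item that begins with "(" is found only when a letter, digit, or underscore comes directly before it. For items like the chemical name, boundaries based on whitespace are likely a better fit.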

You could use a regex with whitespace boundaries on the left, (?<!\S), and on the right, (?!\S), to avoid partial matches, and join all the items from the x1 list into a single alternation.

Then use re.findall to get all the matches:

import re

x1 = ['esk','wild man','eskimo', 'sta','(+)-6-[amina(4-chlora)(1-metha-1h-imidol-5-yl)mhyl]-4-(3-chlora)-1-methyl-2(1h)-quinoa']
s = "eskimo lives as a wild man in wild jungle and he stands as a guard"
pattern = fr"(?<!\S)(?:{'|'.join(re.escape(x) for x in x1)})(?!\S)"

print(re.findall(pattern, s))

Output

['eskimo', 'wild man']
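If this is needed in more than one place, the pattern can be wrapped in a small helper. The sketch below makes two assumptions that are not in the answer: terms are sorted longest first, so that a multi-word term wins over a shorter term starting at the same position, and matching is case-insensitive. The name `extract_terms` is hypothetical.

```python
import re


def extract_terms(terms, sentence):
    """Return every whitespace-delimited occurrence of any term, in sentence order."""
    if not terms:
        return []
    # Longest first, so "wild man" is tried before "wild" at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(rf"(?<!\S)(?:{alternation})(?!\S)", re.IGNORECASE)
    return pattern.findall(sentence)


print(extract_terms(['wild', 'wild man', 'esk', 'eskimo'],
                    "Eskimo lives as a wild man in wild jungle"))
# ['Eskimo', 'wild man', 'wild']
```

Unlike the loop-based answers, this returns the matched text as it appears in the sentence, and a term that occurs twice is returned twice.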

See a Python demo.
