
Search a Python list for matches to a custom list of stem words of varying length

I'm trying to search word-tokenized abstracts for custom stem words using Python. The following code is almost what I want. That is, does any value in stem_words appear at least once in word_tokenized_abstract?

if any(word in stem_words for word in word_tokenized_abstract):
    ...  # do stuff

where...

  • stem_words is a list of strings only
  • word_tokenized_abstract is a list of strings only

I based the above on "one-liner to check if at least one item in list exists in another list?"

My issue is that my stem_words are of different lengths. I tried the following code (a modification of the above), which did not work for me. I've tried a few other modifications, but they either don't work or cause a crash.

if(any(word in stem_words for word[0:len(word)] in word_tokenized_abstract)):
    do stuff

That is, does any value in word_tokenized_abstract begin with any of the values in stem_words?

If it helps, my stem_words = ['pancrea', 'muscul', 'derma', 'ovar']

Thanks! I apologize if this question has been answered previously, but I couldn't find it.

So you want to check whether any string in the first list is contained in any of the strings of the second list.

I'd try this:

any(y.startswith(x) for y in word_tokenized_abstract for x in stem_words)

Explanation: for each stem x in stem_words, check if any string in word_tokenized_abstract starts with x.
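
As a runnable sketch of the prefix check: the stem_words list is the one from the question, while the tokenized abstract is made up purely for illustration. The second form relies on the fact that str.startswith also accepts a tuple of prefixes, which removes the inner loop; it is an alternative spelling, not something the original answer proposed.

```python
stem_words = ['pancrea', 'muscul', 'derma', 'ovar']
# Hypothetical tokenized abstract, invented for this example.
word_tokenized_abstract = ['the', 'pancreatic', 'tissue', 'was', 'examined']

# Nested-generator form: True if any token starts with any stem.
found_nested = any(
    y.startswith(x) for y in word_tokenized_abstract for x in stem_words
)

# Alternative form: str.startswith accepts a tuple of prefixes.
stems = tuple(stem_words)
found_tuple = any(word.startswith(stems) for word in word_tokenized_abstract)

print(found_nested, found_tuple)  # True True
```

Both expressions short-circuit on the first match, so they stop scanning as soon as one token with a matching stem is found.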

If you just want the stem to be a substring of the word, then use:

any(x in y for y in word_tokenized_abstract for x in stem_words)
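
To make the difference between the two checks concrete, here is a small sketch that collects the matching tokens instead of only returning True or False. The token list is hypothetical and chosen so that one word contains a stem without starting with it.

```python
stem_words = ['pancrea', 'muscul', 'derma', 'ovar']
# Hypothetical tokens, invented for this example.
tokens = ['intramuscular', 'injection', 'ovarian']

stems = tuple(stem_words)

# Tokens that begin with one of the stems.
prefix_hits = [w for w in tokens if w.startswith(stems)]

# Tokens that contain one of the stems anywhere.
substring_hits = [w for w in tokens if any(s in w for s in stem_words)]

print(prefix_hits)     # ['ovarian']
print(substring_hits)  # ['intramuscular', 'ovarian']
```

Here 'intramuscular' is only found by the substring check, because 'muscul' appears in the middle of the word rather than at its start.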
