How to search a list effectively in Python?

I have a list of sentences as follows.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

I also have a list of concepts, grouped alphabetically, as follows.

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

I want to identify the concepts in each sentence, in the order in which they appear in the sentence.

So, according to the above example, the output should be:

output = [['data mining','process','patterns','methods','machine learning','database systems'],
          ['data mining','interdisciplinary subfield','information'],
          ['data mining','knowledge discovery','databases process']]

I am using the following code to do it.

output = []
counting = 0  # progress counter

for sentence in sentences:
    sentence_tokens = []
    for item in concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    # order the matched concepts by their position in the sentence
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    counting = counting + 1
    print(counting)
    output.append(sentence_tokens)

However, this is really slow, and according to my timing it would take about half a month to process my dataset.

My concept list is about 13,242,627 entries long (i.e. len(concepts)) and I have about 350,000 sentences (i.e. len(sentences)).

Therefore, I am wondering whether it is possible to search only part of my concept list by exploiting its alphabetical order, or whether it would reduce the time to swap the loops (i.e. for concept in concepts as the outer loop and for sentence in sentences as the inner loop).

At first I thought about implementing some string-searching algorithm, but then I realized that the re module probably already has a good one in it.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']
concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

import re
re_group = "(" + "|".join(map(re.escape, concepts)) + ")"
output = [re.findall(re_group, sentence) for sentence in sentences]
print(output)

(Thanks to @warvariuc for the suggestion to include re.escape and code-golfing with map.)

There is a data structure called a 'trie' or 'prefix tree' that you might find useful ( https://en.wikipedia.org/wiki/Trie ). The solution would iterate through the words in your sentence, matching the longest prefix in the trie at each position and jumping to the next word if there is no prefix match. In the worst case a lookup is O(m), where m is the length of the string to match. This means you will find all the concepts at a cost that is, in the worst case, proportional to the length of the sentence. In comparison, your algorithm's cost is on the order of the length of your concept list, which is a little scary.
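A minimal sketch of that idea, assuming a word-level trie built over the concept phrases; the names TrieNode, build_trie and find_concepts are illustrative, not from any library. The trie is built once, then each sentence is scanned word by word, keeping the longest concept that starts at the current word:

class TrieNode:
    __slots__ = ("children", "is_concept")
    def __init__(self):
        self.children = {}        # next word -> TrieNode
        self.is_concept = False   # True if the path from the root spells a full concept

def build_trie(concepts):
    root = TrieNode()
    for concept in concepts:
        node = root
        for word in concept.split():
            node = node.children.setdefault(word, TrieNode())
        node.is_concept = True
    return root

def find_concepts(sentence, root):
    words = sentence.split()
    found = []
    i = 0
    while i < len(words):
        node, j, last_match = root, i, None
        # walk the trie for as long as the words keep matching a concept prefix
        while j < len(words) and words[j] in node.children:
            node = node.children[words[j]]
            j += 1
            if node.is_concept:
                last_match = j    # remember the end of the longest concept seen so far
        if last_match is not None:
            found.append(" ".join(words[i:last_match]))
            i = last_match        # continue after the matched concept
        else:
            i += 1                # no concept starts here; move to the next word
    return found

trie = build_trie(concepts)
output = [find_concepts(sentence, trie) for sentence in sentences]
print(output)

Because the trie is built once and each sentence is scanned roughly once, the per-sentence cost no longer depends on the number of concepts.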
