How can I efficiently search a list in Python?

Question

I have a list of sentences, as follows.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

I also have a list of concepts, sorted alphabetically, as follows.

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

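For each sentence, I want to identify the concepts it contains, in the order in which they appear in the sentence.

So, for the example above, the output should be: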

output = [['data mining','process','patterns','methods','machine learning','database systems'],
          ['data mining','interdisciplinary subfield','information'],
          ['data mining','knowledge discovery','databases process']]

I am using the following code to do this.

output = []
counting = 0
for sentence in sentences:
    sentence_tokens = []
    for item in concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    # Order the matched concepts by where they first appear in the sentence.
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    counting = counting + 1
    print(counting)
    output.append(sentence_tokens)

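However, this is really slow: by my timing, it would take about half a month to process my dataset.

My concepts list has 13,242,627 entries (i.e. len(concepts)) and I have about 350,000 sentences (i.e. len(sentences)).

So I am wondering whether I can use the alphabetical order to search only part of my concepts list. Alternatively, would it reduce the time if I searched for each concept in the sentences instead (i.e. concept in concepts as the outer loop and for sentence in sentences as the inner loop)?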

Answer 1

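Initially I thought about implementing some string search algorithm, but then I realized that Python's regular expression module (re) probably already contains a good one.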

import re

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']
concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

# One alternation of all concepts; re.escape protects any regex metacharacters.
re_group = "(" + "|".join(map(re.escape, concepts)) + ")"
# findall scans left to right, so matches come back in sentence order.
output = [re.findall(re_group, sentence) for sentence in sentences]
print(output)

(Thanks to @warvariuc for the suggestions to include re.escape and to code-golf it with map.)
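Two details of the regex approach may matter in practice, so here is a minimal sketch of how they could be handled. It is an assumption on my part that repeated concepts should be reported once (as in the expected output in the question) and that a longer concept should win over a shorter one sharing the same prefix. The names pattern, matches and unique_in_order are only illustrative.

```python
import re

concepts = ['data mining', 'interdisciplinary subfield', 'information']
sentence = ('data mining is an interdisciplinary subfield of computer science '
            'and statistics with an overall goal to extract information from '
            'a data set and transform the information into a comprehensible '
            'structure for further use')

# Regex alternation takes the first alternative that matches, not the longest,
# so put longer concepts first. Compile once and reuse for every sentence.
by_length = sorted(concepts, key=len, reverse=True)
pattern = re.compile("|".join(map(re.escape, by_length)))

# findall reports every occurrence, so 'information' appears twice here.
matches = pattern.findall(sentence)

# dict.fromkeys keeps the first occurrence of each key in insertion order
# (guaranteed for dict since Python 3.7), which removes the repeat.
unique_in_order = list(dict.fromkeys(matches))

print(matches)
print(unique_in_order)
```

Whether a single alternation stays practical with millions of concepts is something to measure on the real data rather than assume.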

Answer 2

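A data structure you may find useful is called a "trie" or "prefix tree" ( https://en.wikipedia.org/wiki/Trie ). The solution would walk through the words of the sentence, taking the longest concept that matches at the current word and skipping to the next word if no prefix matches. In the worst case a lookup is O(m), where m is the length of the string to match. This means you will find all the concepts at a cost of, at worst, the length of the sentence. By comparison, your algorithm costs on the order of the length of the concepts list for every sentence, which is a bit scary.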
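To make the idea concrete, here is a minimal sketch of a word-level trie built from nested dicts. It assumes that concepts and sentences are split on whitespace, that the 'knowledge discovery' spelling is the intended one, and that each concept should be reported once per sentence. The names END, build_trie and find_concepts are made up for this illustration.

```python
END = None  # sentinel key marking a complete concept; words are always str

def build_trie(concepts):
    """Build a trie in which each node maps a word to its child node."""
    root = {}
    for concept in concepts:
        node = root
        for word in concept.split():
            node = node.setdefault(word, {})
        node[END] = concept
    return root

def find_concepts(sentence, trie):
    """Return the concepts in sentence order, longest match first, no repeats."""
    words = sentence.split()
    found, seen = [], set()
    i = 0
    while i < len(words):
        node, match, match_end = trie, None, i
        j = i
        # Follow the trie as far as the sentence allows, remembering the
        # longest complete concept seen along the way.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:
                match, match_end = node[END], j
        if match is None:
            i += 1              # nothing starts here; move to the next word
        else:
            if match not in seen:
                seen.add(match)
                found.append(match)
            i = match_end       # continue after the matched concept
    return found

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']
concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

trie = build_trie(concepts)
output = [find_concepts(sentence, trie) for sentence in sentences]
print(output)
```

Because matching works on whole words and always takes the longest concept, 'process' is not reported a second time inside 'databases process', which a plain substring search would do.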

Answer 1: 3, accepted, 2019-01-06 14:48:00

Answer 2: 2, 2019-01-06 14:46:03