
How to search a list effectively in python?

I have a list of sentences, as follows.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

I also have a list of concepts, sorted alphabetically, as follows.

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

I want to identify the concepts that occur in each of the sentences, in the order in which they appear in that sentence.

So, for the example above, the output should be:

output = [['data mining','process','patterns','methods','machine learning','database systems'],
          ['data mining','interdisciplinary subfield','information'],
          ['data mining','knowledge discovery','databases process']]

I am using the following code to do this.

output = []
counting = 0
for sentence in sentences:
    sentence_tokens = []
    for item in concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    # order the matched concepts by their position in the sentence
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    counting = counting + 1
    print(counting)  # progress counter
    output.append(sentence_tokens)

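However, this is really slow: by my estimate it would take about half a month to process my dataset.

My concepts list has about 13,242,627 entries (i.e. len(concepts)) and I have about 350,000 sentences (i.e. len(sentences)).

So I am wondering whether I can use the alphabetical order to search only part of my concepts list. Or would it reduce the time if I searched for the concepts in the sentences the other way round (i.e. concept in concepts as the outer loop and for sentence in sentences as the inner loop)?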

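Initially I thought about implementing some string search algorithm, but then I realized that the regexp module probably already contains a good one.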

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']
concepts = ['data mining', 'database systems', 'databases process', 'interdisciplinary subfield', 'information', 'knowledge discovery','methods', 'machine learning', 'patterns', 'process']

import re
re_group = "(" + "|".join(map(re.escape, concepts)) + ")"
output = [re.findall(re_group, sentence) for sentence in sentences]
print(output)

(Thanks to @warvariuc for the suggestion to include re.escape and for code-golfing it with map)
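A possible refinement of the regex approach, as a sketch only: compile the pattern once, put longer concepts first in the alternation so a longer concept is preferred over a shorter one starting at the same position, and add `\b` word boundaries so that `process` does not match inside `processes`. The helper name `concepts_in` and the small concepts list (with `knowledge discovery` spelled as in the sentence) are assumptions for illustration. I have not measured how this behaves with millions of alternatives.

```python
import re

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

# Longest alternatives first: the regex engine tries alternatives left to right.
ordered = sorted(concepts, key=len, reverse=True)

# Non-capturing group, so findall returns the whole match for each hit.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")

def concepts_in(sentence):
    """Return the concepts found in `sentence`, in order of appearance."""
    return pattern.findall(sentence)

print(concepts_in('data mining is the analysis step of the knowledge '
                  'discovery in databases process or kdd'))
```

Note that `findall` reports non-overlapping matches, so `process` is not listed separately when it is already consumed as part of `databases process`.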

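A data structure you may find useful is called a "trie" or "prefix tree" ( https://en.wikipedia.org/wiki/Trie ). The solution would iterate over the words in your sentence, matching against the longest prefix match in the trie and jumping to the next word if there is no prefix match. In the worst case, a lookup is O(m), where m is the length of the string to match. That means you will find all concepts at a cost of, at worst, the length of the "sentence". By comparison, your algorithm costs on the order of the length of your concepts list for every sentence, which is a little scary.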
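Here is a minimal sketch of that idea, assuming concepts are matched on whole, whitespace-separated words (so, unlike `str.find`, `process` will not match inside `processes`). The names `build_trie`, `find_concepts` and the `_END` marker are made up for this example, the trie is just nested dicts, and repeated concepts are reported once to match the expected output in the question.

```python
_END = object()  # marker key: "a complete concept ends at this node"

def build_trie(concepts):
    """Build a word-level trie where each node maps a word to a child node."""
    root = {}
    for concept in concepts:
        node = root
        for word in concept.split():
            node = node.setdefault(word, {})
        node[_END] = concept
    return root

def find_concepts(sentence, trie):
    """Return the concepts in `sentence` in order of first appearance."""
    words = sentence.split()
    found, seen = [], set()
    i = 0
    while i < len(words):
        node, j = trie, i
        longest, longest_end = None, i
        # Walk the trie as far as the following words allow,
        # remembering the longest complete concept seen so far.
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if _END in node:
                longest, longest_end = node[_END], j
        if longest is None:
            i += 1  # no concept starts here: jump to the next word
        else:
            if longest not in seen:
                seen.add(longest)
                found.append(longest)
            i = longest_end  # continue after the matched concept
    return found

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']
trie = build_trie(concepts)
print(find_concepts('data mining is the analysis step of the knowledge '
                    'discovery in databases process or kdd', trie))
```

The trie is built once, and each sentence is then scanned word by word, so the work per sentence depends on the sentence length and the length of the longest concept rather than on the number of concepts. Whether a dict-based trie with about 13 million concepts fits in memory is something you would have to check on your data.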
