How to search a list effectively in Python?

Question

I have a list of sentences as follows.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

I also have a list of concepts, grouped alphabetically, as follows.

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

I want to identify the concepts in each sentence, in the order in which they appear in that sentence.

So, for the example above, the output should be:

output = [['data mining','process','patterns','methods','machine learning','database systems'],
          ['data mining','interdisciplinary subfield','information'],
          ['data mining','knowledge discovery','databases process']]

I am using the following code to do it.

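output = []
counting = 0
for sentence in sentences:
    # Collect (position, concept) pairs for every concept found in the sentence
    sentence_tokens = []
    for item in concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    # Sort by position so the concepts come out in sentence order
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    counting = counting + 1
    print(counting)  # progress counter
    output.append(sentence_tokens)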

However, this is really slow. By my estimate it would take about half a month to process my dataset.

My concept list has 13,242,627 entries (i.e. len(concepts)) and I have about 350,000 sentences (i.e. len(sentences)).

Therefore, I am wondering whether I can use the alphabetical order to search only part of my concept list. Or would it reduce the time if I swapped the loops and searched for each concept across the sentences (i.e. for concept in concepts as the outer loop and for sentence in sentences as the inner loop)?

Answer 1

At first I thought about implementing a string-searching algorithm myself, but then I realized that the re (regular expression) module probably already has a good one built in.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']
concepts = ['data mining', 'database systems', 'databases process', 'interdisciplinary subfield', 'information', 'knowledge discovery','methods', 'machine learning', 'patterns', 'process']

import re
re_group = "(" + "|".join(map(re.escape, concepts)) + ")"
output = [re.findall(re_group, sentence) for sentence in sentences]
print(output)

(Thanks to @warvariuc for the suggestion to include re.escape and code-golfing with map)
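One caveat worth illustrating: Python's regex alternation takes the first alternative that matches at a position, not the longest one, and without word boundaries a concept can match inside a longer word. The sketch below is only an illustration, not part of the original answer. It assumes a hypothetical concept list in which 'data' appears alongside 'data mining', sorts the alternatives longest first, and wraps them in \b.

```python
import re

# Hypothetical concept list: 'data' is a prefix of 'data mining' and of the word 'databases'
concepts = ['data', 'data mining', 'process', 'databases process']
sentence = ('data mining is the analysis step of the knowledge discovery '
            'in databases process or kdd')

# Pattern built as in the answer above: alternatives are tried in list order
naive_pattern = "(" + "|".join(map(re.escape, concepts)) + ")"
naive = re.findall(naive_pattern, sentence)

# Longest first, so 'data mining' is tried before 'data';
# \b keeps 'data' from matching inside 'databases'
ordered = sorted(concepts, key=len, reverse=True)
bounded_pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b")
bounded = bounded_pattern.findall(sentence)

print(naive)
print(bounded)
```

Whether this matters depends on the concept list. Note also that findall reports a concept once per occurrence, so a concept that appears twice in a sentence is listed twice.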

Answer 2

There is a data structure called a 'trie' or 'prefix tree' that you might find useful ( https://en.wikipedia.org/wiki/Trie ). The solution would iterate through the words in your sentence, at each position taking the longest prefix match in the trie and moving on to the next word if there is no prefix match. In the worst case, a single lookup is O(m), where m is the length of the string to match. This means you find all the concepts in a sentence at a cost that depends on the length of the sentence rather than on the number of concepts. In comparison, your algorithm costs on the order of the length of your concept list for every sentence, which is a little scary.
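A minimal sketch of that idea, assuming a word-level trie built from nested dicts. The names END, build_trie and find_concepts are made up for this illustration, and the sketch spells 'knowledge discovery' correctly in its own concept list. It reports each concept once per sentence, to match the output format in the question.

```python
END = ""  # sentinel key marking a complete concept; str.split() never yields ""

def build_trie(concepts):
    """Build a trie keyed by words; node[END] holds the full concept string."""
    root = {}
    for concept in concepts:
        node = root
        for word in concept.split():
            node = node.setdefault(word, {})
        node[END] = concept
    return root

def find_concepts(sentence, trie):
    """Return concepts in order of first appearance, longest match at each position."""
    words = sentence.split()
    found, seen = [], set()
    i = 0
    while i < len(words):
        node, match, match_end = trie, None, i
        j = i
        # Walk down the trie for as long as the next word continues some concept
        while j < len(words) and words[j] in node:
            node = node[words[j]]
            j += 1
            if END in node:  # remember the longest complete concept so far
                match, match_end = node[END], j
        if match is None:
            i += 1           # no concept starts at this word
        else:
            if match not in seen:
                seen.add(match)
                found.append(match)
            i = match_end    # continue after the matched concept
    return found

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd',
]
concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

trie = build_trie(concepts)
output = [find_concepts(sentence, trie) for sentence in sentences]
print(output)
```

The trie is built once, and each sentence is then scanned word by word, so the work per sentence no longer grows with the number of concepts. A trie of plain dicts holding about 13 million concepts may need a lot of memory, so treat this as a starting point rather than a tuned implementation.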

solution1
3 ACCEPTED 2019-01-06 14:48:00

solution2
2 2019-01-06 14:46:03