How to efficiently identify substrings in the order of the string in Python
This is related to my previous question: How to identify substrings in the order of the string?
For a given set of sentences and a set of selected_concepts, I want to identify the selected_concepts in the order in which they occur in each sentence.
The code below does this correctly.
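output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        # str.find returns -1 when the concept does not occur in the sentence
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    # order the matched concepts by the position of their first occurrence
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)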
However, my real dataset has 13242627 selected_concepts and 1234952 sentences. I would therefore like to know whether this code can be optimised to run in less time. As I understand it, this is O(n^2), because every concept is searched for in every sentence. I am concerned only with time complexity (space complexity is not a problem for me).
A sample is given below.
sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']
selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']
output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]
What about using a pre-compiled regex?
Here is an example:
import re

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = [
    'machine learning',
    'patterns',
    'data mining',
    'methods',
    'database systems',
    'interdisciplinary subfield',
    'knowledege discovery',  # spelling error: should be 'knowledge'
    'databases process',
    'information',
    'process']

# escape each concept so it is matched literally, then join them into one alternation
re_concepts = [re.escape(t) for t in selected_concepts]
find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

# findall scans left to right, so matches come back in sentence order
output = [find_all_concepts(sentence) for sentence in sentences]
You get:
[['data mining',
  'process',
  'patterns',
  'methods',
  'machine learning',
  'database systems'],
 ['data mining', 'interdisciplinary subfield', 'information', 'information'],
 ['data mining', 'databases process']]
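Two differences from the loop in the question are worth noting. findall reports every occurrence, which is why 'information' appears twice above, and an alternation accepts the first alternative that matches at a given position, so a shorter concept listed before a longer one that starts with it would win. The following is only a sketch of one way to handle both, not a tuned solution for millions of concepts; the helper names build_finder and concepts_in_order are made up for illustration.

```python
import re

def build_finder(concepts):
    """Compile one alternation pattern (hypothetical helper, not from the answer)."""
    # Remove duplicate concepts, then put longer ones first: alternatives are
    # tried left to right, so 'data mining' must come before a shorter 'data'.
    ordered = sorted(set(concepts), key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(c) for c in ordered))
    return pattern.findall

def concepts_in_order(sentence, find_all):
    """Return each matched concept once, in order of first appearance."""
    # dict.fromkeys preserves insertion order (Python 3.7+) and drops repeats.
    return list(dict.fromkeys(find_all(sentence)))

selected_concepts = ['machine learning', 'patterns', 'data mining', 'methods',
                     'database systems', 'interdisciplinary subfield',
                     'databases process', 'information', 'process']
sentences = [
    'data mining is an interdisciplinary subfield of computer science and '
    'statistics with an overall goal to extract information from a data set '
    'and transform the information into a comprehensible structure for further use']

find_all = build_finder(selected_concepts)
output = [concepts_in_order(s, find_all) for s in sentences]
print(output)
```

Matches are still non-overlapping, so a concept that occurs only inside a longer match (for example 'process' inside 'databases process') is not reported separately, unlike with str.find.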