
How to efficiently identify substrings in the order of the string in Python

This is related to my previous question: How to identify substrings in the order of the string?

For a given set of sentences and a set of selected_concepts, I want to identify the selected_concepts in each sentence, in the order in which they appear in that sentence.

The code below does this correctly.

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)

However, my real dataset has 13242627 selected_concepts and 1234952 sentences. Therefore, I would like to know whether there is any way to optimise this code so that it runs in less time. As I understand it, this is O(n^2). I am mainly concerned about time complexity (space complexity is not a problem for me).
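To make the concern concrete: the loop above calls sentence.find once per (sentence, concept) pair, so the work grows with the product of the two collection sizes rather than with a single n. The following is only a rough back-of-the-envelope count using the dataset sizes quoted above; the variable names are illustrative and do not come from the original code.

```python
# Dataset sizes quoted in the question.
num_concepts = 13_242_627
num_sentences = 1_234_952

# The nested loop calls sentence.find() once per (sentence, concept) pair,
# and each call may itself scan the whole sentence.
find_calls = num_sentences * num_concepts
print(f"{find_calls:.2e} calls to str.find")
```

That is on the order of 10^13 substring searches, which is why the per-concept loop does not scale to this dataset.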

A sample is given below.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']

output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]

What about using a precompiled regex?

Here is an example:

import re

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

selected_concepts = [
    'machine learning',
    'patterns',
    'data mining',
    'methods',
    'database systems',
    'interdisciplinary subfield',
    'knowledege discovery',  # misspelled in the question: should be 'knowledge discovery'
    'databases process',
    'information',
    'process']

re_concepts = [re.escape(t) for t in selected_concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

output = [find_all_concepts(sentence) for sentence in sentences]

You get:

[['data mining',
  'process',
  'patterns',
  'methods',
  'machine learning',
  'database systems'],
 ['data mining', 'interdisciplinary subfield', 'information', 'information'],
 ['data mining', 'databases process']]
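The output above repeats 'information' because findall reports every match, whereas the loop in the question lists each concept at most once per sentence. A minimal sketch of one way to reconcile the two, assuming that only the first occurrence of each concept is wanted: sort the alternatives longest first, so a concept is not shadowed by a shorter concept that is a prefix of it, and drop repeats while keeping order. The function name find_concepts_in_order and the sample data are hypothetical, introduced only for illustration.

```python
import re


def find_concepts_in_order(sentences, selected_concepts):
    """Return, per sentence, each matched concept once, in order of first match."""
    # Longest alternatives first: at a given position the regex engine takes the
    # first alternative that matches, so 'data mining' must precede 'data'.
    alternatives = sorted(set(selected_concepts), key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(c) for c in alternatives))

    output = []
    for sentence in sentences:
        # dict.fromkeys preserves insertion order (Python 3.7+) and drops repeats.
        output.append(list(dict.fromkeys(pattern.findall(sentence))))
    return output


# Hypothetical sample with a prefix clash ('data' vs 'data mining').
sample_sentences = [
    'data mining extracts information and more information from a data set',
]
sample_concepts = ['data', 'data mining', 'information']

print(find_concepts_in_order(sample_sentences, sample_concepts))
# prints [['data mining', 'information', 'data']]
```

Note that regex matches do not overlap, so a concept that lies entirely inside an earlier match (for example 'process' inside 'databases process') is not reported separately, unlike with the str.find loop in the question. Whether a single alternation stays practical with millions of concepts is untested here.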
