How to identify substrings in the order of the string?

I have a list of sentences as below.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']

I also have a set of selected concepts.

selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']

Now I want to select the concepts in selected_concepts from each sentence, in the order in which they appear in the sentence.

That is, my output should be as follows.

output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]

I could extract the concepts in the sentences as follows.

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        if item in sentence:
            sentence_tokens.append(item)
    output.append(sentence_tokens)

However, I am having trouble ordering the extracted concepts according to their order in the sentence. Is there an easy way to do this in Python?

One way to do it is to use the .find() method to find the position of each substring and then sort by that value. For example:

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)
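
The same idea can be written more compactly by passing sentence.find as the sort key. This is only a sketch: the helper name concepts_in_order is made up for illustration, and the data is a shortened version of the question's lists.

```python
def concepts_in_order(sentence, concepts):
    """Return the concepts that occur in sentence, ordered by first occurrence."""
    found = [c for c in concepts if c in sentence]
    # str.find returns the index of the first occurrence, so it can serve as the sort key.
    return sorted(found, key=sentence.find)


sentences = [
    'data mining is the process of discovering patterns in large data sets',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd',
]
selected_concepts = ['patterns', 'data mining', 'databases process', 'process']

output = [concepts_in_order(s, selected_concepts) for s in sentences]
print(output)
```

Because sorted() is stable, concepts that start at the same position keep the relative order they have in the concepts list.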

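You could use .find() and .insert() instead. Note that list.insert() expects a list index rather than a character offset, so the offsets found so far are kept in a parallel sorted list and bisect is used to locate the insertion point. Something like:

import bisect

output = []
for sentence in sentences:
    positions = []        # character offsets of the concepts found so far, kept sorted
    sentence_tokens = []  # the concepts themselves, in the same order as positions
    for item in selected_concepts:
        pos = sentence.find(item)
        if pos != -1:
            # Find the list index at which this offset belongs.
            i = bisect.bisect_right(positions, pos)
            positions.insert(i, pos)
            sentence_tokens.insert(i, item)
    output.append(sentence_tokens)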

The only problem would be overlap in selected_concepts, for example 'databases process' and 'process'. Both are found, and concepts that start at the same position need a tie-breaker. One option is to combine the position with the concept's index in selected_concepts, so that ties are resolved by the order in selected_concepts:

output = []
selected_concepts_multiplier = len(selected_concepts)
for sentence in sentences:
    keyed_tokens = []
    for k, item in enumerate(selected_concepts):
        pos = sentence.find(item)
        if pos != -1:
            # Composite key: position in the sentence first, index in selected_concepts second.
            keyed_tokens.append(((selected_concepts_multiplier * pos) + k, item))
    sentence_tokens = [item for _, item in sorted(keyed_tokens)]
    output.append(sentence_tokens)
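
If overlapping concepts such as 'databases process' and 'process' should not both be reported, one possible approach is to record the span of each match and drop any concept that starts inside a span already kept. The sketch below is an assumption about what is wanted, not part of the answer above: the function name non_overlapping_concepts is hypothetical, and only the first occurrence of each concept is considered.

```python
def non_overlapping_concepts(sentence, concepts):
    """Return concepts in sentence order, skipping ones that overlap an earlier match."""
    spans = []
    for item in concepts:
        pos = sentence.find(item)  # first occurrence only
        if pos != -1:
            spans.append((pos, pos + len(item), item))
    # Earliest start first; for equal starts, prefer the longer concept.
    spans.sort(key=lambda span: (span[0], -span[1]))

    kept = []
    last_end = 0
    for start, end, item in spans:
        if start >= last_end:  # does not overlap the previously kept concept
            kept.append(item)
            last_end = end
    return kept


sentence = 'data mining is the analysis step of the knowledge discovery in databases process or kdd'
concepts = ['process', 'databases process', 'data mining']
result = non_overlapping_concepts(sentence, concepts)
print(result)
```

With this input, 'process' is dropped because its first occurrence lies inside 'databases process'.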

There is a built-in operator called "in". It can check whether one string occurs inside another string.

sentences = [
'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
'data mining is the analysis step of the knowledge discovery in databases process or kdd'
]

selected_concepts = [
    'machine learning',
    'patterns',
    'data mining',
    'methods',
    'database systems',
    'interdisciplinary subfield',
    'knowledege discovery',
    'databases process',
    'information',
    'process'
]

output = []  # prepare the output
for s in sentences:  # check each sentence
    output.append(list())  # add a list for this sentence, so output becomes a list of lists
    for c in selected_concepts:  # check every selected concept
        if c in s:  # if the selected concept occurs in the sentence
            output[-1].append(c)  # add it to the last list in output

print(output)

You can use the fact that regular expressions search text in order, left to right, and that matches do not overlap:

import re
concept_re = re.compile(r'\b(?:' +
    '|'.join(re.escape(concept) for concept in selected_concepts) + r')\b')
output = [match
        for sentence in sentences for match in concept_re.findall(sentence)]

output
# => ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems', 'data mining', 'interdisciplinary subfield', 'information', 'information', 'data mining', 'databases process']

This should also be faster than searching for each concept individually, since the matching algorithm that regexps use is more efficient for this and is implemented entirely in low-level code.

There is one difference, though: if a concept is repeated within one sentence, your code reports it only once per sentence, while this code outputs every occurrence. If this difference matters, it is rather easy to dedupe a list.
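
As a rough sketch of that deduplication, the matches can be collected per sentence and passed through dict.fromkeys, which keeps the first occurrence of each key in insertion order (guaranteed from Python 3.7). The shortened sentences and concept list below are illustrative stand-ins for the question's data.

```python
import re

sentences = [
    'data mining is an interdisciplinary subfield of computer science with a goal '
    'to extract information from a data set and transform the information into a '
    'comprehensible structure',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd',
]
selected_concepts = ['data mining', 'interdisciplinary subfield',
                     'databases process', 'information', 'process']

concept_re = re.compile(
    r'\b(?:' + '|'.join(re.escape(c) for c in selected_concepts) + r')\b')

# One list per sentence; dict.fromkeys drops repeats while preserving order.
output = [list(dict.fromkeys(concept_re.findall(s))) for s in sentences]
print(output)
```

This also produces one list per sentence, which is the nested shape asked for in the question.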

Here I used the simple re.findall method. If the pattern matches in the string, re.findall returns the list of matches; otherwise it returns an empty list. Based on that, I wrote this code:

import re

selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']

output = []

for sentence in sentences:
    matched_concepts = []
    for selected_concept in selected_concepts:
        if re.findall(selected_concept, sentence):
            matched_concepts.append(selected_concept)
    output.append(matched_concepts)
print(output)

Output:

[['machine learning', 'patterns', 'data mining', 'methods', 'database systems', 'process'], ['data mining', 'interdisciplinary subfield', 'information'], ['data mining', 'databases process', 'process']]
