
How to efficiently search list elements in a string in python

I have a list of concepts (concepts) and a list of sentences (sentences) as follows.

concepts = [['natural language processing', 'text mining', 'texts', 'nlp'], ['advanced data mining', 'data mining', 'data'], ['discourse analysis', 'learning analytics', 'mooc']]


sentences = ['data mining and text mining', 'nlp is mainly used by discourse analysis community', 'data mining in python is fun', 'mooc data analysis involves texts', 'data and data mining are both very interesting']

In short, I want to find the concepts in the sentences. More specifically, given a list in concepts (e.g., ['natural language processing', 'text mining', 'texts', 'nlp']), I want to identify these concepts in the sentences and replace them with its first element (i.e. natural language processing).

Example: So, if we consider the sentence data mining and text mining; the result should be advanced data mining and natural language processing (because the first elements of data mining and text mining are advanced data mining and natural language processing, respectively).

The result of the above dummy data should be:

['advanced data mining and natural language processing', 'natural language processing is mainly used by discourse analysis community', 'advanced data mining in python is fun', 'discourse analysis advanced data mining analysis involves natural language processing', 'advanced data mining and advanced data mining are both very interesting']

I am currently doing this using regex as follows:

import re

concepts_re = []
for terms in concepts:
    terms_re = "|".join(re.escape(term) for term in terms)
    concepts_re.append(terms_re)

sentences_mapping = []

for sentence in sentences:
    for terms in concepts:
        if len(terms) > 1:
            for item in terms:
                if item in sentence:
                    sentence = re.sub(concepts_re[concepts.index(terms)], terms[0], sentence)
    sentences_mapping.append(sentence)

In my real data set, I have about 8 million concepts. Hence, my approach is very inefficient, taking about 5 minutes to process one sentence. I would like to know if there is any efficient way of doing this in python.

For those who would like to process a long list of concepts to measure the time, I have attached a longer list here: https://drive.google.com/file/d/1OsggJTDZx67PGH4LupXIkCTObla0gDnX/view?usp=sharing

I am happy to provide more details if needed.

The solution provided below has approximately O(n) runtime complexity, where n is the number of tokens in each sentence.

For 5 million sentences and your concepts.txt it performs the required operations in ~30 seconds; see the basic test in section 3.

When it comes to space complexity, you will have to keep a nested dictionary structure (let's simplify it this way for now); say it's O(c*u), where u are unique tokens for concepts of a certain length (token-wise), while c is the length of a concept.

It's hard to pinpoint the exact complexity, but it goes pretty similar to this (for your example data and the one you provided [concepts.txt] these are pretty accurate, but we will get to the gory details as we go through the implementation).

I assume you can split your concepts and sentences on whitespace; if that's not the case, I would advise you to take a look at spaCy, which provides a smarter way to tokenize your data.

1. Introduction

Let's take your example:

concepts = [
    ["natural language processing", "text mining", "texts", "nlp"],
    ["advanced data mining", "data mining", "data"],
    ["discourse analysis", "learning analytics", "mooc"],
]

As you said, each element from concepts has to be mapped to the first one, so, in Pythonish, it would go roughly along these lines:

for concept in concepts:
    concept[1:] = concept[0]
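The intended mapping can also be sketched as a flat dictionary (an illustrative sketch only; `mapping` is not part of the solution below):

```python
concepts = [
    ["natural language processing", "text mining", "texts", "nlp"],
    ["advanced data mining", "data mining", "data"],
    ["discourse analysis", "learning analytics", "mooc"],
]

# Each non-first element maps to the first element of its group
mapping = {alias: concept[0] for concept in concepts for alias in concept[1:]}
print(mapping["nlp"])          # natural language processing
print(mapping["data mining"])  # advanced data mining
```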

The task would be easy if all the concepts had a token length equal to 1 (which is not the case here) and were unique. Let's focus on the second case and one particular (a little modified) example of a concept to see my point:

["advanced data mining", "data something", "data"]

Here data would be mapped to advanced data mining, BUT data something, which contains data, should be mapped before it. If I understand you correctly, you would want this sentence:

"Here is data something and another data"

To be mapped into:

"Here is advanced data mining and another advanced data mining"

Instead of the naive approach:

"Here is advanced data mining something and another advanced data mining"

See, in the second example we only mapped data, not data something.

To prioritize data something (and others fitting this pattern) I have used an array structure filled with dictionaries, where concepts being earlier in the array are those which are longer token-wise.

To continue our example, such an array would look like this:

structure = [
    {"data": {"something": "advanced data mining"}},
    {"data": "advanced data mining"},
]

Notice that if we go through tokens in this order (e.g. first going through the first dictionary with consecutive tokens and, if no match was found, going to the second dictionary and so on), we will get the longest concepts first.
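To make the ordering concrete, here is how a lookup would resolve against that structure (hand-walked; this is not the full traversal code):

```python
structure = [
    {"data": {"something": "advanced data mining"}},
    {"data": "advanced data mining"},
]

# "data something" resolves in the first (longer-concepts) dictionary...
assert structure[0]["data"]["something"] == "advanced data mining"
# ...while a lone "data" falls through to the second one
assert structure[1]["data"] == "advanced data mining"
```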

2. Code

Okay, I hope you've got the basic idea (if not, post a comment below and I will try to explain the unclear parts in more detail).

Disclaimer: I'm not particularly proud of this code, but it gets the job done, and I suppose it could be worse.

2.1 Hierarchical dictionary

First, let's get the longest concept token-wise (excluding the first element, as it's our target and we never have to change it):

def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])
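For the example concepts this returns 2, since 'text mining', 'data mining' and 'learning analytics' are the longest non-first elements (the function is repeated here so the snippet is self-contained):

```python
from typing import List

def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])

concepts = [
    ["natural language processing", "text mining", "texts", "nlp"],
    ["advanced data mining", "data mining", "data"],
    ["discourse analysis", "learning analytics", "mooc"],
]
print(get_longest(concepts))  # 2
```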

Using this information, we can initialize our structure by creating as many dictionaries as there are different lengths of concepts (in the example above it would be 2, so it would work for all of your data; concepts of any length would do):

def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]
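For instance, with longest equal to 2 this gives two (length, dictionary) pairs, longest first:

```python
def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]

print(init_hierarchical_dictionaries(2))  # [(1, {}), (0, {})]
```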

Notice I'm adding the length of each concept to the array; IMO it's easier that way when it comes to traversing, though you could go without it after some changes to the implementation.

Now, having those helper functions, we can create the structure from the list of concepts:

def create_hierarchical_dictionaries(concepts: List[List[str]]):
    # Initialization
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)

    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Initialize dictionary; get the one with corresponding length.
            # The longer, the earlier it is in the hierarchy
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # All of the tokens except the last one are another dictionary mapping to
            # the next token in concept.
            for token in tokens[:-1]:
                # setdefault keeps an existing nested dictionary intact when two
                # concepts of the same length share a prefix (e.g. "data mining"
                # and "data science")
                current_dictionary = current_dictionary.setdefault(token, {})

            # Last token is mapped to the first concept
            current_dictionary[tokens[-1]] = concept[0].split()

    return hierarchical_dictionaries

This function will create our hierarchical dictionary; see the comments in the source code for some explanation. You may want to create a custom class keeping this thing; it should be easier to use that way.

This is exactly the same object as described in 1. Introduction.
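Putting section 2.1 together on the example data, the resulting structure looks like this (a self-contained sketch repeating the functions above; `setdefault` is used so shared prefixes are preserved):

```python
from typing import List

def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])

def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]

def create_hierarchical_dictionaries(concepts: List[List[str]]):
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)
    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            for token in tokens[:-1]:
                current_dictionary = current_dictionary.setdefault(token, {})
            current_dictionary[tokens[-1]] = concept[0].split()
    return hierarchical_dictionaries

concepts = [
    ["natural language processing", "text mining", "texts", "nlp"],
    ["advanced data mining", "data mining", "data"],
    ["discourse analysis", "learning analytics", "mooc"],
]

structure = create_hierarchical_dictionaries(concepts)
# Two-token concepts live in the first dictionary...
print(structure[0][1]["data"]["mining"])  # ['advanced', 'data', 'mining']
# ...one-token concepts in the second
print(structure[1][1]["nlp"])  # ['natural', 'language', 'processing']
```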

2.2 Traversing the dictionaries

This part is much harder, but this time let's use a top-bottom approach. We will start easy:

def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)

Provided hierarchical dictionaries, it creates a generator which transforms each sentence according to the concept mapping.

Now the traverse function:

def traverse(sentence: str, hierarchical_dictionaries):
    # Get all tokens in the sentence
    tokens = sentence.split()
    output_sentence = []
    # Initialize index to the first token
    index = 0
    # Until any tokens left to check for concepts
    while index < len(tokens):
        # Iterate over hierarchical dictionaries (elements of the array)
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            # New index is returned based on match and token-wise length of concept
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            # Concept was found in current hierarchical_dictionary_tuple, let's add it
            # to output
            if concept is not None:
                output_sentence.extend(concept)
                # No need to check other hierarchical dictionaries for matching concept
                break
        # Token (and its next tokens) do not match any concept, return original
        else:
            output_sentence.append(tokens[index])
        # Increment index in order to move to the next token
        index += 1

    # Join list of tokens into a sentence
    return " ".join(output_sentence)

Once again, if you're not sure what's going on, post a comment.

Using this approach, pessimistically, we will perform O(n*c!) checks, where n is the number of tokens in the sentence and c is the token-wise length of the longest concept (note the factorial). This case is extremely unlikely to happen in practice: every token in the sentence would have to almost perfectly fit the longest concept, plus all shorter concepts would have to be prefixes of the shortest one (like super data mining, super data and data).

It would be much closer to O(n) for any practical problem; as I said before, using the data you provided in the .txt file, it's O(3*n) worst case, usually O(2*n).

2.3 Traversing each dictionary

def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Get the level of nested dictionaries and initial dictionary
    length, current_dictionary = hierarchical_dictionary_tuple
    # inner_index will loop through tokens until match or no match was found
    inner_index = index
    for _ in range(length):
        # Get next nested dictionary and move inner_index to the next token
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        # If no match was found in any level of dictionary
        # Return current index in sentence and None representing lack of concept.
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None

    # If everything went fine through all nested dictionaries, check whether
    # last token corresponds to concept
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    # If so, return inner_index (we have moved length tokens, so we have to update it)
    return inner_index, concept
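A hand-rolled call illustrates the contract (using a tiny manually-built tuple rather than the full structure; the function is repeated so the snippet runs on its own):

```python
def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    length, current_dictionary = hierarchical_dictionary_tuple
    inner_index = index
    for _ in range(length):
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None

    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    return inner_index, concept

# One nesting level: "data mining" -> "advanced data mining"
dictionary_tuple = (1, {"data": {"mining": ["advanced", "data", "mining"]}})
tokens = ["data", "mining", "here"]

print(traverse_through_dictionary(0, tokens, dictionary_tuple))
# (1, ['advanced', 'data', 'mining'])  -- match: index advanced past the concept
print(traverse_through_dictionary(2, tokens, dictionary_tuple))
# (2, None)  -- no match: original index returned
```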

This constitutes the "meat" of my solution.

3. Results

Now, for brevity, the whole source code is provided below (concepts.txt is the file you provided):

import ast
import time
from typing import List


def get_longest(concepts: List[List[str]]):
    return max(len(text.split()) for concept in concepts for text in concept[1:])


def init_hierarchical_dictionaries(longest: int):
    return [(length, {}) for length in reversed(range(longest))]


def create_hierarchical_dictionaries(concepts: List[List[str]]):
    # Initialization
    longest = get_longest(concepts)
    hierarchical_dictionaries = init_hierarchical_dictionaries(longest)

    for concept in concepts:
        for text in concept[1:]:
            tokens = text.split()
            # Initialize dictionary; get the one with corresponding length.
            # The longer, the earlier it is in the hierarchy
            current_dictionary = hierarchical_dictionaries[longest - len(tokens)][1]
            # All of the tokens except the last one are another dictionary mapping to
            # the next token in concept.
            for token in tokens[:-1]:
                # setdefault keeps an existing nested dictionary intact when two
                # concepts of the same length share a prefix (e.g. "data mining"
                # and "data science")
                current_dictionary = current_dictionary.setdefault(token, {})

            # Last token is mapped to the first concept
            current_dictionary[tokens[-1]] = concept[0].split()

    return hierarchical_dictionaries


def traverse_through_dictionary(index, tokens, hierarchical_dictionary_tuple):
    # Get the level of nested dictionaries and initial dictionary
    length, current_dictionary = hierarchical_dictionary_tuple
    # inner_index will loop through tokens until match or no match was found
    inner_index = index
    for _ in range(length):
        # Get next nested dictionary and move inner_index to the next token
        current_dictionary = current_dictionary.get(tokens[inner_index])
        inner_index += 1
        # If no match was found in any level of dictionary
        # Return current index in sentence and None representing lack of concept.
        if current_dictionary is None or inner_index >= len(tokens):
            return index, None

    # If everything went fine through all nested dictionaries, check whether
    # last token corresponds to concept
    concept = current_dictionary.get(tokens[inner_index])
    if concept is None:
        return index, None
    # If so, return inner_index (we have moved length tokens, so we have to update it)
    return inner_index, concept


def traverse(sentence: str, hierarchical_dictionaries):
    # Get all tokens in the sentence
    tokens = sentence.split()
    output_sentence = []
    # Initialize index to the first token
    index = 0
    # Until any tokens left to check for concepts
    while index < len(tokens):
        # Iterate over hierarchical dictionaries (elements of the array)
        for hierarchical_dictionary_tuple in hierarchical_dictionaries:
            # New index is returned based on match and token-wise length of concept
            index, concept = traverse_through_dictionary(
                index, tokens, hierarchical_dictionary_tuple
            )
            # Concept was found in current hierarchical_dictionary_tuple, let's add it
            # to output
            if concept is not None:
                output_sentence.extend(concept)
                # No need to check other hierarchical dictionaries for matching concept
                break
        # Token (and its next tokens) do not match any concept, return original
        else:
            output_sentence.append(tokens[index])
        # Increment index in order to move to the next token
        index += 1

    # Join list of tokens into a sentence
    return " ".join(output_sentence)


def embed_sentences(sentences: List[str], hierarchical_dictionaries):
    return (traverse(sentence, hierarchical_dictionaries) for sentence in sentences)


def sanity_check():
    concepts = [
        ["natural language processing", "text mining", "texts", "nlp"],
        ["advanced data mining", "data mining", "data"],
        ["discourse analysis", "learning analytics", "mooc"],
    ]
    sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]

    targets = [
        "advanced data mining and natural language processing",
        "natural language processing is mainly used by discourse analysis community",
        "advanced data mining in python is fun",
        "discourse analysis advanced data mining analysis involves natural language processing",
        "advanced data mining and advanced data mining are both very interesting",
    ]

    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)

    results = list(embed_sentences(sentences, hierarchical_dictionaries))
    if results == targets:
        print("Correct results")
    else:
        print("Incorrect results")


def speed_check():
    with open("./concepts.txt") as f:
        concepts = ast.literal_eval(f.read())

    initial_sentences = [
        "data mining and text mining",
        "nlp is mainly used by discourse analysis community",
        "data mining in python is fun",
        "mooc data analysis involves texts",
        "data and data mining are both very interesting",
    ]

    sentences = initial_sentences.copy()

    for i in range(1_000_000):
        sentences += initial_sentences

    start = time.time()
    hierarchical_dictionaries = create_hierarchical_dictionaries(concepts)
    middle = time.time()
    letters = []
    for result in embed_sentences(sentences, hierarchical_dictionaries):
        letters.append(result[0].capitalize())
    end = time.time()
    print(f"Time for hierarchical creation {(middle-start) * 1000.0} ms")
    print(f"Time for embedding {(end-middle) * 1000.0} ms")
    print(f"Overall time elapsed {(end-start) * 1000.0} ms")


def main():
    sanity_check()
    speed_check()


if __name__ == "__main__":
    main()

Results of the speed check:

Time for hierarchical creation 107.71822929382324 ms
Time for embedding 30460.427284240723 ms
Overall time elapsed 30568.145513534546 ms

So for 5 million sentences (the 5 sentences you provided, concatenated 1 million times) and the concepts file you provided (1.1 MB), it takes about 30 seconds to perform the concept mapping, which isn't bad, I suppose.

Worst case scenario, the dictionary should take as much memory as the input file (concepts.txt in this case), but it will usually be lower or much lower, as it depends on the combination of concept lengths and the unique words among them.

Use a suffix array approach.

Skip this step if your data is already sanitized.

Firstly, sanitize your data by replacing all whitespace characters with any character that you know won't be part of any concept or sentence.

Then build suffix arrays for all the sentences. This takes O(n log n) time per sentence. There are a few algorithms that can do this in O(n) time using suffix trees.

Once you have your suffix arrays ready for all the sentences, just perform a binary search for each of your concepts.

You can further optimize your search using the LCP array. Refer: Kasai's algorithm.

Using both the LCP and suffix arrays, the time complexity of the search can be brought down to O(n).

Edit: This approach is generally used in genome sequence alignment and is quite popular as well. You should easily find the implementations suited for you.

import re
concepts = [['natural language processing', 'text mining', 'texts', 'nlp'], ['advanced data mining', 'data mining', 'data'], ['discourse analysis', 'learning analytics', 'mooc']]
sentences = ['data mining and text mining', 'nlp is mainly used by discourse analysis community', 'data mining in python is fun', 'mooc data analysis involves texts', 'data and data mining are both very interesting']

replacementDict = {concept[0] : concept[1:] for concept in concepts}

finderAndReplacements = [
    (re.compile('(' + '|'.join(replacees) + ')'), replacement)
    for replacement, replacees in replacementDict.items()
]

def sentenceReplaced(findRegEx, replacement, sentence):
    return findRegEx.sub(replacement, sentence, count=0)

def sentencesAllReplaced(sentences, finderAndReplacements=finderAndReplacements):
    for regex, replacement in finderAndReplacements:
        sentences = [sentenceReplaced(regex, replacement, sentence) for sentence in sentences]
    return sentences

print(sentencesAllReplaced(sentences))
  • Setup: I preferred concepts expressed as a dict where the keys and values are the replacement and replacees. Stored this in replacementDict.
  • Compile a matching regular expression for each intended replacement group. Store it along with its intended replacement in the finderAndReplacements list.
  • The sentenceReplaced function returns the input sentence after the substitutions have been performed. (Order of application here will not matter, so parallelization should be possible if we take care to avoid race conditions.)
  • Finally, we cycle through and find/replace for each sentence. (A great deal of parallel structures would offer improved performance.)
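As a quick sanity check against the expected output from the question, a condensed version of the steps above reproduces it (assuming Python 3.7+, where dicts preserve insertion order):

```python
import re

concepts = [['natural language processing', 'text mining', 'texts', 'nlp'],
            ['advanced data mining', 'data mining', 'data'],
            ['discourse analysis', 'learning analytics', 'mooc']]
sentences = ['data mining and text mining',
             'nlp is mainly used by discourse analysis community',
             'data mining in python is fun',
             'mooc data analysis involves texts',
             'data and data mining are both very interesting']

replacementDict = {concept[0]: concept[1:] for concept in concepts}
finderAndReplacements = [
    (re.compile('(' + '|'.join(replacees) + ')'), replacement)
    for replacement, replacees in replacementDict.items()
]

# Apply each compiled matcher to every sentence in turn
for regex, replacement in finderAndReplacements:
    sentences = [regex.sub(replacement, s) for s in sentences]

assert sentences == [
    'advanced data mining and natural language processing',
    'natural language processing is mainly used by discourse analysis community',
    'advanced data mining in python is fun',
    'discourse analysis advanced data mining analysis involves natural language processing',
    'advanced data mining and advanced data mining are both very interesting',
]
print("all sentences match the expected output")
```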

I'd love to see some thorough benchmarking/testing/reporting, because I'm sure there are a lot of subtleties depending on the nature of this task's inputs (concepts, sentences) and the hardware running it.

In the case where sentences is the dominant input component compared to the concepts replacements, I believe compiling the regular expressions will be advantageous. When sentences are few and concepts many, especially if most concepts are not in any sentence, compiling these matchers would be a waste. And if there are very many replacees for every replacement, the compiled method used may perform poorly or even error out. (Varying assumptions about the input parameters offer a multitude of trade-off considerations, as is often the case.)

