Reducing the time complexity of this program

Question - Write a function called answer(document, searchTerms) which returns the shortest snippet of the document containing all of the given search terms. The search terms can appear in any order.

Inputs:
(string) document = "many google employees can program"
(string list) searchTerms = ["google", "program"]
Output:
(string) "google employees can program"

Inputs:
(string) document = "a b c d a"
(string list) searchTerms = ["a", "c", "d"]
Output:
(string) "c d a"

My program below gives the correct answer, but its time complexity is very high because I am computing the Cartesian product of the positions of the search terms. When the input is very large, I cannot pass those test cases. I have not been able to reduce the complexity of this program, and any help will be greatly appreciated. Thanks.

import itertools
import sys

def answer(document, searchTerms):
    # sys.maxsize is used instead of sys.maxint so this also runs on Python 3
    minSpan = sys.maxsize
    matchedString = ""
    stringList = document.split(" ")
    d = dict()

    # Record every position at which each search term occurs
    for j in range(len(searchTerms)):
        for i in range(len(stringList)):
            if searchTerms[j] == stringList[i]:
                d.setdefault(searchTerms[j], []).append(i)

    # Try every combination of one position per term (the expensive part)
    for element in itertools.product(*d.values()):
        sortedList = sorted(list(element))
        differenceList = [t - s for s, t in zip(sortedList, sortedList[1:])]
        if minSpan > sum(differenceList):
            minSpan = sum(differenceList)
            sortedElement = sortedList
            # The terms are adjacent, so no shorter snippet exists
            if sum(differenceList) == len(sortedElement) - 1:
                break

    try:
        for i in range(sortedElement[0], sortedElement[len(sortedElement)-1]+1):
            matchedString += "".join(stringList[i]) + " "
    except:
        pass

    return matchedString

If anyone wants to clone my program, here is the code.

One solution would be to iterate through the document using two indices (start and stop). You simply keep track of how many occurrences of each of the searchTerms lie between start and stop. If not all are present, you increase stop until they are (or you reach the end of the document). When all are present, you increase start for as long as all searchTerms remain present. Whenever all searchTerms are present, you check whether that candidate is better than the previous candidates. This can be done in O(N) time (provided the number of search terms is limited, or the search terms are put in a set with O(1) lookup). Something like:

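words = document.split()
terms = set(searchTerms)   # set gives O(1) membership tests
start = 0
stop = 0
counts = dict()            # occurrences of each term inside words[start:stop]
cand_start = None
cand_stop = None

while True:
    if len(counts) < len(terms):
        # Some term is missing: grow the window to the right
        if stop == len(words):
            break
        term = words[stop]
        if term in terms:
            if term not in counts:
                counts[term] = 1
            else:
                counts[term] += 1
        stop += 1
    else:
        # All terms are present: record the candidate, then shrink from the left
        if cand_start is None or stop - start < cand_stop - cand_start:
            cand_start = start
            cand_stop = stop
        term = words[start]
        if term in counts:
            if counts[term] == 1:
                del counts[term]
            else:
                counts[term] -= 1
        start += 1

# words[cand_start:cand_stop] is the shortest snippet (cand_start is None if there is none)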
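
For reference, the same two-index idea can be wrapped into a function and checked against the two examples from the question. This is only a sketch under some assumptions: the name shortest_snippet is made up for illustration, the document is split on whitespace, and an empty string is returned when some term never occurs.

```python
def shortest_snippet(document, search_terms):
    """Return the shortest run of words containing every search term."""
    words = document.split()
    needed = set(search_terms)
    if not needed:
        return ""
    counts = {}   # occurrences of each needed term inside words[start:stop]
    best = None   # (start, stop) of the best window found so far
    start = 0
    for stop, word in enumerate(words, 1):
        if word in needed:
            counts[word] = counts.get(word, 0) + 1
        # While the window covers every term, record it and shrink from the left.
        while len(counts) == len(needed):
            if best is None or stop - start < best[1] - best[0]:
                best = (start, stop)
            left = words[start]
            if left in counts:
                counts[left] -= 1
                if counts[left] == 0:
                    del counts[left]
            start += 1
    return " ".join(words[best[0]:best[1]]) if best else ""


print(shortest_snippet("many google employees can program", ["google", "program"]))
print(shortest_snippet("a b c d a", ["a", "c", "d"]))
```

Each word is added to the window once and removed at most once, which is where the O(N) bound comes from.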

The Aho-Corasick algorithm will search a document for multiple search terms in linear time. It works by building a finite state automaton from the search terms, and then running the document through that automaton.
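
A compact sketch of such an automaton is shown here to make the idea concrete. The function names build_automaton and find_all are invented for this illustration and are not from any library. Note also that it matches at the character level, so a term such as "program" would be reported inside "programs"; whole-word matching would need an extra check.

```python
from collections import deque


def build_automaton(terms):
    # goto[s]: char -> next state; fail[s]: failure link; out[s]: terms ending at s
    goto, fail, out = [{}], [0], [[]]
    for term in terms:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(term)
    # Breadth-first pass to fill in failure links (depth-1 states keep fail = 0).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit the terms that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_all(document, terms):
    """Return (term, start position) pairs in the order the matches end."""
    goto, fail, out = build_automaton(terms)
    state, hits = 0, []
    for i, ch in enumerate(document):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for term in out[state]:
            hits.append((term, i - len(term) + 1))
    return hits


print(find_all("a cat and a dog", ["cat", "dog"]))
```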

So build the FSA and start the search. As search terms are found, store them in an array of tuples (search term, position). When you've found all of the search terms, stop the search. The last item in your list will contain the last search term found. That gives you the ending position of the range. Then search backwards in that list of found terms until all of the terms are found.

So if you're searching for ["cat", "dog", "boy", "girl"], you might get something like:

cat - 15
boy - 27
cat - 50
girl - 97
boy - 202
dog - 223

So you know the end of the range is 226 (position 223 plus the three characters of "dog"), and searching backward you find all four terms, with the last one being "cat" at position 50.
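
The forward and backward passes over the list of found terms can be sketched as follows. The function name first_covering_range is hypothetical, and the hits are assumed to be (term, position) tuples in document order, as in the listing above.

```python
def first_covering_range(hits, terms):
    """Return (start, end) character positions of the first range covering all terms."""
    needed = set(terms)
    seen = set()
    # Forward pass: stop at the first hit that completes the set of terms.
    for end_index, (term, _) in enumerate(hits):
        seen.add(term)
        if seen >= needed:
            break
    else:
        return None  # at least one term never occurs
    last_term, last_pos = hits[end_index]
    end = last_pos + len(last_term)
    # Backward pass: walk back until every term has been seen again.
    seen = set()
    for term, pos in reversed(hits[:end_index + 1]):
        seen.add(term)
        if seen >= needed:
            return pos, end


hits = [("cat", 15), ("boy", 27), ("cat", 50), ("girl", 97), ("boy", 202), ("dog", 223)]
print(first_covering_range(hits, ["cat", "dog", "boy", "girl"]))  # (50, 226)
```

One caveat: stopping at the first complete set gives a range that contains every term, but not necessarily the shortest one. For the document "a b c d a" with terms ["a", "c", "d"], it yields "a b c d" rather than "c d a", so to find the shortest snippet the scan would have to continue over the remaining hits and keep the smallest range.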
