
Algorithm for finding all of the shared substrings of any length between 2 strings, and then counting occurrences in string 2?

I've run into an unusual challenge, and so far I haven't been able to determine the most efficient algorithm for it.


Given the following 2 strings as an example, find all substrings of any length that the 2 strings share, and count the number of occurrences of each of those shared substrings in string 2. The algorithm also needs to handle files containing strings of 100MB or more.

Example:

String 1: ABCDE512ABC361EG51D

String 2: ADE5AHDW4131EG1DG5C

Given these 2 strings, the algorithm would find the following shared substrings: A,C,D,E,5,1,3,G,DE,E5,1E,EG,G5,1D,DE5,1EG

Then, for each of these shared substrings, we'd count how many occurrences there are in string 2.

A: 2 occurrences in string 2

C: 1 occurrence in string 2

D: 3 occurrences in string 2

etc.


The first approach I took to wrap my head around this problem was brute force: computing the shared substrings with 2 nested for loops. It is obviously the least efficient option, but it was a quick and dirty way to see what the expected output should be on smaller test inputs, and to establish the slowest possible running time: around 2 minutes to compute all shared substrings between 2 files containing ASCII strings with a size of 50kb each. Upping the size to 1mb brought this to a screeching halt because of the massive number of nested iterations required.
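For reference, here is a minimal sketch of what such a brute-force pass might look like. It is not the code I actually ran; the function names `countOccurrences` and `sharedSubstringsBruteForce` are made up for illustration, and it is only practical for small inputs.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Count how many times `pattern` occurs in `text` (overlaps allowed).
std::size_t countOccurrences(const std::string& text, const std::string& pattern) {
    std::size_t count = 0;
    for (std::size_t pos = text.find(pattern); pos != std::string::npos;
         pos = text.find(pattern, pos + 1)) {
        ++count;
    }
    return count;
}

// Brute force: try every substring of s1 and count it in s2.
// Returns a map from shared substring to its number of occurrences in s2.
std::map<std::string, std::size_t> sharedSubstringsBruteForce(
        const std::string& s1, const std::string& s2) {
    std::map<std::string, std::size_t> result;
    for (std::size_t start = 0; start < s1.size(); ++start) {
        for (std::size_t len = 1; start + len <= s1.size(); ++len) {
            const std::string candidate = s1.substr(start, len);
            if (result.count(candidate) != 0) continue;  // already counted
            const std::size_t n = countOccurrences(s2, candidate);
            if (n == 0) break;  // a longer substring from this start cannot match either
            result[candidate] = n;
        }
    }
    return result;
}
```

On the two example strings this yields 16 shared substrings, with D mapped to 3. The nested loops over `start` and `len`, each followed by a scan of string 2, are what make this approach collapse on larger files.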

The next approach was using trees, to see how much memory I could trade off to reduce compute time. This approach was much faster. The same two 50kb files that took 2 minutes with the brute force method were near instant. Running against 1mb files was still very fast (seconds), but as I continued to test with larger and larger files, I quickly began running into memory issues due to tree sizes.
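To make the memory trade-off concrete, here is one possible tree-based sketch. It assumes a plain suffix trie of string 2, which is not necessarily the tree I used; `TrieNode`, `buildSuffixTrie` and `sharedViaTrie` are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <string>

struct TrieNode {
    // Number of occurrences in s2 of the substring spelled by the path to this node.
    std::size_t count = 0;
    std::map<char, std::unique_ptr<TrieNode>> children;
};

// Insert every suffix of s2; each node is visited once per occurrence of its substring.
std::unique_ptr<TrieNode> buildSuffixTrie(const std::string& s2) {
    std::unique_ptr<TrieNode> root(new TrieNode());
    for (std::size_t start = 0; start < s2.size(); ++start) {
        TrieNode* node = root.get();
        for (std::size_t i = start; i < s2.size(); ++i) {
            std::unique_ptr<TrieNode>& child = node->children[s2[i]];
            if (!child) child.reset(new TrieNode());
            node = child.get();
            ++node->count;
        }
    }
    return root;
}

// Walk the trie from every start position of s1; every node reached is a shared substring.
std::map<std::string, std::size_t> sharedViaTrie(const std::string& s1,
                                                 const std::string& s2) {
    const std::unique_ptr<TrieNode> root = buildSuffixTrie(s2);
    std::map<std::string, std::size_t> result;
    for (std::size_t start = 0; start < s1.size(); ++start) {
        const TrieNode* node = root.get();
        for (std::size_t i = start; i < s1.size(); ++i) {
            const auto it = node->children.find(s1[i]);
            if (it == node->children.end()) break;  // no longer match from this start
            node = it->second.get();
            result[s1.substr(start, i - start + 1)] = node->count;
        }
    }
    return result;
}
```

Lookups are fast because each step is a single child lookup, but an uncompressed trie like this can need a number of nodes quadratic in the length of string 2, which is the kind of growth that runs out of memory on large files.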


Note: The string files will only ever contain ASCII characters!


Edit:

I'm escalating this a bit further, please see:

https://gist.github.com/braydo25/f7a9ce7ce7ad7c5fb11ec511887789bc

Here is some code illustrating the idea I presented in the comments above. Although it is runnable C++ code, it is closer to pseudo-code in the sense that the data structures used are surely not optimal, but they allow a clear view of the algorithm.

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct Occurrence
{
    //The vectors contain indices to the first character of the occurrence in ...
    std::vector<size_t> s1;  // ... string 1 and ...
    std::vector<size_t> s2;  // ... string 2.
};

int main()
{
    //If you cannot load the entire strings in memory, a memory-mapped file might be
    //worth considering
    std::string s1 = "ABCDE512ABC361EG51D";
    std::string s2 = "ADE5AHDW4131EG1DG5C";

    //These vectors store the occurrences of substrings for the current and next length
    std::vector<Occurrence> occurrences, nextOccurrences;
    int length = 1;

    std::map<char, Occurrence> occurrenceMap;
    //Initialize occurrences
    for (size_t i = 0; i < s1.length(); ++i)
        occurrenceMap[s1[i]].s1.push_back(i);
    for (size_t i = 0; i < s2.length(); ++i)
        occurrenceMap[s2[i]].s2.push_back(i);

    for (auto& pair : occurrenceMap)
    {
        if (pair.second.s1.size() > 0 && pair.second.s2.size() > 0)
            occurrences.push_back(std::move(pair.second));
    }

    do
    {
        nextOccurrences.clear();

        std::cout << "Length " << length << std::endl;
        for(auto& o : occurrences)
        {
            std::cout << std::string(s1.c_str() + o.s1[0], length) << " occurred "
                      << o.s1.size() << " / " << o.s2.size() << " times." << std::endl;

            //Expand the occurrence
            occurrenceMap.clear();
            for (auto p : o.s1)
            {
                if (p + length < s1.length())
                    occurrenceMap[s1[p + length]].s1.push_back(p);
            }
            for (auto p : o.s2)
            {
                if (p + length < s2.length())
                    occurrenceMap[s2[p + length]].s2.push_back(p);
            }
            for (auto& pair : occurrenceMap)
            {
                if (pair.second.s1.size() > 0 && pair.second.s2.size() > 0)
                    nextOccurrences.push_back(std::move(pair.second));
            }
        }

        ++length;
        std::swap(occurrences, nextOccurrences);

    } while (!occurrences.empty());


    return 0;
}

Output:

Length 1
1 occurred 3 / 3 times.
3 occurred 1 / 1 times.
5 occurred 2 / 2 times.
A occurred 2 / 2 times.
C occurred 2 / 1 times.
D occurred 2 / 3 times.
E occurred 2 / 2 times.
G occurred 1 / 2 times.
Length 2
1D occurred 1 / 1 times.
1E occurred 1 / 1 times.
DE occurred 1 / 1 times.
E5 occurred 1 / 1 times.
EG occurred 1 / 1 times.
G5 occurred 1 / 1 times.
Length 3
1EG occurred 1 / 1 times.
DE5 occurred 1 / 1 times.

Memory use peaks during initialization, because there is an entry for every character of both input strings. If you know the approximate length of the strings, you can choose a more compact index data type than size_t. The amount of memory needed is on the order of the input size, so two 100 MB files should be no problem for common computers. After the initialization (more specifically, after the first iteration of the loop), most of this data is deleted because it is no longer needed.
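As a rough illustration of the index-type remark, here is a small sketch. It assumes each input is shorter than 2^32 bytes; `Index`, `indexBytes` and `positionsOf` are names I made up, and the estimate ignores the overhead of the vectors and map nodes themselves.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Assumption: each input is shorter than 2^32 bytes, so 32-bit indices suffice.
using Index = std::uint32_t;

// Rough lower bound on the bytes needed to hold one start index per input
// character (ignores per-vector and per-map-node overhead).
constexpr std::uint64_t indexBytes(std::uint64_t len1, std::uint64_t len2,
                                   std::uint64_t bytesPerIndex) {
    return (len1 + len2) * bytesPerIndex;
}

// Positions of character c in s, stored with the narrower index type.
std::vector<Index> positionsOf(const std::string& s, char c) {
    std::vector<Index> out;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (s[i] == c) out.push_back(static_cast<Index>(i));
    }
    return out;
}
```

For two 100 MB inputs, `indexBytes(100000000, 100000000, sizeof(Index))` gives 800,000,000 bytes, versus 1,600,000,000 with an 8-byte index type (which is what size_t typically is on 64-bit platforms).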

Here's a C implementation based on traversing the suffix array of the concatenation of the inputs, with the help of the longest common prefix array. You can replace the programming-contest-grade (O(n log^2 n)) suffix array implementation with a real one (O(n) or O(n log n)) for a large performance improvement. (EDIT: did this, with some other changes reflecting the asker's new requirements: https://github.com/eisenstatdavid/commonsub .)

#include <inttypes.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

typedef int_fast32_t I32;

#define Constrain(expression) _Static_assert(expression, #expression)
Constrain(CHAR_BIT == 8);
#define InputMaxBytes 80000000
Constrain(InputMaxBytes <= (INT_LEAST32_MAX - 2) / 2);
#define MaxLen (2 * InputMaxBytes + 2)
Constrain(MaxLen <= INT_FAST32_MAX / 2);

static I32 Len;
static I32 Begin2;
static signed char Buf[MaxLen];
static int_least32_t SufArr[MaxLen];
static int_least32_t SufRank[MaxLen];
static int_least32_t NewRank[MaxLen];
static int_least32_t *const LongCommPre = NewRank;  // aliased to save space
static uint_least64_t Bitmap2[(MaxLen >> 6) + 1];
static int_least32_t SparseCount2[(MaxLen >> 6) + 1];
static int_least32_t *const Stack = SufRank;  // aliased to save space

static void Slurp(const char *filename) {
  FILE *stream = fopen(filename, "r");
  if (stream == NULL) goto fail;
  I32 n = fread(Buf + Len, sizeof *Buf, InputMaxBytes + 1, stream);
  if (ferror(stream)) goto fail;
  if (n > InputMaxBytes) {
    fprintf(stderr, "%s: file is too large; increase InputMaxBytes\n",
            filename);
    exit(EXIT_FAILURE);
  }
  for (I32 i = 0; i < n; i++) {
    if (Buf[Len + i] < 0) {
      fprintf(stderr,
              "%s: file contains non-ASCII byte at offset %" PRIdFAST32 "\n",
              filename, i);
      exit(EXIT_FAILURE);
    }
  }
  Len += n;
  if (fclose(stream) == EOF) goto fail;
  return;
fail:
  perror(filename);
  exit(EXIT_FAILURE);
}

static I32 Radix;

static int CompareRankPairs(const void *iPtr, const void *jPtr) {
  I32 i = *(const int_least32_t *)iPtr;
  I32 j = *(const int_least32_t *)jPtr;
  if (SufRank[i] < SufRank[j]) return -1;
  if (SufRank[i] > SufRank[j]) return 1;
  I32 iRank = i + Radix < Len ? SufRank[i + Radix] : -2;
  I32 jRank = j + Radix < Len ? SufRank[j + Radix] : -2;
  if (iRank < jRank) return -1;
  if (iRank > jRank) return 1;
  return 0;
}

static void BuildSuffixArray(void) {
  for (I32 i = 0; i < Len; i++) {
    SufArr[i] = i;
    SufRank[i] = Buf[i];
  }
  for (Radix = 1; true; Radix *= 2) {
    qsort(SufArr, Len, sizeof *SufArr, CompareRankPairs);
    NewRank[0] = 0;
    for (I32 i = 1; i < Len; i++) {
      NewRank[i] = CompareRankPairs(&SufArr[i - 1], &SufArr[i]) == 0
                       ? NewRank[i - 1]
                       : NewRank[i - 1] + 1;
    }
    for (I32 i = 0; i < Len; i++) {
      SufRank[SufArr[i]] = NewRank[i];
    }
    if (NewRank[Len - 1] == Len - 1) break;
  }

  I32 lenCommPre = 0;
  for (I32 i = 0; i < Len; i++) {
    if (SufRank[i] == Len - 1) {
      LongCommPre[SufRank[i]] = -1;
      continue;
    }
    while (Buf[i + lenCommPre] == Buf[SufArr[SufRank[i] + 1] + lenCommPre]) {
      lenCommPre++;
    }
    LongCommPre[SufRank[i]] = lenCommPre;
    if (lenCommPre > 0) lenCommPre--;
  }
}

static I32 PopCount(uint_fast64_t x) {
  I32 v = 0;
  while (x != 0) {
    x &= x - 1;
    v++;
  }
  return v;
}

static void BuildCumCount2(void) {
  for (I32 i = 0; i < Len; i++) {
    if (SufArr[i] >= Begin2) {
      Bitmap2[i >> 6] |= UINT64_C(1) << (i & 63);
      SparseCount2[i >> 6]++;
    }
  }
  for (I32 i = 0; i < (Len >> 6); i++) {
    SparseCount2[i + 1] += SparseCount2[i];
  }
}

static I32 CumCount2(I32 i) {
  return SparseCount2[i >> 6] - PopCount(Bitmap2[i >> 6] >> (i & 63));
}

static void FindCommonStrings(void) {
  I32 lenCommPre = -1;
  for (I32 i = 0; i < Len; i++) {
    while (lenCommPre > LongCommPre[i]) {
      I32 begin = Stack[lenCommPre];
      I32 end = i + 1;
      I32 count2 = CumCount2(end) - CumCount2(begin);
      if (count2 > 0 && count2 < end - begin && lenCommPre > 0) {
        printf("%" PRIdFAST32 "\t%.*s\n", count2, (int)lenCommPre,
               Buf + SufArr[begin]);
      }
      lenCommPre--;
    }
    while (lenCommPre < LongCommPre[i]) {
      lenCommPre++;
      Stack[lenCommPre] = i;
    }
  }
}

int main(int argc, char *argv[]) {
  if (argc != 3) {
    fputs("usage: commonsub needle haystack\n", stderr);
    exit(EXIT_FAILURE);
  }
  Len = 0;
  Slurp(argv[1]);
  Buf[Len] = -1;
  Len++;
  Begin2 = Len;
  Slurp(argv[2]);
  Buf[Len] = -2;  // sentinel
  BuildSuffixArray();
  if (false) {
    for (I32 i = 0; i < Len; i++) {
      printf("%" PRIdFAST32 "\t%" PRIdLEAST32 "\t%" PRIdLEAST32 "\t%.*s\n", i,
             SufArr[i], LongCommPre[i], (int)(Len - SufArr[i]),
             Buf + SufArr[i]);
    }
  }
  BuildCumCount2();
  FindCommonStrings();
}
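If the suffix array idea is unfamiliar, the following C++ sketch shows why it makes counting cheap. It is a deliberately simplified illustration, not the traversal used in the C program above: it builds a naive suffix array of string 2 only and counts one given pattern by binary search. The names `buildSuffixArrayNaive` and `countWithSuffixArray` are assumptions for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort the suffix start positions by the suffix text.
// Fine for small inputs; far too slow for 100 MB files.
std::vector<std::size_t> buildSuffixArrayNaive(const std::string& text) {
    std::vector<std::size_t> sa(text.size());
    std::iota(sa.begin(), sa.end(), std::size_t{0});
    std::sort(sa.begin(), sa.end(), [&text](std::size_t a, std::size_t b) {
        return text.compare(a, std::string::npos, text, b, std::string::npos) < 0;
    });
    return sa;
}

// All suffixes that start with `pattern` are adjacent in the suffix array,
// so the number of occurrences is the width of that range.
std::size_t countWithSuffixArray(const std::string& text,
                                 const std::vector<std::size_t>& sa,
                                 const std::string& pattern) {
    // First suffix whose leading |pattern| characters are not less than pattern.
    const auto lo = std::lower_bound(
        sa.begin(), sa.end(), pattern,
        [&text](std::size_t suffix, const std::string& p) {
            return text.compare(suffix, p.size(), p) < 0;
        });
    // First suffix after that whose leading characters are greater than pattern.
    const auto hi = std::upper_bound(
        lo, sa.end(), pattern,
        [&text](const std::string& p, std::size_t suffix) {
            return text.compare(suffix, p.size(), p) > 0;
        });
    return static_cast<std::size_t>(hi - lo);
}
```

With string 2 from the question, the pattern "D" maps to a range of width 3 and "1EG" to a range of width 1. The C program gets the same kind of ranges for all shared substrings at once by walking the longest common prefix array instead of searching pattern by pattern.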

After looking at the two strings and thinking about this for a bit, I worked through the procedure in my head, and now I'm going to translate it into steps.

String 1: ABCDE512ABC361EG51D  // S1
String 2: ADE5AHDW4131EG1DG5C  // S2

The idea is to compare characters and/or substrings from S1 against S2 while keeping track of occurrences.

S1[0] = 'A'  compare S2[0]  = 'A' = true : found A in S2 at location 0
S1[0] = 'A'  compare S2[1]  = 'D' = false
S1[0] = 'A'  compare S2[2]  = 'E' = false
S1[0] = 'A'  compare S2[3]  = '5' = false
S1[0] = 'A'  compare S2[4]  = 'A' = true : found A in S2 at location 4
S1[0] = 'A'  compare S2[5]  = 'H' = false
S1[0] = 'A'  compare S2[6]  = 'D' = false
S1[0] = 'A'  compare S2[7]  = 'W' = false
S1[0] = 'A'  compare S2[8]  = '4' = false
S1[0] = 'A'  compare S2[9]  = '1' = false
S1[0] = 'A'  compare S2[10] = '3' = false
S1[0] = 'A'  compare S2[11] = '1' = false
S1[0] = 'A'  compare S2[12] = 'E' = false
S1[0] = 'A'  compare S2[13] = 'G' = false
S1[0] = 'A'  compare S2[14] = '1' = false
S1[0] = 'A'  compare S2[15] = 'D' = false
S1[0] = 'A'  compare S2[16] = 'G' = false
S1[0] = 'A'  compare S2[17] = '5' = false
S1[0] = 'A'  compare S2[18] = 'C' = false

// End of First Search - Occurrences of 'A' in S2 is 2 at locations {0,4}

// Next Iteration
String 1: ABCDE512ABC361EG51D  // S1
String 2: ADE5AHDW4131EG1DG5C  // S2

// Repeat this for all single characters Of S1 against S2
'A' in S2 = 2  at {0,4}
'B' in S2 = 0 
'C' in S2 = 1  at {18}
'D' in S2 = 3  at {1,6,15}
'E' in S2 = 2  at {2,12}
'5' in S2 = 2  at {3,17}
'1' in S2 = 3  at {9,11,14}
'2' in S2 = 0
'A' Already Found Above Skip
'B' Already Found Above Skip
'C' Already Found Above Skip
'3' in S2 = 1  at {10}
'6' in S2 = 0
'1' Already Found Above Skip
'E' Already Found Above Skip
'G' in S2 = 2  at {13, 16}
'5' Already Found Above Skip
'1' Already Found Above Skip
'D' Already Found Above Skip

This concludes the first set of iterations over all single characters. As you can see, we have also built a table (a list, map, or set) of not only the occurrence counts but also their locations, and we can store it for future reference.

So if we begin to search for S1[0 & 1] in S2, we know that S1[1] ('B') does not exist in S2, so we can break and don't need to go down that chain. Since we can break out of that branch, we can also skip S1[1 & ... N] and move directly to S1[2]. We know that there is only 1 occurrence of S1[2], which is 'C', in S2, located at {18}, which is the end of the string, so there is no need to look for S1[2 & ... N]. We skip this and move to S1[3], which is 'D'. We know that it exists in S2 at {1,6,15}, so now we search for S1[3 & ... N] beginning at S2[1 & ... N], then do the same search starting at S2[6 & ... N], and finally starting at S2[15 & ... N]. We have now found all shared substrings that start with 'D' and can save their occurrences.

However, this is where we want to find the longest substring between the two. The longest substring is "DE5" and there is only one occurrence of it, but from this we have also already found the substrings "DE" and "E5", so we can search for them at this point as well, and we find that there is 1 occurrence of each. Then we just repeat this process. It will take a fairly long time at first, but the further you traverse through the strings, the faster it gets, because already-found occurrences are eliminated and substrings of S1 that do not occur in S2 are skipped.
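The extension step can be sketched roughly as follows. This is only an illustration of the idea, not a full implementation: `matchLength` and `longestSharedFrom` are hypothetical helpers, and the positions of the first character are found with `std::string::find` here rather than looked up in the table from the first pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Length of the longest common prefix of s1[i..] and s2[j..].
std::size_t matchLength(const std::string& s1, std::size_t i,
                        const std::string& s2, std::size_t j) {
    std::size_t len = 0;
    while (i + len < s1.size() && j + len < s2.size() &&
           s1[i + len] == s2[j + len]) {
        ++len;
    }
    return len;
}

// Longest substring of s1 starting at index i that also occurs in s2, found by
// extending from every position in s2 where the character s1[i] occurs.
// Returns an empty string when s1[i] does not occur in s2 at all.
std::string longestSharedFrom(const std::string& s1, std::size_t i,
                              const std::string& s2) {
    assert(i < s1.size());
    std::size_t best = 0;
    for (std::size_t j = s2.find(s1[i]); j != std::string::npos;
         j = s2.find(s1[i], j + 1)) {
        best = std::max(best, matchLength(s1, i, s2, j));
    }
    return s1.substr(i, best);
}
```

For the example strings, starting at S1[3] gives "DE5", starting at S1[1] ('B') gives an empty string, and starting at S1[2] gives just "C". Every prefix of the returned string ("D", "DE") is a shared substring too, which is why the shorter ones come for free.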

This is the logical approach that I took without using any code or programming semantics; it is just the basic algorithm, worked through logically. It now becomes a matter of putting this into functions and containers to write a source code implementation of it.

EDIT - As asked in the comments about how this differs from the other answers, and about its time and space complexity: here is a version of my algorithm that does the first pass, searching for single characters and building the table of positions and whether they exist in the 2nd string. The vector stored in the class contains an entry for each unique character of S1, with its positions in S2. This can then be used to help find longer substrings.

// C++ - The user asked for this in C but I haven't used C in nearly 10 years so this is my version of it in C++ :(
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

class SubStringSearch {
private:
    std::string S1;
    std::string S2; 

    struct SubstringResult {
        std::string substring;
        bool found;
        std::vector<unsigned> positions;

        SubstringResult(){}
        SubstringResult( const std::string& substringIn, bool foundIn, std::vector<unsigned> positionsIn ) :
            substring( substringIn ), found( foundIn ), positions( positionsIn ) {}
    };

    std::vector<SubstringResult> results;

public:
    SubStringSearch( const std::string& s1, const std::string& s2 ) : S1( s1 ), S2( s2 ) {}

    void compareStringsFirstPass();
    std::vector<unsigned> findLocations( const std::string& str, char findIt );
    void printResults() const;

};

std::vector<unsigned> SubStringSearch::findLocations( const std::string& str, char findIt ) {
    std::vector<unsigned> locations;
    for ( unsigned i = 0; i < str.size(); ++i ) {
        if ( str[i] == findIt ) {
            locations.push_back( i );
        }
    }
    return locations;
}

void SubStringSearch::compareStringsFirstPass() {
    std::vector<unsigned> positions;
    std::string sub;
    bool alreadyFound = false;

    for ( unsigned idx = 0; idx < S1.size(); ++idx ) {
        sub = S1[idx];

        if ( idx > 0 ) {
            for ( unsigned u = 0; u < results.size(); ++u ) {
                if ( sub == results[u].substring ) {
                    alreadyFound = true;
                    break;
                }
            }
        }

        // Added An If Else Here To Reduce Unneeded Calls To findLocations()
        if ( alreadyFound ) {
            alreadyFound = false;
            continue;
        } else {
            positions = findLocations( S2, S1[idx] );
        }

        if ( positions.size() > 0 && !alreadyFound ) {
            results.push_back( SubstringResult( sub, true, positions ) );
        } else if ( !alreadyFound ) {
            positions.clear();
            results.push_back( SubstringResult( sub, false, positions ) );
        }

        positions.clear();
        alreadyFound = false;
    }
}

void SubStringSearch::printResults() const {
    for ( unsigned u = 0; u < results.size(); ++u ) {
        if ( results[u].found ) {
            std::cout << results[u].substring << " found in S2 at " << std::setw(2);
            for ( unsigned i = 0; i < results[u].positions.size(); ++i ) {
                std::cout << std::setw(2) << results[u].positions[i] << " ";
            }
            std::cout << std::endl;
        }
    }
}

int main() {
    std::string S1( "ABCDE512ABC361EG51D" );
    std::string S2( "ADE5AHDW4131EG1DG5C" );

    SubStringSearch searchStrings( S1, S2 );
    searchStrings.compareStringsFirstPass();

    std::cout << "break point";

    return 0;
} // main

Place a break point on that last print line and, in your debugger (the Locals or Autos window in MSVC, or the equivalent for your compiler / debugger), inspect the class's member variable that is a std::vector. For each character from S1 you will see a bool flag saying whether it was found, as well as a std::vector of its positions. If the flag is false, the vector size should be 0, and vice versa: if the vector size is > 0, the flag should be true. The size of the vector of positions is also the count of occurrences of that character in the 2nd string, which is nice because we don't have to calculate anything else; we can just get it from the vector itself.

Now, this is not the complete algorithm; it is only the first pass, which takes each single character of string 1 and looks it up in string 2 while building the needed table and skipping over characters that have already been found. It will be up to the OP to build upon this if they choose to complete the rest of the algorithm. If I happen to find some free time in the near future, I may go ahead and complete the full algorithm.

From what I understand, breaking up a string into all possible substrings is in itself an O(n*n) operation: a string of length n has n(n+1)/2 substrings.

abcd
====
a,b,c,d
ab,bc,cd
abc,bcd
abcd
************************
abcdefgh
========
a,b,c,d,e,f,g,h
ab,bc,cd,de,ef,fg,gh
abc,bcd,cde,def,efg,fgh
abcd,bcde,cdef,defg,efgh
abcde,bcdef,cdefg,defgh
abcdef,bcdefg,cdefgh
abcdefg,bcdefgh
abcdefgh

As such, it doesn't look like a solution in linear time is possible.
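The listings above can be reproduced with a few lines of code. This is a small sketch with a made-up function name, `allSubstrings`, that enumerates the substrings length by length in the same order as the listings.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Enumerate every substring of s, shortest first, left to right within each length.
// A string of length n yields n + (n-1) + ... + 1 = n(n+1)/2 entries.
std::vector<std::string> allSubstrings(const std::string& s) {
    std::vector<std::string> out;
    out.reserve(s.size() * (s.size() + 1) / 2);
    for (std::size_t len = 1; len <= s.size(); ++len) {
        for (std::size_t start = 0; start + len <= s.size(); ++start) {
            out.push_back(s.substr(start, len));
        }
    }
    return out;
}
```

"abcd" produces 10 entries and "abcdefgh" produces 36, so merely writing the substrings out already costs quadratic work in the length of the input.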

Furthermore, to actually solve it, from a Java language perspective, you'd first have to break the string up and store the substrings in a set or a map (the map can have the substring as the key and the number of occurrences as the value).

Then repeat the step for the second string as well.

Then you can iterate over the first map, checking whether each entry exists in the second string's map, and recording the number of occurrences for that substring as you go.
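The same map-based steps can be sketched in C++ (used here instead of Java to match the other code in this thread). The names `countAllSubstrings` and `sharedCounts` are made up for the example, and because every substring is stored, this is only workable for small inputs.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>

// Map every substring of s to the number of times it occurs in s.
std::unordered_map<std::string, std::size_t> countAllSubstrings(const std::string& s) {
    std::unordered_map<std::string, std::size_t> counts;
    for (std::size_t start = 0; start < s.size(); ++start) {
        for (std::size_t len = 1; start + len <= s.size(); ++len) {
            ++counts[s.substr(start, len)];
        }
    }
    return counts;
}

// Keep the substrings present in both maps, with their counts taken from s2.
std::map<std::string, std::size_t> sharedCounts(const std::string& s1,
                                                const std::string& s2) {
    const auto counts1 = countAllSubstrings(s1);
    const auto counts2 = countAllSubstrings(s2);
    std::map<std::string, std::size_t> shared;
    for (const auto& entry : counts1) {
        const auto it = counts2.find(entry.first);
        if (it != counts2.end()) shared[entry.first] = it->second;
    }
    return shared;
}
```

For the strings in the question, `sharedCounts` returns 16 entries, including D with a count of 3 and G with a count of 2.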

If you are using C, you can try sorting the array of substrings and then using binary search to find matches (while keeping a two-dimensional array to track each string and its count of occurrences).
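Here is a rough sketch of the sort-and-search variant, written in C++ rather than C to stay consistent with the other examples. `sortedSubstrings` and `countBySearch` are hypothetical names, and equal substrings are simply kept as duplicates so that the width of the matching range is the occurrence count.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// All substrings of s (duplicates kept), sorted lexicographically.
std::vector<std::string> sortedSubstrings(const std::string& s) {
    std::vector<std::string> subs;
    for (std::size_t start = 0; start < s.size(); ++start) {
        for (std::size_t len = 1; start + len <= s.size(); ++len) {
            subs.push_back(s.substr(start, len));
        }
    }
    std::sort(subs.begin(), subs.end());
    return subs;
}

// Binary search for key; the number of equal entries is its occurrence count.
std::size_t countBySearch(const std::vector<std::string>& sorted,
                          const std::string& key) {
    const auto range = std::equal_range(sorted.begin(), sorted.end(), key);
    return static_cast<std::size_t>(range.second - range.first);
}
```

Building the sorted array for string 2 once and then calling `countBySearch` for each candidate from string 1 gives, for example, 3 for "D" and 0 for "B".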

You said you had a tree approach that ran faster. Do you mind posting a sample showing how you used the tree? Was it for representing the substrings or for helping to generate them?
