
Longest common substring constrained by pattern

Problem:

I have 3 strings s1, s2, s3. Each contains garbage text on either side, with a defining pattern in its centre: text1+number1. number1 increases by 2 in each string. I want to extract text1+number1.

I have already written code to find number1.

How would I extend an LCS function to get text1?

#include <iostream>
#include <string>

const std::string longestCommonSubstring(int, std::string const& s1, std::string const& s2, std::string const& s3);

int main(void) {
    std::string s1="hello 5", s2="bolo 7", s3="lo 9sdf";
    std::cout << "Trying to get \"lo 5\", actual result: \"" << longestCommonSubstring(5, s1, s2, s3) << '\"';
}

const std::string longestCommonSubstring(int must_include, std::string const& s1, std::string const& s2, std::string const& s3) {
    std::string longest;

    for(size_t start=0, length=1; start + length <= s1.size();) {
        std::string tmp = s1.substr(start, length);
        if (std::string::npos != s2.find(tmp) && std::string::npos != s3.find(tmp)) {
            tmp.swap(longest);
            ++length;
        } else ++start;
    }

    return longest;
}

Example:

From "hello 5", "bolo 7", "lo 9sdf" I would like to get "lo 5".

Code:

I have been able to write a simple LCS function (test-case), but I am having trouble writing this modified one.

Let's say you're looking for a pattern *n, *n+2, *n+4, etc. And you have the following strings: s1="hello 1,bye 2,ciao 1", s2="hello 3,bye 4,ciao 2" and s3="hello 5,bye 6,ciao 5". Then the following will do:

// find all pattern sequences
N1 = findAllPatterns(s1, number);
for i = 2 to n:
  for item in N(i-1):
    for match in findAllPatterns(si, nextPattern(item)):
      Ni.add([item, (match, indexOf(match))]);

// for all pattern sequences, identify the max common substring
maxCommonLength = 0;
for sequence in Nn:
  temp = findLCS(sequence);
  if (length(temp[0]) > maxCommonLength):
    maxCommonLength = length(temp[0]);
    result = temp;

return result;

The first part of the algorithm will identify sequences of (number, index) pairs such as: [(1, 6), (3, 6), (5, 6)], [(1, 19), (3, 6), (5, 6)], [(2, 12), (4, 12), (6, 12)]

The second part will identify ["hello 1", "hello 3", "hello 5"] as the longest substrings matching the pattern.

The algorithm can be further optimized by combining the two parts and discarding, early on, sequences that match the pattern but are suboptimal. I preferred to present it in two parts for clarity.
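The two parts above can be sketched in C++ roughly as follows. This is only one possible reading of the pseudocode, not the answerer's actual code: the names Match, Sequence, findAllNumbers, findSequences, commonLengthBefore and bestPattern are made up for illustration, "findLCS" is assumed to mean "the text shared by all strings immediately before the number", and numbers are assumed to be non-negative and small enough to fit in an int.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// One number found in a string: its value, start index, and digit count.
struct Match {
    int value;
    std::size_t pos;
    std::size_t digits;
};
using Sequence = std::vector<Match>;  // one Match per input string

// Collects every maximal run of digits in s (assumes each value fits in an int).
std::vector<Match> findAllNumbers(const std::string& s) {
    std::vector<Match> found;
    std::size_t i = 0;
    while (i < s.size()) {
        if (!std::isdigit(static_cast<unsigned char>(s[i]))) { ++i; continue; }
        const std::size_t start = i;
        int value = 0;
        while (i < s.size() && std::isdigit(static_cast<unsigned char>(s[i]))) {
            value = value * 10 + (s[i] - '0');
            ++i;
        }
        found.push_back({value, start, i - start});
    }
    return found;
}

// Part 1: chain numbers across the strings so each is `step` more than the previous one.
std::vector<Sequence> findSequences(const std::vector<std::string>& strings, int step) {
    std::vector<Sequence> sequences;
    if (strings.empty()) return sequences;
    for (const Match& m : findAllNumbers(strings[0])) sequences.push_back(Sequence(1, m));
    for (std::size_t i = 1; i < strings.size(); ++i) {
        const std::vector<Match> candidates = findAllNumbers(strings[i]);
        std::vector<Sequence> extended;
        for (const Sequence& seq : sequences) {
            for (const Match& m : candidates) {
                if (m.value != seq.back().value + step) continue;
                Sequence next = seq;
                next.push_back(m);
                extended.push_back(next);
            }
        }
        sequences.swap(extended);
    }
    return sequences;
}

// Length of the text that all strings share immediately before their number.
std::size_t commonLengthBefore(const std::vector<std::string>& strings, const Sequence& seq) {
    assert(seq.size() == strings.size());
    if (seq.empty()) return 0;
    std::size_t len = 0;
    while (true) {
        for (std::size_t i = 0; i < seq.size(); ++i) {
            if (seq[i].pos <= len) return len;  // reached the start of strings[i]
            if (strings[i][seq[i].pos - 1 - len] != strings[0][seq[0].pos - 1 - len]) return len;
        }
        ++len;
    }
}

// Part 2: keep the sequence with the longest shared text; return text+number per string.
std::vector<std::string> bestPattern(const std::vector<std::string>& strings, int step) {
    std::vector<std::string> result;
    std::size_t maxCommonLength = 0;
    for (const Sequence& seq : findSequences(strings, step)) {
        const std::size_t len = commonLengthBefore(strings, seq);
        if (!result.empty() && len <= maxCommonLength) continue;
        maxCommonLength = len;
        result.clear();
        for (std::size_t i = 0; i < seq.size(); ++i)
            result.push_back(strings[i].substr(seq[i].pos - len, len + seq[i].digits));
    }
    return result;
}
```

With the three example strings and a step of 2, bestPattern should return "hello 1", "hello 3", "hello 5". Ties are resolved in favour of the first sequence found, which is an arbitrary choice.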

-- Edit: fixed code block

If you know number1 already, and you know these numbers all appear just once in their corresponding strings, then the following should work:

I'll call your strings s[0], s[1], etc. Set longest = INT_MAX. For each string s[i] (i >= 0) just:

  • Find where number1 + 2 * i occurs in s[i]. Suppose it occurs at position j.
  • If (i == 0) j0 = j; else
    • for (k = 1; k <= j && k <= j0 && k <= longest && s[i][j - k] == s[0][j0 - k]; ++k) {}
    • longest = k - 1;

At the end, longest will be the length of the longest substring, ending immediately before the number, that is common to all the strings.

Basically we're just scanning backwards from the point where we find the number, looking for a mismatch with the corresponding character in your s1 (my s[0]), and keeping track of the longest matching substring so far in longest. This can only stay the same or decrease with each new string we look at.
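For what it's worth, the backward scan could look something like this in C++. It is a sketch under the same assumptions as above (number1 is known, and number1 + 2 * i occurs once in s[i]); the function name textBeforeNumber is invented here, and longest counts matched characters directly rather than tracking the loop index.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the text that sits immediately before the number in s[0] and also
// immediately before the corresponding number in every other string.
// Assumption: number1 + 2*i occurs in s[i]; only its first occurrence is used.
std::string textBeforeNumber(const std::vector<std::string>& s, int number1) {
    if (s.empty()) return "";
    std::size_t j0 = 0;       // position of the number in s[0]
    std::size_t longest = 0;  // length of the shared text found so far
    for (std::size_t i = 0; i < s.size(); ++i) {
        const std::string number = std::to_string(number1 + 2 * static_cast<int>(i));
        const std::size_t j = s[i].find(number);
        if (j == std::string::npos) return "";  // pattern missing: nothing in common
        if (i == 0) {
            j0 = j;
            longest = j0;  // at most everything before the number in s[0]
            continue;
        }
        // Scan backwards from the number until a mismatch or a string start.
        std::size_t k = 0;
        while (k < longest && k < j && s[i][j - 1 - k] == s[0][j0 - 1 - k]) ++k;
        longest = k;  // can only stay the same or shrink
    }
    assert(longest <= j0);
    return s[0].substr(j0 - longest, longest);
}
```

For "hello 5", "bolo 7", "lo 9sdf" with number1 = 5 this should give "lo ", so text1+number1 is that result followed by "5". Note that a plain find would also match a digit inside a longer number (the "5" in "15"), which is why the once-per-string assumption matters.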

Rather than try to modify the internals of the LCS algorithm, you could take its output and find it in s1. From there, your number will be located at an offset of the length of the output plus 1.

Wrote my own solution:
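A rough sketch of that idea, assuming the unmodified LCS happens to end right where the number begins in s1 (as "lo " does in "hello 5"). The helper name extendWithNumber is hypothetical, and if the LCS stops short of a separator the scan would first need to skip it:

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Locates the plain LCS inside s1 and extends it over the digits that follow.
// Returns an empty string if the LCS is empty or does not occur in s1.
std::string extendWithNumber(const std::string& s1, const std::string& lcs) {
    const std::size_t at = s1.find(lcs);
    if (lcs.empty() || at == std::string::npos) return "";
    std::size_t end = at + lcs.size();  // first character after the LCS
    while (end < s1.size() && std::isdigit(static_cast<unsigned char>(s1[end]))) ++end;
    return s1.substr(at, end - at);
}
```

Calling extendWithNumber("hello 5", "lo ") should return "lo 5". This breaks down when the longest common substring is some unrelated piece of the garbage text, so it is only a shortcut for well-behaved inputs.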

#include <iostream>
#include <string>
#include <utility>

typedef std::pair<std::pair<std::string, std::string>, std::pair<std::pair<std::string, std::string>, std::pair<std::string, std::string>>> pairStringTrio;
typedef std::pair<std::string,std::pair<std::string,std::string>> stringPairString;

stringPairString longestCommonSubstring(const pairStringTrio&);
std::string strFindReplace(const std::string&, const std::string&, const std::string&);

int main(void) {
        std::string s1= "6 HUMAN ACTIONb", s2="8 HUMAN ACTIONd", s3="10 HUMAN ACTIONf";
        pairStringTrio result = std::make_pair(std::make_pair(s1, "6"), std::make_pair(std::make_pair(s2, "8"), std::make_pair(s3, "10")));

        stringPairString answer = longestCommonSubstring(result);
        std::cout << '\"' << answer.first << "\"\t\"" << answer.second.first << "\"\t\"" << answer.second.second << '\"';
}


stringPairString longestCommonSubstring(const pairStringTrio &foo) {
        std::string longest;

        // Grow the window from length 1; starting near the full length of s1 would skip shorter matches.
        for(size_t start=0, length=1; start + length <= foo.first.first.size();) {
                std::string s1_tmp = foo.first.first.substr(start, length);
                std::string s2_tmp = strFindReplace(s1_tmp, foo.first.second, foo.second.first.second);
                std::string s3_tmp = strFindReplace(s1_tmp, foo.first.second, foo.second.second.second);

                if (std::string::npos != foo.second.first.first.find(s2_tmp) && std::string::npos != foo.second.second.first.find(s3_tmp)) {
                        s1_tmp.swap(longest);
                        ++length;
                } else ++start;
        }

        return std::make_pair(longest, std::make_pair(strFindReplace(longest, foo.first.second, foo.second.first.second), strFindReplace(longest, foo.first.second, foo.second.second.second)));
}

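std::string strFindReplace(const std::string &original, const std::string& src, const std::string& dest) {
        std::string answer=original;
        if (src.empty()) return answer; // nothing to search for
        // Advance past each replacement so text just inserted is never rescanned
        // (otherwise src="1", dest="11" would loop forever).
        for(std::size_t pos = 0; (pos = answer.find(src, pos)) != answer.npos; pos += dest.size())
                answer.replace(pos, src.size(), dest);
        return answer;
}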
