Most effective way to look up a substring C of string B in string A in LINQ

Given two strings such as:

string a = "ATTAGACCTGCCGGAA";
string b = "GCCGGAATAC";

I would like to remove the part that is common to both strings and concatenate what remains. Only the matched part at the left (the start) of b should be removed, so I would get:

input

 ATTAGACCTGCCGGAA
          GCCGGAATAC

output

ATTAGACCTGCCGGAATAC

At first I thought of using a pattern and searching for it, but that is not possible because I do not know the pattern in advance (the length of the matched part varies).

Then I thought of searching for the whole of string b in a and, if that fails, deleting a character from b (the last one, since I want to keep the leftmost part of b that can still match) and looping until no characters of b are left, like this:

string a = "ATTAGACCTGCCGGAA";
string b = "GCCGGAATAC";
int times = b.Length;
string wantedString = string.Empty;
string auxString = b;
while (times > 0)
{

    if (!a.Contains(auxString))
    {
        //save last char and then delete it from auxString
        wantedString += auxString[auxString.Length - 1];
        //Substring drops exactly one char; TrimEnd would drop every trailing repeat of it
        auxString = auxString.Substring(0, auxString.Length - 1);
    }
    else
        break;
    times--;
}
//reverse string 
char[] reversedToAppend = wantedString.ToCharArray();
Array.Reverse(reversedToAppend);
string toAppend = new string(reversedToAppend);

So the answer is simply a + toAppend.
Is there a way to make this more efficient (maybe with LINQ)?

Edit

As @lavin correctly points out, c can occur anywhere in a while being a prefix of b. For instance, if a=AAT and b=AAG, the code should return AATG. The reason is that the common string starting at the left of b is c=AA. We delete it from b, which leaves G, and append that to a=AAT:

AAT
AAG

resulting

AATG

Another example would be:

a=ATTTGGGCCGCGCGCGAAAACCCCGCG
b=                  AACCCCGCGCGCA

here

c= AACCCCGCG

so result should be

result = ATTTGGGCCGCGCGCGAAAACCCCGCGCGCA

(all arrays and strings are 0 based in this answer)

First, I want to point out that the OP's problem statement is ambiguous. Let c be the common part of a and b. The OP's example input and output suggest that c must be a suffix of a and a prefix of b at the same time. Some of the other answers adopt this reading of the problem.

However, the OP's original implementation suggests that c can occur anywhere in a while being a prefix of b, because it uses a.Contains(auxString). That means that for a=AAT and b=AAG the OP's code returns AATG, whereas the other answers return AATAAG.

So there are two possible interpretations of the problem. Please clarify.

Second, let N be the length of the first string a and M the length of the second string b. Unlike the O(N*M) approach of the original code and the existing answers, an O(N+M) algorithm is possible using any of the following: KMP, suffix array, suffix tree, Z-algorithm.

I'll briefly describe how to use the Z-algorithm to solve this problem, since it seems to be mentioned much less on Stack Overflow than the others.

For details of the Z-algorithm, see http://www.cs.umd.edu/class/fall2011/cmsc858s/Lec02-zalg.pdf

Basically, for a string S of length L, it computes an array Z of length L in which Z[i] is the length of the longest common prefix of S and S[i:] ( S[i:] means the substring of S starting at position i ).
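
To make the definition concrete, here is a small Python sketch that computes Z directly from the definition. It is deliberately the naive quadratic version, not the linear-time Z-algorithm itself, and the name z_naive is made up for this illustration.

```python
def z_naive(s):
    """Return Z where Z[i] is the length of the longest common prefix of s and s[i:].

    Naive O(L^2) version written straight from the definition, for illustration only.
    """
    n = len(s)
    z = [0] * n
    for i in range(n):
        k = 0
        # Extend the match between s[0:] and s[i:] one character at a time.
        while i + k < n and s[k] == s[i + k]:
            k += 1
        z[i] = k
    return z


print(z_naive("AABAAB"))  # [6, 1, 0, 3, 1, 0]
```

Note that with this definition Z[0] is simply L, because S trivially matches itself.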

For this problem, we concatenate the strings a and b into d=b+a ( b in front of a ) and compute the Z array of the combined string d . Using this Z array, we can easily find the longest prefix of b that also occurs in a .

For interpretation one of the problem, in which c must be a suffix of a and a prefix of b :

max_prefix = 0
for i in range(M, N+M):
  if Z[i] == N+M - i: 
    # d has no separator between b and a, so ignore matches longer than b
    if max_prefix < Z[i] <= M:
      max_prefix = Z[i]

and the answer would be:

a+b[max_prefix:]

For interpretation two of the problem, in which c must be a prefix of b but can occur anywhere in a :

max_prefix = 0
for i in range(M, N+M):
  if Z[i] > max_prefix:
    max_prefix = Z[i]

again the answer would be:

a+b[max_prefix:]

The only difference between the two cases is this line:

  if Z[i] == N+M-i: 

To understand this line, remember that Z[i] is the length of the longest common prefix of the strings d and d[i:] . Then:

  1. Note that d=b+a .
  2. We enumerate i from M to M+N-1 , which is the range of a in d . So d[i:] is equal to a[i-M:] , and the length of a[i-M:] is N-(i-M)=N+M-i .
  3. Since d starts with b , checking whether Z[i] equals N+M-i checks whether a[i-M:] is also a prefix of d , and hence of b as long as it is no longer than M . If so, we have found a common string c that is a prefix of b and also a suffix of a .
  4. Without this line, we only know that we have found a string c that is a prefix of b and occurs in a starting at position i-M (position i in d ); it is not guaranteed to reach the end of a .
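
Putting the pieces together, the following Python sketch is one possible way to implement both interpretations. The names z_array and merge and the suffix_only flag are assumptions made for this example, not part of the original answer. It also guards against one detail: because d=b+a has no separator, Z[i] can exceed M when a is longer than b, so the sketch rejects or caps such values.

```python
def z_array(s):
    """Linear-time Z array: z[i] is the length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n == 0:
        return z
    z[0] = n
    left = right = 0  # s[left:right] is the rightmost window known to match a prefix of s
    for i in range(1, n):
        if i < right:
            # Reuse what is already known from inside the window.
            z[i] = min(right - i, z[i - left])
        # Extend the match by direct comparison.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def merge(a, b, suffix_only):
    """Append to a the part of b left after removing the common part c.

    suffix_only=True  -> interpretation one (c is a suffix of a and a prefix of b)
    suffix_only=False -> interpretation two (c is a prefix of b, anywhere in a)
    """
    n, m = len(a), len(b)
    z = z_array(b + a)
    best = 0
    for i in range(m, n + m):
        if suffix_only:
            # The match must reach the end of a and must fit inside b.
            if z[i] != n + m - i or z[i] > m:
                continue
            length = z[i]
        else:
            # A match can run past the end of b into a; only the part inside b counts.
            length = min(z[i], m)
        best = max(best, length)
    return a + b[best:]


print(merge("ATTAGACCTGCCGGAA", "GCCGGAATAC", suffix_only=True))  # ATTAGACCTGCCGGAATAC
print(merge("AAT", "AAG", suffix_only=False))                     # AATG
print(merge("AAT", "AAG", suffix_only=True))                      # AATAAG
```

Both calls run in O(N+M) time, since the Z array is built once and then scanned once.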

This finds the first point at which b overlaps the tail of a :

string a = "ATTAGACCTGCCGGAA";
string b = "GCCGGAATAC";

var index =
(
    from n in Enumerable.Range(0, a.Length)
    where a.Skip(n).SequenceEqual(b.Take(a.Length - n))
    select n
)
    .DefaultIfEmpty(-1)
    .First();

In this example it returns 9 .

The final output is:

// index is -1 when there is no overlap; in that case simply concatenate
var output = index < 0 ? a + b : a + b.Substring(a.Length - index);

Which evaluates to:

ATTAGACCTGCCGGAATAC

This all assumes that the overlap occurs at the end of a and the beginning of b .

LINQ will not really help you here.

If n and m are the lengths of the left and right messages, it looks like you will have an O(nm) solution...

First, compress your messages.

Since there are only 4 possible letters, you can code each letter on 2 bits.

That is, 4 letters per byte (instead of 2 bytes per letter).

In one 32-bit comparison you will check 16 letters instead of 2.
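
As a rough illustration of the packing idea, here is a Python sketch. The letter-to-code table and the name pack are assumptions chosen for the example, and since Python integers are unbounded the 32-bit word is only simulated here.

```python
# Hypothetical 2-bit code for each letter; any fixed assignment would do.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack(seq):
    """Pack a DNA string into an integer, 2 bits per letter (first letter in the high bits)."""
    value = 0
    for letter in seq:
        value = (value << 2) | CODES[letter]
    return value


a = "ATTAGACCTGCCGGAA"
b = "GCCGGAATAC"
# Comparing two equal-length windows is now a single integer comparison.
print(pack(a[-7:]) == pack(b[:7]))  # True
```

Only windows of equal length should be compared this way, because leading A letters pack to zero bits and so "A" and "AA" produce the same integer.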

Then (enter mystic late drunk thinking) perform two parallel, incremental FFTs, reading the data from the ends you want to merge (from the end of the left message and from the start of the right one). When the FFTs match, you likely have a match. Check for it.

A real implementation would more likely be:

  • Read the data from the ends you want to merge (from the end for the left message and the start for the right one) and, while you read the 'letters' of the two messages:

    • Build the sum of the data. L[n-1]+L[n-2]+L[n-3]+L[n-4]+.. and R[0]+R[1]+R[2]+R[3]+..

    • Build the alternate sum. L[n-1]-L[n-2]+L[n-3]-L[n-4]+.. and R[0]-R[1]+R[2]-R[3]+..

    • Build the 2-alternate sum. L[n-1]+L[n-2]-L[n-3]-L[n-4]+.. and R[0]+R[1]-R[2]-R[3]+..

    • and a few more (4-, 8-, 16-alternate sums).

When you have a match, check for it.

If real DNA gives a lot of false-positive matches, write a paper about it.

[EDIT]

The sum will match. Ok. But the alternate sum will only match in absolute value.

If the messages are ... 4 5 6 and 5 6 7 ...

The sum of the two first values will be 5 + 6 = 11 in both cases.

But the alternate sum will be -5 + 6 = 1 and 5 - 6 = -1.

For the 2,4..-alternate sum you will have an issue...

You need another operation where the order doesn't matter, like multiplication or XOR.

Not sure if I understand the question. I'll guess the following: take two strings, A and B ; if a common part C exists, then D = A + (B - C) .

using System;

class Program
{
    static void Main(string[] args)
    {
        Test test = new Test();

        string a = "ATTAGACCTGCCGGAA";
        string b = "GCCGGAATAC";

        string match = test.Match(a, b); // GCCGGAA

        if (match != null)
        {
            string c = a + b.Remove(b.IndexOf(match), match.Length); // ATTAGACCTGCCGGAATAC

            Console.WriteLine(c);
        }
    }
}

class Test
{
    public string Match(string a, string b)
    {
        if (a == null)
        {
            throw new ArgumentNullException("a");
        }

        if (b == null)
        {
            throw new ArgumentNullException("b");
        }

        string best = null;

        for (int i = 0; i < b.Length; i++)
        {
            string match = Match(a, b, i);

            if (match != null && (best == null || match.Length > best.Length))
            {
                best = match;
            }
        }

        return best;
    }

    private string Match(string a, string b, int offset)
    {
        string best = null;

        for (int i = offset; i < b.Length; i++)
        {
            string s = b.Substring(offset, (i - offset) + 1);
            int index = a.IndexOf(s);

            if (index != -1)
            {
                best = s;
            }
        }

        return best;
    }
}

If you want a more efficient version, modify Test to return an index instead of a string.

Here's mine. I think it's the most concise, and I don't see a way of making it more efficient.

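string Merge(string a, string b)
{
    // Assumes A is always longer.

    // Try the longest candidate overlap first: the first i chars of b
    // must equal the last i chars of a.
    for (int i = b.Length; i > 0; --i)
    {
        // Ordinal comparison avoids culture-sensitive matching.
        if (a.EndsWith(b.Substring(0, i), StringComparison.Ordinal))
            return a + b.Substring(i);
    }

    // No overlap found.
    return null;
}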
