Code optimization for string comparisons

I have the following code for string comparison:

#include <stdio.h>
#include <string.h>
#include <math.h>
#include <stdlib.h>

int max=0;

int calcMis(char *string,int i, int j,int len)
{
     int mis=0;
     int k=0;
     while(k<len)
     {
             if(string[i+k]!=string[j+k])
                mis+=1;
             if((mis+len-k-1)<=max)
                 return 1;
             else if(mis>max)
                 return 0;
             k=k+1;
     }
}

int main()
{
    char *input=malloc(2000*sizeof(char));
    scanf("%d",&max);
    scanf("%s",input);
    int c=0,i,j,k,x,mis=0;
    int len=strlen(input);
    i=0;
    while(i<len-1)
    {
        j=i;
        while(j<len-1)
         {
             k=i+1;
             x=j-i+1;
             if(x<=max)
                 c=c+len-k-x+1;
             else
                while(k+x<=len)
                {
                  if(strncmp(input+i,input+k,x+1)==0)
                   {
                      if(max>=0)
                          c=c+x;
                   }
                  else
                   c+=calcMis(input,i,k,x);
                  k=k+1;
                }       
            j=j+1;
         }
        i=i+1;
    }   
    printf("%d",c);
    return 0;   
}  

This code is my solution to the following question:

Given a string S and an integer K, find the integer C that equals the number of pairs of substrings (S1, S2) such that S1 and S2 have equal length and Mismatch(S1, S2) <= K, where the mismatch function is defined below.

e.g. for "abc", the substrings are {a, ab, abc, b, bc, c}

Is there a better method than this? Are any optimizations possible in this code?

NOTE: This analysis was written before the asker edited the post to include the rest of the code. The original post, which my answer addresses, made no mention of the main function.


Looking at your code for calcMis, here are some of the readability and style improvements I would make:

  • Remove all of the return statements from within the loop. Not a big deal for small loops but a much bigger deal in larger loops as it's harder to debug when it has 3 or 4 extra cases to leave the loop.
  • Redefine your parameters with respect to what the function does.

Your algorithm runs in O(n) time, but we can reduce the number of operations it performs. My analysis of your algorithm is as follows:

assignment operator            (=)  x4: O(1)
while loop                          x1: O(n), where n is len.
  dereference operator           (*)  x2: O(1)
  less than operator             (<)  x1: O(1)
  does not equal operator        (!=) x1: O(1)
  addition operator              (+)  x4: O(1)
  subtraction operator           (-)  x2: O(1)
  less than or equal to operator (<=) x1: O(1)
  Order: O(n) * (2 * O(1) + O(1) + O(1) + 4 * O(1) + 2 * O(1) + 1 * O(1)) = O(n)
Order: 4 * O(1) + O(n) = O(n)

Here is the improved algorithm (micro-efficiency and readability improvements). It is still linear, but it executes fewer instructions, and the const qualifiers document intent and may help the compiler optimize:

#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// The mismatch limit: the global variable defined in the question's code.
extern int max;

bool calcMis( char const * const str, int const i, int const j, int const len ) {
  // Checks preconditions.
  assert( str != NULL );

  // A zero-length comparison has no mismatches, so it is within the limit.
  if ( len == 0 ) return true;

  // Comparing a position with itself has no mismatches either.
  if ( i == j ) return true;

  // Defines an integer mis, holds the number of mismatches.
  int mis = 0;

  // Iterates over at most len characters and stops as soon as the limit
  // is exceeded. The test must be mis <= max: with mis < max the loop
  // would stop at exactly max mismatches and miss any later ones.
  for ( int k = 0; ( k < len ) && ( mis <= max ); k++ ) {
    // Determines whether there was a mismatch at positions i + k and j + k.
    if ( str[ i + k ] != str[ j + k ] ) mis += 1;
  }

  // Defines a bool result, true when the mismatches stay within the limit.
  bool const result = !( mis > max );

  return result;
}

Here is an idea that might help. First of all, compare all pairs of characters in the string:

void compare_all(char* string, int length, int* comp)
{
    // comp must have room for length * length ints (row-major order).
    for (int i1 = 0; i1 < length; ++i1)
        for (int i2 = 0; i2 < length; ++i2)
            comp[i1 * length + i2] = (string[i1] != string[i2]);
}

Here comp represents a square matrix containing values 0 and 1. Each pair of substrings corresponds to a diagonal section in this matrix. For example, for the string "testing", the following section of the matrix represents substrings "tes" and "tin".

. . . O . . .
. . . . O . .
. . . . . O .
. . . . . . .
. . . . . . .
. . . . . . .
. . . . . . .
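To make the "diagonal section" idea concrete, here is a minimal sketch that builds the matrix and sums one section. The names fill_mismatch_matrix and section_mismatches are hypothetical helpers introduced only for this illustration; they are not part of the code above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fills comp (length * length ints, row-major)
   with 1 where the two characters differ and 0 where they match. */
void fill_mismatch_matrix(const char *s, int length, int *comp)
{
    for (int i1 = 0; i1 < length; ++i1)
        for (int i2 = 0; i2 < length; ++i2)
            comp[i1 * length + i2] = (s[i1] != s[i2]);
}

/* Hypothetical helper: sums the diagonal section that starts at
   (row, col) and spans len cells. This is the mismatch count between
   the substrings s[row .. row+len-1] and s[col .. col+len-1]. */
int section_mismatches(const int *comp, int length, int row, int col, int len)
{
    assert(comp != NULL);
    assert(row >= 0 && col >= 0 && len >= 0);
    assert(row + len <= length && col + len <= length);

    int sum = 0;
    for (int k = 0; k < len; ++k)
        sum += comp[(row + k) * length + (col + k)];
    return sum;
}
```

For "testing", the section drawn above starts at row 0, column 3 and has length 3, so section_mismatches(comp, 7, 0, 3, 3) compares "tes" with "tin" and yields 2 (the pairs e/i and s/n differ).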

You have to count how many sections have a sum of elements no more than k. To do that, examine all the diagonals that are parallel to the main diagonal, one by one. In order not to count things twice, look only at those that are below (or above) the main diagonal (let's include the main diagonal for simplicity).

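// Defined further down; declared here so that count_stuff compiles.
int count_stuff_on_diagonal(int* comp, int jump, int n, int k);

int count_stuff(int* comp, int n, int k)
{
    int result = 0;
    // Each diagonal starts in row 0, at column diag.
    for (int diag = 0; diag < n; ++diag)
    {
        int* first_element_in_diagonal = comp + diag;
        int jump_to_next_element = n + 1;
        int length_of_diagonal = n - diag;
        result += count_stuff_on_diagonal(
            first_element_in_diagonal,
            jump_to_next_element,
            length_of_diagonal,
            k);
    }
    return result;
}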

Now, the problem is a much simpler one: find the number of sections along a sequence of integers, for which the sum is not greater than k. The most straightforward method is by enumerating all such sections.

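// Not shown in this answer; this prototype is an assumed signature.
int count_sum_of_section(const int* first, int jump, int count);

int count_stuff_on_diagonal(int* comp, int jump, int n, int k)
{
    int result = 0;
    // A section covers elements i1 .. i2 - 1 of the diagonal, so i2 must
    // be allowed to reach n, or sections ending at the last element are missed.
    for (int i1 = 0; i1 < n; ++i1)
        for (int i2 = i1 + 1; i2 <= n; ++i2)
        {
            int* first_element_in_section = comp + i1 * jump;
            int mismatches = count_sum_of_section(
                first_element_in_section,
                jump,
                i2 - i1);
            if (mismatches <= k)
                ++result;
        }
    return result;
}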
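The helper count_sum_of_section is called above but not defined in this answer. A straightforward version might look like the sketch below; the signature is an assumption inferred from the call site (a pointer to the first element, the stride between elements, and the number of elements).

```c
#include <assert.h>
#include <stddef.h>

/* Assumed signature, inferred from the call site above.
   Adds up `count` elements, starting at `first` and stepping
   `jump` ints between consecutive elements. */
int count_sum_of_section(const int *first, int jump, int count)
{
    assert(count >= 0);
    assert(first != NULL || count == 0);

    int sum = 0;
    for (int i = 0; i < count; ++i)
        sum += first[i * jump];
    return sum;
}
```

With a 3x3 matrix stored row by row, passing jump = 4 (that is, n + 1) walks down a diagonal. This version costs time proportional to the section length on every call, which is what makes the plain enumeration slow.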

To improve the speed of calculating a sum across a section of consecutive integers, build a table of cumulative sums; use it instead of the matrix of 0s and 1s.
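As a rough sketch of the cumulative-sum idea for a single diagonal: prefix[i] holds the sum of the first i elements, so the sum of elements i1 .. i2 - 1 is prefix[i2] - prefix[i1], computed in constant time. The function name count_sections_with_prefix_sums is hypothetical, and the bound i2 <= n is chosen here so that sections ending at the last element are counted.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical name. Counts the sections [i1, i2) along one diagonal
   whose sum is at most k. `diag` points at the first element, `jump`
   is the stride between elements, and `n` is the diagonal's length.
   Returns -1 if memory allocation fails. */
int count_sections_with_prefix_sums(const int *diag, int jump, int n, int k)
{
    assert(n >= 0 && jump > 0);

    int *prefix = malloc((size_t)(n + 1) * sizeof *prefix);
    if (prefix == NULL)
        return -1;

    /* prefix[i] = sum of the first i elements of the diagonal. */
    prefix[0] = 0;
    for (int i = 0; i < n; ++i)
        prefix[i + 1] = prefix[i] + diag[i * jump];

    int result = 0;
    for (int i1 = 0; i1 < n; ++i1)
        for (int i2 = i1 + 1; i2 <= n; ++i2)
            if (prefix[i2] - prefix[i1] <= k)
                ++result;

    free(prefix);
    return result;
}
```

For the sequence {1, 0, 1} with k = 1, five of the six sections qualify; only the full sequence (sum 2) is rejected. Enumerating all sections of one diagonal is still quadratic in its length, but each sum now costs O(1) instead of a loop over the section.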

(Please excuse me for not using const and VLA, and occasional syntax errors).
