
Is there an alternative to the std::string substring?

Given a string s = "RADILAMIA", I want to take all the substrings of length 4 (or some other length).

If len == 4, the substrings are: "RADI", "ADIL", "DILA", "ILAM", "LAMI", "AMIA". This seems easy to do with the std::string substr method:

vector<string> allSubstr(const string& s, size_t len) {
    vector<string> ans;
    // i + len <= s.size() avoids unsigned underflow when len > s.size()
    for (size_t i = 0; i + len <= s.size(); i++) {
        ans.push_back(s.substr(i, len));
    }
    return ans;
}

The time complexity of substr is unspecified, but it is generally linear in the length of the substring.

Can I do this without std::string substr? Each substring differs from the previous one by only one letter. Is there a better way that reduces the time complexity?

There can be millions of different approaches. Here is my algorithm.

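vector<string> allSubstr(const string& s, size_t len) {

    vector<string> ans;
    if (len > s.size()) return ans;        // no substring of that length
    ans.reserve(s.size() - len + 1);       // exact number of substrings
    for (size_t i = 0; i + len <= s.size(); i++)
    {
        ans.emplace_back(s.begin() + i, s.begin() + i + len);
    }

    return ans;
}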

I have tested it. Which extraction method you use matters less than how the results are stored: emplace_back constructs each string directly inside the vector instead of building a temporary first, and reserve avoids repeated reallocation as the vector grows.

No matter what you do, you still need O(NL) time to write all your substrings into the vector, where N is the length of s and L is the substring length.
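To make that bound concrete, here is a small sketch of the arithmetic. The helper name charsToWrite is hypothetical and not part of any answer above; it only counts how many characters have to be written if every substring is stored as its own copy.

```cpp
#include <cstddef>

// Hypothetical helper: the number of characters that must be written
// when every window of length `len` in a string of length `n` is
// stored as a separate copy.
std::size_t charsToWrite(std::size_t n, std::size_t len) {
    if (len == 0 || len > n) return 0;   // no non-empty windows
    return (n - len + 1) * len;          // (number of windows) * (window length)
}
```

For "RADILAMIA" with len == 4 this gives 6 windows of 4 characters, so 24 characters in total. The count only shrinks if the substrings are not copied at all.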

The fastest approach is probably:

vector<string> ans(s.size() - len + 1);   // one slot per substring; assumes len <= s.size()
for (size_t i = 0; i + len <= s.size(); i++) {
    ans[i] = s.substr(i, len);
}

This avoids push_back, which can be slow when the vector has to reallocate as it grows; sizing the vector up front sidesteps that. push_back tends to be overused.

PS: maybe this code would be even faster:

vector<string> ans(s.size() - len + 1);   // one slot per substring; assumes len <= s.size()
for (size_t i = 0; i + len <= s.size(); i++) {
    ans[i].append(s.begin() + i, s.begin() + i + len);
}

string_view (C++17) has a constant-time substr:

vector<string_view> allSubstr(string_view s, size_t len) {
    vector<string_view> ans;
    if (len > s.size()) return ans;   // no substring of that length
    ans.reserve(s.size() - len + 1);
    for (size_t i = 0; i + len <= s.size(); ++i) {
        ans.push_back(s.substr(i, len));   // O(1): a view, not a copy
    }
    return ans;
}

Just make sure that the string s refers to outlives the return value of the function.
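The lifetime caveat can be sketched as follows. The names windows and own are hypothetical, chosen only for this illustration: windows returns views into the caller's buffer, and own copies them into independent strings for the case where the results must survive the source.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: every window of length `len` as a non-owning view.
// The views point into the caller's buffer, so that buffer must stay
// alive and unmodified for as long as the views are used.
std::vector<std::string_view> windows(std::string_view s, std::size_t len) {
    std::vector<std::string_view> out;
    if (len == 0 || len > s.size()) return out;
    out.reserve(s.size() - len + 1);
    for (std::size_t i = 0; i + len <= s.size(); ++i)
        out.push_back(s.substr(i, len));
    return out;
}

// Hypothetical helper: copy the views into owning strings when the
// results must outlive the source. This pays the O(N*L) copy cost again.
std::vector<std::string> own(const std::vector<std::string_view>& views) {
    std::vector<std::string> out;
    out.reserve(views.size());
    for (std::string_view v : views)
        out.emplace_back(v);
    return out;
}
```

Calling windows on a named std::string that stays in scope is safe. Calling it on a temporary, such as the result of a function returning std::string, would leave every view dangling once the full expression ends.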

You could probably use an array of chars instead. For example, given your word:

char s[] = "RADILAMIA";

To process all the substrings, you can use this approach:

int substLength = 4;
int length = strlen(s);
char buffer[256];
for (int i = 0; i < length - substLength + 1; i++) {
    strncpy(buffer, s + i, substLength);
    buffer[substLength] = '\0';
    cout << buffer << endl;
}

With a char array you can easily access the start of any substring by adding its index to the beginning of the array.

It pays to revisit the docs:

// string proto(len);
// assumes len <= s.size(); there are s.size()-len+1 substrings
vector<string> result(s.size()-len+1, string(len, char(32))); // preallocates the buffers

const char *str=s.c_str();
const char* end=str+s.size()-len; // start of the last substring

for(size_t i=0; str<=end; str++, i++) {
  result[i].assign(str, len); // likely to result in a simple copy into the preallocated buffer
}

The complexity is still O(len*s.size()); one can only hope for a smaller constant factor.

C is not always faster than C++ but @Fomalhaut was right to post the performant core solution in C. Here is my (C program) complete version, based on his algorithm. Without using strncpy, too.

Here it is on Godbolt.

#ifdef __STDC_ALLOC_LIB__
#define __STDC_WANT_LIB_EXT2__ 1
#else
#define _POSIX_C_SOURCE 200809L
#endif

#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
#include <assert.h>

//////////////////////////////////////////////////////////////
// array of buffers == a_of_b
typedef struct a_of_b  {
    const unsigned size;
    unsigned count ;
    char ** data   ;
} a_of_b ;

a_of_b a_of_b_make ( const unsigned size_ ) 
{
   return (a_of_b){ .size = size_, .count = 0, .data = calloc(1, sizeof(char * [size_] ) ) } ;
}

a_of_b * a_of_b_append ( a_of_b * self,  const unsigned len_, const char str_[len_] ) 
{
    assert( self->data ) ;
    assert( self->size > self->count ) ;
    self->data[ self->count ] = strndup( str_, len_ ) ;
    self->count += 1;
    return self ;
}

a_of_b * a_of_b_print ( a_of_b * self , const char * fmt_ ) 
{
    for (unsigned j = 0; j < self->count; ++j)
         printf( fmt_ , self->data[j]);
    return self ;
}

a_of_b * a_of_b_free ( a_of_b * self  ) 
{
    for (unsigned j = 0; j < self->count; ++j)
         free( self->data[j]) ;
    free( self->data) ;
    self->count = 0  ;         
    return self ;
}
//////////////////////////////////////////////////////////////
a_of_b breakit ( const unsigned len_, const char input_[len_], const unsigned  substLength )
{
    assert( len_ > 2 ) ;
    assert( substLength > 0 ) ;
    assert( substLength < len_ ) ;

    const unsigned count_of_buffers = len_ - substLength + 1;

    a_of_b rez_ = a_of_b_make( count_of_buffers +1 ) ;

    for (unsigned i = 0; i < count_of_buffers ; i++) {
        a_of_b_append( &rez_, substLength, input_ + i ) ;
    }

   return rez_ ;
}
//////////////////////////////////////////////////////////////
static void driver( const char * input_, const unsigned substLength ) 
{
    printf("\n");
    a_of_b substrings = breakit( strlen(input_), input_, substLength );
    a_of_b_print( & substrings , "%s ");
    a_of_b_free( & substrings);
}
//////////////////////////////////////////////////////////////
int main () { 

    driver( "RADILAMIA", 4) ;
    driver( "RADILAMIA", 3) ;
    driver( "RADILAMIA", 2) ;
    driver( "RADILAMIA", 1) ;
    
    return EXIT_SUCCESS; 
}

And the program output is:

RADI ADIL DILA ILAM LAMI AMIA 

RAD ADI DIL ILA LAM AMI MIA 

RA AD DI IL LA AM MI IA 

R A D I L A M I A 

Enjoy.

