
How to Find all occurrences of a Substring in C

I am trying to write a parsing program in C that will take certain segments of text from an HTML document. To do this, I need to find every instance of the substring "name": in the document; however, the C function strstr only finds the first instance of a substring. I cannot find a function that finds anything beyond the first instance, and I have considered deleting each substring after I find it so that strstr will return the next one. I cannot get either of these approaches to work.

By the way, I know the while loop limits this to six iterations, but I was just testing this to see if I could get the function to work in the first place.

while(entry_count < 6)
{   
    printf("test");
    if((ptr = strstr(buffer, "\"name\":")) != NULL)
    {   
        ptr += 8;
        int i = 0;
        while(*ptr != '\"')
        {   
            company_name[i] = *ptr;
            ptr++;
            i++;
        }   
        company_name[i] = '\n';
        int j;
        for(j = 0; company_name[j] != '\n'; j++)
            printf("%c", company_name[j]);
        printf("\n");
        strtok(buffer, "\"name\":");
        entry_count++;
    }   
}   

Just pass the returned pointer, plus one, back to strstr() to find the next match:

char *ptr = strstr(buffer, target);
while (ptr) {
    /* ... do something with ptr ... */
    ptr = strstr(ptr+1, target);
}
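
As a fuller sketch of the loop above, the helper below records the offset of every match. The name find_all and its parameters are illustrative and not part of the original answer. Advancing by one character, as ptr+1 does, also reports overlapping matches; advancing by the length of the target skips them. An empty target is rejected because strstr() would match it at every position.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the answer above): stores the offset of each
 * match of target in text into out[], up to max entries, and returns the
 * total number of matches found (which may exceed max). */
size_t find_all(const char *text, const char *target,
                size_t *out, size_t max, int allow_overlap)
{
    size_t count = 0;
    size_t len = strlen(target);
    size_t step;
    const char *p;

    if (len == 0)
        return 0;   /* strstr() matches an empty target everywhere */

    /* Step 1 keeps overlapping matches; step len skips past each match. */
    step = allow_overlap ? 1 : len;

    for (p = strstr(text, target); p != NULL; p = strstr(p + step, target)) {
        if (count < max)
            out[count] = (size_t)(p - text);
        count++;
    }
    return count;
}
```

For example, searching "aaaa" for "aa" gives offsets 0, 1 and 2 with allow_overlap set, and offsets 0 and 2 without it.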

P.S. While you certainly can do this, I'd suggest that you consider more suitable tools for the job:

  • C is a very low-level language, and trying to write string parsing code in it is laborious (especially if you insist on coding everything from scratch, instead of using existing parsing libraries or parser generators) and prone to bugs (some of which, like buffer overruns, can create security holes). There are plenty of higher-level scripting languages (like Perl, Ruby, Python or even JavaScript) that are much better suited for tasks like this.

  • When parsing HTML, you really should use a proper HTML parser (preferably combined with a good DOM builder and query tool). This will allow you to locate the data you want based on the structure of the document, instead of just matching substrings in the raw HTML source code. A real HTML parser will also transparently take care of issues like character set conversion and decoding of character entities. (Yes, there are HTML parsers for C, such as Gumbo and Hubbub, so you can and should use one even if you insist on sticking to C.)
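
If you do stay with plain C and strstr(), the buffer-overrun risk mentioned above can at least be contained by bounding every copy. The following is only a sketch under stated assumptions: the helper name next_name is hypothetical, values are assumed to look like "name":"value" with no escaped quotes inside, and overly long values are silently truncated.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the quoted value that follows the next
 * "name": key at or after *cursor into dst (at most dstsize - 1 chars,
 * always NUL-terminated). On success it advances *cursor past the value
 * and returns 1; it returns 0 when no further complete value exists. */
int next_name(const char **cursor, char *dst, size_t dstsize)
{
    static const char key[] = "\"name\":";
    const char *p;

    if (dstsize == 0)
        return 0;

    p = strstr(*cursor, key);
    while (p != NULL) {
        p += sizeof key - 1;                /* skip the key itself */
        while (*p == ' ' || *p == '\t')     /* allow spaces after the colon */
            p++;
        if (*p == '"') {
            const char *start = p + 1;
            const char *end = strchr(start, '"');
            size_t n;

            if (end == NULL)
                return 0;   /* unterminated value: stop, do not run off the end */
            n = (size_t)(end - start);
            if (n > dstsize - 1)
                n = dstsize - 1;            /* truncate to fit dst */
            memcpy(dst, start, n);
            dst[n] = '\0';
            *cursor = end + 1;
            return 1;
        }
        p = strstr(p, key);   /* key without a quoted value: keep searching */
    }
    return 0;
}
```

With the variables from the question, a caller would set a const char *cur to buffer and then loop on next_name(&cur, company_name, sizeof company_name), printing company_name on each pass. Unlike the original loop, this stops at the end of the input and never writes past the destination array.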

/*  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *\
 *                                                  *
 *  SubStg with parameters in the execution line    *
 *  Must use 2 parameters                           *
 *  The 1st is the string to be searched            *
 *  The 2nd is the substring                        *
 *  e.g.:  ./Srch "this is the list" "is" >stuff    *
 *  e.g.:  ./Srch "$(<Srch.c)" "siz"                *
 *  (ref: http://1drv.ms/1PuVpzS)                   *
 *  © SJ Hersh 15-Jun-2020                          *
 *                                                  *
\*  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  *  */


#include <stdio.h>
#include <stdlib.h>
#include <string.h>
typedef char* char_ptr;
typedef unsigned int* int_ptr;
#define NOMEM ( int_ptr )0

int main( int parm, char** stgs )
{
   char_ptr string, substg;
   unsigned int sizstg, sizsub, endsiz, *ary;
   int_ptr startmem;
   register unsigned int x, y, ctr=0;

   if( parm != 3 )
   {
      printf( "ERR: You need exactly 2 string arguments\n" );
      return ( -8 );
   }

   string = stgs[ 1 ];
   substg = stgs[ 2 ];
   sizstg = strlen( string );
   sizsub = strlen( substg );
   endsiz = sizstg - sizsub + 1;


      /* Check boundary conditions: */

   if( ( sizstg == 0 ) || ( sizsub == 0 ) )
   {
      printf( "ERR: Neither string can be nul\n" );
      return( -6 );
   }

   if( sizsub > sizstg )
   {
      printf( "ERR: Substring is larger than String\n" );
      return( -7 );
   }

   if( NOMEM == ( ary = startmem = malloc( endsiz * sizeof( int ) ) ) )
   {
      printf( "ERR: Not enough memory\n" );
      return( -9 );
   }


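      /* Algorithm: ary[ x ] ends up non-zero iff substg occurs at string[ x ] */

   printf( "Positions:\t" );

      /* Pass 1: flag every start position whose character matches substg[ 0 ] */
   for( x = 0; x < endsiz; x++ )
      *ary++ = string[ x ] == substg[ 0 ];

      /* Pass 2: for each later character y of substg, clear the flag of any
         start position where string[ start + y ] does not match substg[ y ] */
   for( y = 1, ary = startmem; y < sizsub; y++, ary = startmem )
      for( x = y; x < ( endsiz + y ); x++ )
         *ary++ &= string[ x ] == substg[ y ];

      /* Pass 3: print the positions whose flag survived (ary is back at startmem) */
   for( x = 0; ( x < endsiz ); x++ )
      if( *ary++ )
      {
         printf( "%u\t", x );
         ctr++;
      }

   printf( "\nCount:\t%u\n", ctr );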
   free( startmem );
   return( 0 );
}
