
I'm trying to make a string parser, but something is going wrong

I was trying to make a text parser which separates words in a string based on the space character. However, something is going wrong.

#include <stdio.h>
#include <string.h>

int main() {
    // the string must end with a space so that every word is counted
    char name[30] = "hello world from jordan ";
    int start = 0;
    int end = strlen(name);
    int end_word = start;
    char full[20][20];

    memset(full, 0, 400);

    int number_of_words = 0;

    for (int w = 0; w < end; w++) {
        if (name[w] == ' ') {
            number_of_words++;
        }
    }

    int counter = 0;

    while (counter < number_of_words) {
        for (int i = start; i < end; i++) {
            if (name[i] == ' ') {
                start = i;
                break;
            }
        }

        for (int j = end_word; j < start; j++) {
            full[counter][j] = name[j];
        }

        end_word = start;
        start++;
        counter++;
    }

    for (int x = 0; x < 20; x++) {
        for (int y = 0; y < 20; y++) {
            printf("%c", full[x][y]);
        }

        printf("%d", x);
    }

    return 0;
}

Here is the strange output I get when I run the code:

 hello0 world1 from2 jor3dan45678910111213141516171819

The first three words are stored correctly, but the fourth is not, and I don't know why this is happening.

I would like an explanation of the problem and, if possible, a more efficient way of writing this code without pointers.

Note: I'm a beginner, which is why I'm asking for a solution without pointers.

To start, trying to avoid pointers in C will be (very) hard. Just by their nature, arrays become pointers the instant you want to do anything useful with them. Array subscripting is syntactic sugar over pointer arithmetic (foo[2] is the same as *(foo + 2)). Passing an array to a function will cause it to decay to a pointer to the first element.

You use pointers several times in your code, whether you realise it or not.


As for the code...

Quick note: size_t, not int, is the appropriate type to use when working with memory sizes / indexing. I'll be using it in the "corrected" versions of the code, and you should try to use it in general, moving forward.

The output is a bit confusing because everything is printed on a single line. Let's clean that up, and add some debugging information, like the length of each string you've stored.

for (size_t x = 0; x < 20; x++) {
    printf("%zu [length: %zu]: ", x, strlen(full[x]));

    for (size_t y = 0; y < 20; y++)
        printf("%c", full[x][y]);

    putchar('\n');
}

Now we get the output, across several lines (some duplicates collapsed for brevity), of:

0 [length: 5]: hello
1 [length: 0]:  world
2 [length: 0]:  from
3 [length: 0]:  jor
4 [length: 3]: dan
5 [length: 0]: 
...
19 [length: 0]: 

From this we can see a few notable things.

  • We have an additional, fifth "string", when we were only expecting four.
  • Our first and fifth "strings" have the apparent correct length, whilst
  • Our second through fourth "strings" have an apparent length of 0, and would seem to include spaces.

The zero lengths mean some of our arrays are starting with the null-terminating byte ('\0'), and we are only seeing output because we manually walk the entirety of each array.

Note that most terminals will do "nothing" when a null character is to be printed, meaning we appear to skip directly to our "strings". We can better visualize what is happening by always printing something:

printf("%c", full[x][y] ? full[x][y] : '*');

In this case we print an asterisk when we encounter a null character, giving us the output:

0 [length: 5]: hello***************
1 [length: 0]: ***** world*********
2 [length: 0]: *********** from****
3 [length: 0]: **************** jor
4 [length: 3]: dan*****************
5 [length: 0]: ********************
...
19 [length: 0]: ********************

This very clearly shows where in memory our characters have been placed.

The primary issue is that in this loop

for (int j = end_word; j < start; j++) {
    full[counter][j] = name[j];
}

j is initialized to a position relative to the beginning of name, but is also used as the index into full[counter]. Excluding our first substring, when end_word is 0, this puts us farther and farther away from the zeroth index of each subarray, eventually crossing the borders between arrays.

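This happens to run without crashing because 2D arrays in C are laid out contiguously in memory, so a write past the end of one subarray lands at the start of the next. Strictly speaking, indexing a subarray out of bounds is undefined behaviour, so it should not be relied upon.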

To fix this, we must copy our characters using a separate index that starts at zero for each subarray.

for (size_t j = end_word, k = 0; j < start; j++, k++) {
    full[counter][k] = name[j];
}

Now when we print our arrays out we can limit ourselves to our known number_of_words (for (size_t x = 0; x < number_of_words; x++)), giving us the output:

0 [length: 5]: hello***************
1 [length: 6]:  world**************
2 [length: 5]:  from***************
3 [length: 7]:  jordan*************

This looks roughly correct, but includes the preceding space in the "word". We can skip past these spaces by setting end_word to the next character instead:

start++;
end_word = start;
counter++;

Now our output looks properly split:

0 [length: 5]: hello***************
1 [length: 5]: world***************
2 [length: 4]: from****************
3 [length: 6]: jordan**************

Note that these are (now properly formatted) null-terminated strings, and could be printed using the %s specifier, as in:

for (size_t x = 0; x < number_of_words; x++)  
    printf("%zu [length: %zu]: %s\n", x, strlen(full[x]), full[x]);

Overall this is a bit fragile, as it requires the trailing delimiting space in order to work, and will create an empty string each time a delimiting space is repeated (or if the source string starts with a space).


As an aside, this similar example should showcase a straightforward method for tokenizing a string, while skipping over all delimiters, and includes some important annotations.

#include <stdio.h>
#include <string.h>

int main(void) {
    char name[30] = "hello world from jordan";
    char copies[20][30] = { 0 };
    size_t length_of_copies = 0;

    size_t hold_position = 0;
    size_t substring_span = 0;
    size_t i = 0;

    do {
        /* our substring delimiters */
        if (name[i] == ' ' || name[i] == '\0') {
            /* only copy non-zero spans of non-delimiters */
            if (substring_span) {
                /* `strncpy` will not insert a null terminating character
                 * into the destination if it is not found within the span
                 * of characters of the source string...
                 */
                strncpy(
                    copies[length_of_copies],
                    name + hold_position,
                    substring_span
                );

                /* ...so we must manually insert a null terminating character
                 * (or otherwise rely on our memory being initialized to all-zeroes)
                 * */
                copies[length_of_copies++][substring_span] = '\0';
                substring_span = 0;
            }

            /* let's assume our next position will be the start of a substring */
            hold_position = i + 1;
        } else
            substring_span++;

        /* checking our character at the end of the loop,
         * and incrementing after the fact,
         * lets us include the null terminating character as a delimiter,
         * as we will only fail to enter the loop after processing it
         */
    } while (name[i++] != '\0');

    for (size_t i = 0; i < length_of_copies; i++)
        printf("%zu: [%s]\n", i + 1, copies[i]);
}
