
C remove special characters from string

I am very new to C, and I have created a function that removes special characters from a string and returns a new string (without the special characters).

At first glance this seemed to work well. I now need to run this function on the lines of a (huge) text file (1 million sentences). After a few thousand lines/sentences (about 4,000) I get a segfault.

I don't have much experience with memory allocation and strings in C. I have tried to figure out what the problem with my code is, unfortunately without any luck. Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

char *preproccessString(char *str) {
    // Create a new string of the size of the input string, so this might be bigger than needed but should never be too small
    char *result = malloc(sizeof(str));
    // Array of allowed chars with a 0 on the end to know when the end of the array is reached, I don't know if there is a more elegant way to do this
    // Changed from array to string for sake of simplicity
    char *allowedCharsArray = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    // Initialize two integers
    // i will be increased for every char in the string
    int i = 0;
    // j will be increased every time a new char is added to the result
    int j = 0;
    // Loop over the input string
    while (str[i] != '\0') {
        // l will be increased for every char in the allowed chars array
        int l = 0;
        // Loop over the chars in the allowed chars array
        while (allowedCharsArray[l] != '\0') {
            // If the char (From the input string) currently under consideration (index i) is present in the allowed chars array
            if (allowedCharsArray[l] == toupper(str[i])) {
                // Set char at index j of result string to uppercase version of char currently under consideration
                result[j] = toupper(str[i]);
                j++;
            }
            l++;
        }
        i++;
    }
    return result;
}

Here is the rest of the program. I think the problem is probably here.

int main(int argc, char *argv[]) {
    char const * const fileName = argv[1];
    FILE *file = fopen(fileName, "r");
    char line[256];

    while (fgets(line, sizeof(line), file)) {
        printf("%s\n", preproccessString(line)); 
    }

    fclose(file);

    return 0;
}

You have several problems.

  1. You're not allocating enough space. sizeof(str) is the size of a pointer, not the length of the string. You need to use
char *result = malloc(strlen(str) + 1);

+ 1 is for the terminating null byte. (strlen() is declared in <string.h>, which the original code does not include.)

  2. You didn't add a terminating null byte to the result string. Add
result[j] = '\0';

before return result;

  3. Once you find that the character matches an allowed character, there's no need to keep looping through the rest of the allowed characters. Add break after j++.

  4. Your main() function never frees the results of preproccessString(), so you might be running out of memory. Free each result after printing it:

while (fgets(line, sizeof(line), file)) {
    char *processed = preproccessString(line);
    printf("%s\n", processed); 
    free(processed);
}
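
Taken together, the four fixes above might look like the following minimal sketch. It keeps the question's approach of scanning a list of allowed characters. The name preprocess_alloc is hypothetical (chosen so it does not clash with the original function), and the unsigned char cast and the NULL check on malloc() are extra precautions that are not part of the list above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical name. Returns a newly allocated string that the caller
   must free(), or NULL if the allocation fails. */
char *preprocess_alloc(const char *str) {
    static const char allowed[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    assert(str != NULL);

    /* Fix 1: size the buffer from strlen(), plus one byte for the terminator. */
    char *result = malloc(strlen(str) + 1);
    if (result == NULL)
        return NULL;

    size_t j = 0;
    for (size_t i = 0; str[i] != '\0'; i++) {
        char upper = (char)toupper((unsigned char)str[i]);
        for (size_t l = 0; allowed[l] != '\0'; l++) {
            if (allowed[l] == upper) {
                result[j++] = upper;
                break;  /* Fix 3: stop at the first match. */
            }
        }
    }
    result[j] = '\0';  /* Fix 2: terminate the result. */
    return result;     /* Fix 4 is the caller's job: free() the result. */
}
```

The result can never be longer than the input, so strlen(str) + 1 bytes is always enough.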

You could address most of these problems if you have the caller pass in the result string, instead of allocating it in the function. Just use two char[256] arrays in the main() function.

int main(int argc, char *argv[])
{
    char const* const fileName = argv[1];
    FILE* file = fopen(fileName, "r");
    char line[256], processed[256];

    while (fgets(line, sizeof(line), file)) {
        preprocessString(line, processed);
        printf("%s\n", processed); 
    }

    fclose(file);

    return 0;
}

Then just change the function so that the parameters are:

void preprocessString(const char *str, char *result)
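
The answer only gives the new signature, so the body below is a sketch of one way to fill it in, not the answerer's own code. It assumes that result points to a buffer of at least strlen(str) + 1 bytes, which holds when both arrays in main() are char[256]. Using strchr() for the lookup is my substitution for the original inner loop.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Assumption: result has room for at least strlen(str) + 1 bytes. */
void preprocessString(const char *str, char *result) {
    static const char allowed[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    assert(str != NULL && result != NULL);

    size_t j = 0;
    for (size_t i = 0; str[i] != '\0'; i++) {
        int upper = toupper((unsigned char)str[i]);
        /* strchr() returns non-NULL when upper is one of the allowed letters. */
        if (strchr(allowed, upper) != NULL)
            result[j++] = (char)upper;
    }
    result[j] = '\0';  /* always terminate, even if nothing was copied */
}
```

Because nothing is allocated here, there is nothing for main() to free.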

A good rule of thumb is to make sure there is one free for every malloc/calloc call.

Also, a good tool to keep note of for the future is Valgrind. It's very good at catching these kinds of errors.

The following proposed code:

  1. cleanly compiles
  2. performs the desired functionality
  3. properly checks for errors
  4. properly checks for length of input string parameter
  5. uses strchr() to search the allowed characters (note that strchr() also matches the terminating NUL byte)
  6. limits scope of visibility of local variables
  7. expects the calling function to clean up by passing the returned value to free()
  8. expects the calling function to check the returned value for NULL
  9. uses explicit casts to tell the compiler that the conversions are intended
  10. declares allowedCharsArray as static so it does not have to be re-initialized on each call, and marks it as const to help the compiler catch errors

and now the proposed code: (note: edited per comments)

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
#include <string.h>

char *preproccessString(char *str) 
{
    // Create a new string of the size of the input string, so this might be bigger than needed but should never be too small
    char *result = calloc( strlen(str)+1, sizeof( char ) );
    if( !result )
    {
        perror( "calloc failed" );
        return NULL;
    }

    // Array of allowed chars 
    static const char *allowedCharsArray = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Loop over the input string
    for( int  j=0, i=0; str[i]; i++) 
    {
        if( strchr( allowedCharsArray, (char)toupper( (unsigned char)str[i] ) ) )
        {
            // Set char at index j of result string to uppercase version of char currently under consideration
            result[j] = (char)toupper( (unsigned char)str[i] );
            j++;
        }
    }
    return result;
}

There are some major issues in your code:

  • the amount of memory allocated is incorrect: sizeof(str) is the number of bytes in a pointer, not the length of the string it points to (and even the length alone would be too small, since it leaves no room for the null terminator). You should write char *result = malloc(strlen(str) + 1);

  • the memory allocated in preproccessString is never freed, which leaks memory and could make the program run out of memory on very large files.

  • you do not set a null terminator at the end of the result string
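
To make the first point concrete, here is a small sketch comparing the two sizes. Both function names are hypothetical and exist only for illustration. On a typical 64-bit platform a pointer is 8 bytes, so under that assumption any line that keeps more than 8 letters writes past the end of the original buffer, which would explain a crash that only shows up after many lines.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* What the original code asks malloc() for: the size of the pointer itself. */
size_t size_requested_by_original(const char *str) {
    assert(str != NULL);
    return sizeof(str);  /* same as sizeof(const char *), whatever str points to */
}

/* What the result buffer can need in the worst case:
   every character of the input plus the null terminator. */
size_t size_needed(const char *str) {
    assert(str != NULL);
    return strlen(str) + 1;
}
```

For a 43-character line, size_needed() returns 44, while size_requested_by_original() returns the same small constant for every input.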

Lesser issues:

  • you do not check whether a filename was passed, nor whether fopen() succeeded.
  • there is a typo in preproccessString; it should be preprocessString.
  • you could avoid memory allocation by passing a properly sized destination array.
  • you could use isalpha instead of testing every letter.
  • you should cast the char values to unsigned char when passing them to toupper, because char may be a signed type and toupper is undefined for negative values except EOF.
  • there are too many comments in your source file; most of them state the obvious and make the code less readable.

Here is a modified version:

#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// transform the string in `str` into buffer dest, keeping only letters and uppercasing them.
char *preprocessString(char *dest, const char *str) {
    int i, j;
    for (i = j = 0; str[i] != '\0'; i++) {
        if (isalpha((unsigned char)str[i]))
            dest[j++] = toupper((unsigned char)str[i]);
    }
    dest[j] = '\0';
    return dest;
}

int main(int argc, char *argv[]) {
    char line[256];
    char dest[256];
    char *filename;
    FILE *file;

    if (argc < 2) {
        fprintf(stderr, "missing filename argument\n");
        return 1;
    }
    filename = argv[1];
    if ((file = fopen(filename, "r")) == NULL) {
        fprintf(stderr, "cannot open %s: %s\n", filename, strerror(errno));
        return 1;
    }
    while (fgets(line, sizeof(line), file)) {
        printf("%s\n", preprocessString(dest, line)); 
    }
    fclose(file);

    return 0;
}
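
The version above is safe because dest and line have the same size. If the destination could be smaller than the input, one option is to pass its size explicitly. The following is a sketch of such a variant; the name preprocessStringN and the dest_size parameter are hypothetical and not part of the answer, and silently truncating is only one possible policy.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical variant: writes at most dest_size - 1 letters plus a terminator.
   Returns dest, or NULL if dest_size is 0 (no room even for the terminator). */
char *preprocessStringN(char *dest, size_t dest_size, const char *str) {
    assert(dest != NULL && str != NULL);
    if (dest_size == 0)
        return NULL;

    size_t j = 0;
    /* Stop early once only the terminator's slot is left. */
    for (size_t i = 0; str[i] != '\0' && j + 1 < dest_size; i++) {
        unsigned char c = (unsigned char)str[i];
        if (isalpha(c))
            dest[j++] = (char)toupper(c);
    }
    dest[j] = '\0';
    return dest;
}
```

A call such as preprocessStringN(dest, sizeof(dest), line) then stays within bounds regardless of how long line is.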

I think the problem is that malloc allocates memory from the heap, and since you call this function again and again without freeing the results, you are running out of memory. To solve this, call free() on the pointer returned by your preprocessString function in your main block:

char *result=preprocessString(inputstring);
//Do whatever you want to do with this result
free(result);
