How to apply Memory Allocation to a C Program which counts the amount of words in a list? (eg. malloc, calloc, free)

Considering the code provided by @David C. Rankin in this previous answer:

How to count only words that start with a Capital in a list?

How can this code be changed to use dynamic memory allocation so that it handles much larger text files? As written below, it only works for small .txt files.

What is the best way to add memory allocation to this code so that the C program does not run out of memory? Is it best to use linked lists?

/**
 * C program to count occurrences of all words in a file.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <limits.h>

#define MAX_WORD     50     /* max word size */
#define MAX_WORDS   512     /* max number of words */

#ifndef PATH_MAX
#define PATH_MAX   2048     /* max path (defined for Linux in limits.h) */
#endif

typedef struct {            /* use a struct to hold */
    char word[MAX_WORD];    /* lowercase word, and */
    int cap, count;         /* if it appears capitalized, and its count */
} words_t;

char *strlwr (char *str)    /* convert str to lowercase in place */
{
    char *p = str;

    while (*p) {
        *p = tolower((unsigned char)*p);    /* cast avoids UB for negative char */
        p++;
    }

    return str;
}

int main (void) {

    FILE *fptr;
    char path[PATH_MAX], word[MAX_WORD];
    size_t i, len, index = 0;

    /* Array of struct of distinct words, initialized all zero */
    words_t words[MAX_WORDS] = {{ .word = "" }};

    /* Input file path */
    printf ("Enter file path: ");
    if (scanf ("%s", path) != 1) {  /* validate every input */
        fputs ("error: invalid file path or cancellation.\n", stderr);
        return 1;
    }

    fptr = fopen (path, "r");   /* open file */
    if (fptr == NULL) {         /* validate file open */
        fputs ( "Unable to open file.\n"
                "Please check you have read privileges.\n", stderr);
        exit (EXIT_FAILURE);
    }

    while (index < MAX_WORDS &&                 /* protect array bounds  */
            fscanf (fptr, "%49s", word) == 1) { /* read at most MAX_WORD-1 chars */
        int iscap = 0, isunique = 1;    /* is capital, is unique flags */

        if (isupper ((unsigned char)*word))     /* does the word start uppercase */
            iscap = 1;

        /* remove all trailing punctuation characters */
        len = strlen (word);                    /* get length */
        while (len && ispunct((unsigned char)word[len - 1]))   /* only if len > 0 */
            word[--len] = 0;

        strlwr (word);                  /* convert word to lowercase */

        /* check if word exists in list of all distinct words */
        for (i = 0; i < index; i++) {
            if (strcmp(words[i].word, word) == 0) {
                isunique = 0;               /* set unique flag zero */
                if (iscap)                  /* if capital flag set */
                    words[i].cap = iscap;   /* set capital flag in struct */
                words[i].count++;           /* increment word count */
                break;                      /* bail - done */
            }
        }
        if (isunique) { /* if unique, add to array, increment index */
            memcpy (words[index].word, word, len + 1);  /* have len */
            if (iscap)                      /* if cap flag set */
                words[index].cap = iscap;   /* set capital flag in struct */
            words[index++].count++;         /* increment count & index */
        }
    }
    fclose (fptr);  /* close file */

    /*
     * Print occurrences of all words in file.
     */
    puts ("\nOccurrences of all distinct words with Cap in file:");
    for (i = 0; i < index; i++) {
        if (words[i].cap) {
            strcpy (word, words[i].word);
            *word = toupper (*word);
            /*
             * %-15s prints string in 15 character width.
             * - is used to print string left align inside
             * 15 character width space.
             */
            printf("%-15s %d\n", word, words[i].count);
        }
    }

    return 0;
}

Example Use/Output

Using your posted input

$ ./bin/unique_words_with_cap
Enter file path: dat/girljumped.txt

Occurrences of all distinct words with Cap in file:
Any             7
One             4
Some            10
The             6
A               13

What is the best way to add memory allocation to this code so that the C program does not run out of memory?

Notice that most computers, even cheap laptops, have quite a lot of RAM. In practice, you could expect to be able to allocate at least a gigabyte of memory. That is a lot for textual file processing!

The Bible is an example of a large human-written text. As a rule of thumb, that text takes about 16 megabytes (to within a factor of two). For most computers, that is quite a small amount of memory today (my AMD2970WX has more than that in its CPU cache).

Is it best to use linked lists?

The practical consideration is algorithmic time complexity more than memory consumption. For example, searching for something in a linked list takes linear time, and going through a list of a million words does take some time (even if computers are fast).
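To illustrate the time-complexity point, here is a minimal sketch of one alternative to a linear scan: a small chained hash table. It is not part of the code above; NBUCKETS, node_t, hash_word, count_word and free_words are all hypothetical names chosen for this example, and the bucket count is an arbitrary assumption.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_WORD   50       /* max word size, as in the question's code */
#define NBUCKETS 1024       /* hypothetical bucket count (an assumption) */

typedef struct node {
    struct node *next;      /* next entry that hashed to the same bucket */
    int count;              /* occurrences seen so far */
    char word[MAX_WORD];    /* the stored word */
} node_t;

static node_t *buckets[NBUCKETS];   /* all buckets start out NULL */

/* simple multiplicative string hash reduced to a bucket index */
static size_t hash_word (const char *s)
{
    size_t h = 5381;

    while (*s)
        h = h * 33 + (unsigned char)*s++;

    return h % NBUCKETS;
}

/* Increment the count for word, adding it if it has not been seen.
 * Returns the updated count, or 0 if the word is too long or
 * allocation fails. Only the entries in one bucket are compared. */
int count_word (const char *word)
{
    size_t b = hash_word (word), len = strlen (word);
    node_t *n;

    if (len >= MAX_WORD)
        return 0;

    for (n = buckets[b]; n; n = n->next)
        if (strcmp (n->word, word) == 0)
            return ++n->count;

    n = calloc (1, sizeof *n);
    if (!n)
        return 0;

    memcpy (n->word, word, len + 1);
    n->count = 1;
    n->next = buckets[b];       /* push onto the front of the bucket list */
    buckets[b] = n;

    return 1;
}

/* free every node in every bucket */
void free_words (void)
{
    for (size_t i = 0; i < NBUCKETS; i++) {
        node_t *n = buckets[i];
        while (n) {
            node_t *next = n->next;
            free (n);
            n = next;
        }
        buckets[i] = NULL;
    }
}
```

With a reasonable hash and enough buckets, each lookup compares against only a handful of words rather than the whole list, which is what matters once the number of distinct words gets large.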

You may want to read more about:

  • flexible array members (use one in your words_t instead of the fixed-size word array).
  • string duplication routines like strdup or asprintf. Even if you don't have them, reprogramming them is a fairly easy task.
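As a rough sketch of both suggestions, assuming a hypothetical word_entry type in place of words_t and hypothetical helper names word_entry_new and dup_string:

```c
#include <stdlib.h>
#include <string.h>

/* hypothetical variant of words_t using a flexible array member */
typedef struct {
    int cap, count;     /* capitalized flag and occurrence count */
    char word[];        /* storage for the word follows the struct */
} word_entry;

/* Allocate one entry sized exactly for s (no MAX_WORD limit).
 * Returns NULL on allocation failure; caller releases with free(). */
word_entry *word_entry_new (const char *s)
{
    size_t len = strlen (s);
    word_entry *e = malloc (sizeof *e + len + 1);   /* +1 for the '\0' */

    if (!e)
        return NULL;

    e->cap = 0;
    e->count = 0;
    memcpy (e->word, s, len + 1);

    return e;
}

/* simple stand-in for strdup where it is not available */
char *dup_string (const char *s)
{
    size_t len = strlen (s) + 1;    /* include the terminating '\0' */
    char *p = malloc (len);

    if (p)
        memcpy (p, s, len);

    return p;
}
```

Because each word_entry has a different size, such entries cannot be stored in a plain array of structs; you would keep an array (or list) of pointers to them instead.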

But you still want to avoid memory leaks and, even more importantly, undefined behavior.

Read How to debug small programs. Tools like valgrind, the clang static analyzer, the gdb debugger, the address sanitizer, etc. are very useful to learn and use.

Finally, read carefully, and in full, Norvig's Teach yourself programming in 10 years. That text is thought-provoking, and its appendix at least is surprisingly close to your questions.

PS. I leave you to guess and estimate the total amount of text, in bytes, you are capable of reading during your entire life. That size is surprisingly small and probably fits in any smartphone today. On today's devices, text is really cheap. Photos and videos are not.

NB. "What is the best way" questions are too broad and a matter of opinion, so they are off-topic here. They also relate to the P vs NP question, to Rice's theorem, and to the halting problem. Such questions usually have no clear answer and are often considered unsolvable: it is difficult to prove that a better answer could not be thought of in a dozen years (even if, for some such questions, a proof exists today: e.g. comparison-based sorting is proved to require on the order of n log n comparisons in the worst case).

Since you already have an answer that uses a fixed-size array of struct to hold the information, switching to dynamically allocated storage is straightforward. With the fixed-size array, storage is reserved for you automatically (typically on the stack); with dynamic allocation, you can realloc as needed. The change simply requires declaring a pointer-to-type rather than an array-of-type, and then allocating storage for the structs.

Where before, with a fixed-size array of 512 elements you would have:

#define MAX_WORDS   512     /* max number of words */
...
    /* Array of struct of distinct words, initialized all zero */
    words_t words[MAX_WORDS] = {{ .word = "" }};

When dynamically allocating, simply declare a pointer-to-type and provide an initial allocation of some reasonable number of elements, eg

#define MAX_WORDS     8     /* initial number of struct to allocate */
...
    /* pointer to allocated block of max_words struct initialized zero */
    words_t *words = calloc (max_words, sizeof *words);

(note: you can allocate with malloc, calloc or realloc, but only calloc allocates and also sets all bytes to zero. In your case, since you want the .cap and .count members initialized to zero, calloc is a sensible choice.)

It's worth pausing a bit to understand that whether you use a fixed-size array or an allocated block of memory, you access your data through a pointer to the first element. The only real difference is that with a fixed array the compiler reserves the storage for you (typically on the stack), while with allocation you are responsible for reserving it.

Access to the elements will be exactly the same because, on access, an array is converted to a pointer to its first element. See: C11 Standard - 6.3.2.1 Other Operands - Lvalues, arrays, and function designators (p3). When dynamically allocating, you assign the address of the first element to your pointer rather than having the compiler reserve storage for the array. Whether it is an array with storage reserved for you, or a pointer to which you assign an allocated block of memory, how you access the elements will be identical. (pause done)
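A small sketch of that point: the hypothetical helper total_count below (not part of the original program) takes a pointer to the first element and neither knows nor cares whether the storage behind it is a fixed array or an allocated block.

```c
#include <stdlib.h>     /* size_t, calloc, free */

#define MAX_WORD 50     /* max word size, as in the program above */

typedef struct {
    char word[MAX_WORD];
    int cap, count;
} words_t;

/* Sum the counts of the first n elements. The parameter is a pointer
 * to the first element, so the same indexing works for both kinds
 * of storage. */
int total_count (const words_t *words, size_t n)
{
    int total = 0;

    for (size_t i = 0; i < n; i++)
        total += words[i].count;

    return total;
}
```

Calling total_count (fixed, 3) with words_t fixed[3] and total_count (dyn, 3) with words_t *dyn = calloc (3, sizeof *dyn) index the elements in exactly the same way; only the second requires a later free (dyn).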

When you allocate, it is up to you to validate that the allocation succeeds. So you would follow your allocation with:

    if (!words) {   /* validate every allocation */
        perror ("calloc-words");
        exit (EXIT_FAILURE);
    }

You are already keeping track of index, which tells you how many struct you have filled. You simply need to add one more variable to track how many struct are available (size_t max_words = MAX_WORDS; gives you that second variable, set to the initial allocation size MAX_WORDS). So your test for "Do I need to realloc now?" is simply filled == available, or in your case if (index == max_words).

Since you now have the ability to realloc, your read loop no longer has to protect your array bounds and you can simply read each word in the file, eg

    while (fscanf (fptr, "%49s", word) == 1) {  /* read at most MAX_WORD-1 chars */
        int iscap = 0, isunique = 1;    /* is capital, is unique flags */
        ...

Now all that remains is the index == max_words test before you fill another element. You can either place the test and realloc before the for and if blocks for handling isunique , which is fine, or you can actually place it within the if (isunique) block since technically unless you are adding a unique word, no realloc will be required. The only difference it makes is a corner-case where index == max_words and you call realloc before your for loop where the last word is not-unique, you may make one call to realloc where it wasn't technically required (think through that).

To prevent that one realloc too many, place the test and realloc immediately before the new element will be filled, eg

        if (isunique) { /* if unique, add to array, increment index */
            if (index == max_words) {       /* is realloc needed? */
                /* always use a temporary pointer with realloc */
                void *tmp = realloc (words, 2 * max_words * sizeof *words);
                if (!tmp) {
                    perror ("realloc-words");
                    break;  /* don't exit, original data still valid */
                }
                words = tmp;    /* assign reallocated block to words */
                /* (optional) set all new memory to zero */
                memset (words + max_words, 0, max_words * sizeof *words);
                max_words *= 2; /* update max_words to reflect new limit */
            }
            memcpy (words[index].word, word, len + 1);  /* have len */
            if (iscap)                      /* if cap flag set */
                words[index].cap = iscap;   /* set capital flag in struct */
            words[index++].count++;         /* increment count & index */
        }

Now let's look closer at the reallocation itself, eg

            if (index == max_words) {       /* is realloc needed? */
                /* always use a temporary pointer with realloc */
                void *tmp = realloc (words, 2 * max_words * sizeof *words);
                if (!tmp) { /* validate every allocation */
                    perror ("realloc-words");
                    break;  /* don't exit, original data still valid */
                }
                words = tmp;    /* assign reallocated block to words */
                /* (optional) set all new memory to zero */
                memset (words + max_words, 0, max_words * sizeof *words);
                max_words *= 2; /* update max_words to reflect new limit */
            }

The realloc call itself is void *tmp = realloc (words, 2 * max_words * sizeof *words);. Why not just words = realloc (words, 2 * max_words * sizeof *words);? Answer: never assign the result of realloc directly to the pointer being reallocated; always use a temporary pointer. Why? On success, realloc returns a pointer to a block of the new size that holds the existing data (it may extend the block in place, or allocate new storage, copy the existing data to it and free the old block). When (not if) realloc fails, it returns NULL and does not touch the old block of memory. If you blindly assign that NULL to your existing pointer words, you have just overwritten the address of your old block of memory with NULL, creating a memory leak, because you no longer have a reference to the old block and it cannot be freed. So, lesson learned: always realloc with a temporary pointer!
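One way to make the temporary-pointer rule hard to forget is to wrap it in a helper. The following is only a sketch; resize_words is a hypothetical name, not something the program above defines, and the overflow check is an extra precaution added here.

```c
#include <stdint.h>     /* SIZE_MAX */
#include <stdlib.h>

#define MAX_WORD 50     /* max word size, as in the program above */

typedef struct {
    char word[MAX_WORD];
    int cap, count;
} words_t;

/* Resize the block *words to hold new_count elements.
 * Returns 1 on success and updates *words. Returns 0 on failure and
 * leaves *words untouched, so the caller's data is still valid and
 * can still be freed. New elements, if any, are NOT zeroed. */
int resize_words (words_t **words, size_t new_count)
{
    void *tmp;

    /* reject a zero size and a size whose byte count would overflow */
    if (new_count == 0 || new_count > SIZE_MAX / sizeof **words)
        return 0;

    tmp = realloc (*words, new_count * sizeof **words);
    if (!tmp)               /* failure: old block is still owned by caller */
        return 0;

    *words = tmp;           /* success: safe to replace the old pointer */

    return 1;
}
```

A caller would write something like if (!resize_words (&words, 2 * max_words)) { /* handle error, words still valid */ } and only update max_words after a successful return.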

If realloc succeeds, what then? Pay close attention to the lines:

                words = tmp;    /* assign reallocated block to words */
                /* (optional) set all new memory to zero */
                memset (words + max_words, 0, max_words * sizeof *words);
                max_words *= 2; /* update max_words to reflect new limit */

The first simply assigns the address of the new block of memory created and filled by realloc to your words pointer (words now points to a block of memory with twice as many elements as it had before).

The second line: recall that realloc and malloc do not initialize new memory to zero. If you want the new memory zeroed (which is really helpful for your .cap and .count members), you have to do that yourself with memset. So what needs to be set to zero? All the memory that wasn't in your original block. Where is that? It starts at words + max_words. How many zeros do you have to write? You have to fill all memory from words + max_words to the end of the block. Since you doubled the size, you simply have to zero an amount equal to the original size, starting at words + max_words, which is max_words * sizeof *words bytes of memory. (Remember we used 2 * max_words * sizeof *words as the new size, and we have NOT updated max_words yet, so it still holds the original size.)

Lastly, it is time to update max_words. Here, just make it match whatever you added to your allocation in realloc above. I simply doubled the size of the current allocation each time realloc is called, so to update max_words to the new allocation size, you multiply by 2 with max_words *= 2;. You can add as little or as much memory as you like each time. You could scale by 3/2, or you could add a fixed number of elements (say 10); it is completely up to you, but avoid calling realloc to add 1 element each time. You can do it, but allocation and reallocation are relatively expensive operations, so it is better to add a reasonably sized block each time you realloc, and doubling is a reasonable balance between memory growth and the number of times realloc is called.
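To get a feel for the difference between growth policies, this small sketch counts how many realloc calls each policy would need. The function names are hypothetical, and the code only does the arithmetic; it allocates nothing.

```c
#include <stddef.h>

/* Number of realloc calls needed to grow a capacity of `start`
 * elements to at least `target` when the capacity doubles each time.
 * Assumes start > 0 and that target is small enough that doubling
 * cannot overflow size_t. */
size_t reallocs_doubling (size_t start, size_t target)
{
    size_t calls = 0;

    if (start == 0)         /* doubling zero would never terminate */
        return 0;

    while (start < target) {
        start *= 2;
        calls++;
    }

    return calls;
}

/* Same count when a fixed `step` elements is added on each call.
 * Assumes step > 0. */
size_t reallocs_fixed_step (size_t start, size_t target, size_t step)
{
    size_t calls = 0;

    if (step == 0)          /* adding zero would never terminate */
        return 0;

    while (start < target) {
        start += step;
        calls++;
    }

    return calls;
}
```

Starting from 8 elements and growing to hold one million distinct words, doubling needs 17 calls, adding 10 elements at a time needs 100000 calls, and adding 1 element at a time needs 999992 calls.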

Putting it all together, you could do:

/**
 * C program to count occurrences of all words in a file.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <limits.h>

#define MAX_WORD     50     /* max word size */
#define MAX_WORDS     8     /* initial number of struct to allocate */

#ifndef PATH_MAX
#define PATH_MAX   2048     /* max path (defined for Linux in limits.h) */
#endif

typedef struct {            /* use a struct to hold */
    char word[MAX_WORD];    /* lowercase word, and */
    int cap, count;         /* if it appears capitalized, and its count */
} words_t;

char *strlwr (char *str)    /* convert str to lowercase in place */
{
    char *p = str;

    while (*p) {
        *p = tolower((unsigned char)*p);    /* cast avoids UB for negative char */
        p++;
    }

    return str;
}

int main (void) {

    FILE *fptr;
    char path[PATH_MAX], word[MAX_WORD];
    size_t i, len, index = 0, max_words = MAX_WORDS;

    /* pointer to allocated block of max_words struct initialized zero */
    words_t *words = calloc (max_words, sizeof *words);
    if (!words) {   /* validate every allocation */
        perror ("calloc-words");
        exit (EXIT_FAILURE);
    }

    /* Input file path */
    printf ("Enter file path: ");
    if (scanf ("%s", path) != 1) {  /* validate every input */
        fputs ("error: invalid file path or cancellation.\n", stderr);
        return 1;
    }

    fptr = fopen (path, "r");   /* open file */
    if (fptr == NULL) {         /* validate file open */
        fputs ( "Unable to open file.\n"
                "Please check you have read privileges.\n", stderr);
        exit (EXIT_FAILURE);
    }

    while (fscanf (fptr, "%49s", word) == 1) {  /* read at most MAX_WORD-1 chars */
        int iscap = 0, isunique = 1;    /* is capital, is unique flags */

        if (isupper ((unsigned char)*word))     /* does the word start uppercase */
            iscap = 1;

        /* remove all trailing punctuation characters */
        len = strlen (word);                    /* get length */
        while (len && ispunct((unsigned char)word[len - 1]))   /* only if len > 0 */
            word[--len] = 0;

        strlwr (word);                  /* convert word to lowercase */

        /* check if word exists in list of all distinct words */
        for (i = 0; i < index; i++) {
            if (strcmp(words[i].word, word) == 0) {
                isunique = 0;               /* set unique flag zero */
                if (iscap)                  /* if capital flag set */
                    words[i].cap = iscap;   /* set capital flag in struct */
                words[i].count++;           /* increment word count */
                break;                      /* bail - done */
            }
        }
        if (isunique) { /* if unique, add to array, increment index */
            if (index == max_words) {       /* is realloc needed? */
                /* always use a temporary pointer with realloc */
                void *tmp = realloc (words, 2 * max_words * sizeof *words);
                if (!tmp) { /* validate every allocation */
                    perror ("realloc-words");
                    break;  /* don't exit, original data still valid */
                }
                words = tmp;    /* assign reallocated block to words */
                /* (optional) set all new memory to zero */
                memset (words + max_words, 0, max_words * sizeof *words);
                max_words *= 2; /* update max_words to reflect new limit */
            }
            memcpy (words[index].word, word, len + 1);  /* have len */
            if (iscap)                      /* if cap flag set */
                words[index].cap = iscap;   /* set capital flag in struct */
            words[index++].count++;         /* increment count & index */
        }
    }
    fclose (fptr);  /* close file */

    /*
     * Print occurrences of all words in file.
     */
    puts ("\nOccurrences of all distinct words with Cap in file:");
    for (i = 0; i < index; i++) {
        if (words[i].cap) {
            strcpy (word, words[i].word);
            *word = toupper (*word);
            /*
             * %-15s prints string in 15 character width.
             * - is used to print string left align inside
             * 15 character width space.
             */
            printf("%-15s %d\n", word, words[i].count);
        }
    }
    free (words);

    return 0;
}

Example Use/Output

Where with your sample data you would get:

$ ./bin/unique_words_with_cap_dyn
Enter file path: dat/girljumped.txt

Occurrences of all distinct words with Cap in file:
Any             7
One             4
Some            10
The             6
A               13

Memory Use/Error Check

In any code you write that dynamically allocates memory, you have 2 responsibilities regarding any block of memory allocated: (1) always preserve a pointer to the starting address for the block of memory so, (2) it can be freed when it is no longer needed.

It is imperative that you use a memory error checking program to ensure you do not attempt to access memory or write beyond/outside the bounds of your allocated block, attempt to read or base a conditional jump on an uninitialized value, and finally, to confirm that you free all the memory you have allocated.

For Linux, valgrind is the normal choice. There are similar memory checkers for every platform. They are all simple to use; just run your program through them.

$ valgrind ./bin/unique_words_with_cap_dyn
==7962== Memcheck, a memory error detector
==7962== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==7962== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==7962== Command: ./bin/unique_words_with_cap_dyn
==7962==
Enter file path: dat/girljumped.txt

Occurrences of all distinct words with Cap in file:
Any             7
One             4
Some            10
The             6
A               13
==7962==
==7962== HEAP SUMMARY:
==7962==     in use at exit: 0 bytes in 0 blocks
==7962==   total heap usage: 4 allocs, 4 frees, 3,912 bytes allocated
==7962==
==7962== All heap blocks were freed -- no leaks are possible
==7962==
==7962== For counts of detected and suppressed errors, rerun with: -v

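Above you can see there were 4 allocations and 4 frees, and 0 errors. The count covers the original calloc of 8 structs and each realloc that doubled the block; it may also include allocations the C library makes internally (for example, for the FILE stream opened by fopen).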

Always confirm that you have freed all memory you have allocated and that there are no memory errors.

Look things over and let me know if you have any questions.
