
How do you count how often words of length n occur within a string?

I have this code here that correctly formats the hard-coded sentence and counts how often each letter occurs in that string:

#include <stdio.h>
#include <string.h>

int main() {
    char words[1000][100];
    int x = 0, y;

    char myString[10000] = "The quick Brown ? Fox ? jumps over the Lazy Dog and the !##! LAZY DOG is still sleeping";
    printf("Original Text:\n");
    printf("%s\n", myString);
   
    // Function for uppercase letters to become lowercase and to remove special characters
    for (x = 0; x <= strlen(myString); ++x) {
        if (myString[x] >= 65 && myString[x] <= 90)
            myString[x] = myString[x] + 32;
    }
    for (x = 0; myString[x] != '\0'; ++x) {
        while (!(myString[x] >= 'a' && myString[x] <= 'z') &&
               !(myString[x] >= 'A' && myString[x] <= 'Z') &&
               !(myString[x] >= '0' && myString[x] <= '9') &&
               !(myString[x] == '\0') && !(myString[x] == ' ')) {
            for (y = x; myString[y] != '\0'; ++y) {
                myString[y] = myString[y + 1];
            }
            myString[y] = '\0';
        }
    }
   
    printf("\nModified Text: \n%s\n", myString);

    // Part A
    int counts[26] = { 0 };
    int k;
    size_t myString_length = strlen(myString);

    for (k = 0; k < myString_length; k++) {
        char c = myString[k];
        if (!isalpha(c))
            continue;
        counts[(int)(c - 'a')]++;
    }
   
    printf("\nLetter\tCount\n------  -----\n");
    
    for (k = 0; k < 26; ++k) {
        printf("%c\t%d\n", k + 'a', counts[k]);
    }

    // Part B
    int i = 0, count = 0, occurrences[10000] = { 0 };
 
    while (myString[i] != '\0') {
        char wordArray[100];
        int j = 0;
       
        while (myString[i] != ' ' && myString[i] != '\0') {
            wordArray[j++] = myString[i++];
        }
     
        if (wordArray[j - 1] == ',' || wordArray[j - 1] == '.') {
            wordArray[j - 1] = '\0';
        }

        wordArray[j] = '\0';

        int status = -1;
    
        for (j = 0; j < count; ++j) {
            if (strcmp(words[j], wordArray) == 0) {
                status = j;
                break;
            }
        }
    
        if (status != -1) {
            occurrences[status] += 1;
        } else {
            occurrences[count] += 1;
            strcpy(words[count++], wordArray);
        }
        ++i;
    }
 
    printf("\nWord Length\tOccurrences\n-----------     -----------\n");
 
    for (i = 0; i < count; ++i) {
        // print each word and its occurrences
        printf("%s\t\t%d\n", words[i], occurrences[i]);
    }
}

Part B is where I'm having a problem, though. I want the code to tell me how many words of each specific length occur, as in this example:

Word length Occurrences
1           0
2           1

Here, no word has one character, and one word has two characters. However, my code outputs the number of times each specific word occurs instead of what I want above, like this:

Word Length     Occurrences
-----------     -----------
the             3
quick           1
brown           1
                3
fox             1
jumps           1
over            1
lazy            2
dog             2
and             1
is              1
still           1
sleeping                1

How would I go about changing it so that it shows the output I want with just the word length and frequency?

Here are some remarks about your code:

  • the first loop recomputes the length of the string for each iteration: for (x = 0; x <= strlen(myString); ++x) . Since you modify the string inside the loop, it is difficult for the compiler to ascertain that the string length does not change, so a classic optimisation may not work. Use the same test as for the next loop:

     for (x = 0; myString[x] != '\0'; ++x)
  • the test for uppercase is not very readable because you hardcode the ASCII values of the letters A and Z , you should either write:

     if (myString[x] >= 'A' && myString[x] <= 'Z') myString[x] += 'a' - 'A';

    or use macros from <ctype.h> :

     unsigned char c = myString[x]; if (isupper(c)) myString[x] = tolower(c);

    or equivalently and possibly more efficiently:

     myString[x] = tolower((unsigned char)myString[x]);
  • in the second loop, you remove characters that are neither letters, digits nor spaces. You have a redundant nested while loop and a third nested loop to shift the rest of the array for each byte removed: since each removal shifts the whole tail of the string, this method has quadratic time complexity, O(N^2), which is very inefficient. You should instead use a two finger method that operates in linear time:

     for (x = y = 0; myString[x] != '\0'; ++x) {
         unsigned char c = myString[x];
         if (isalnum(c) || c == ' ') {
             myString[y++] = c;
         }
     }
     myString[y] = '\0';
  • note that this loop removes all punctuation instead of replacing it with spaces: this might glue words together such as "a fine,good man" -> "a finegood man"

  • In the third loop, you use a char value c as an argument for isalpha(c) . You should include <ctype.h> to use any function declared in this header file. Functions and macros from <ctype.h> are only defined for all values of the type unsigned char and the special negative value EOF . If type char is signed on your platform, isalpha(c) would have undefined behavior if the string has negative characters. In your particular case, you filtered characters that are not ASCII letters, digits or space, so this should not be a problem, yet it is a good habit to always use unsigned char for the character argument to isalpha() and equivalent functions.

  • Note also that this counting phase could have been combined into the previous loops.

  • to count the occurrences of words, the array occurrences should have the same number of elements as the words array, 1000. You do not check for boundaries so you have undefined behavior if there are more than 1000 different words and/or if any of these words has 100 characters or more.

  • in the next loop, you extract words from the string, incrementing i inside the nested loop body. You also increment i at the end of the outer loop, hence skipping the final null terminator. The test while (myString[i] != '\0') will test bytes beyond the end of the string, which is incorrect and potential undefined behavior.

  • to avoid counting empty words in this loop, you should skip sequences of spaces before copying the word if not at the end of the string.

  • According to the question, counting individual words is not what Part B is expected to do, you should instead count the frequency of word lengths. You can do this in the first loop by keeping track of the length of the current word and incrementing the array of word length frequencies when you find a separator.

  • Note that modifying the string is not necessary to count letter frequencies or word length occurrences.

  • Writing a separate function for each task is recommended.

Here is a modified version:

#include <ctype.h>
#include <stdio.h>

#define MAX_LENGTH 100

// Function to lowercase letters and remove special characters
void clean_string(char *str) {
    int x, y;

    printf("Original Text:\n");
    printf("%s\n", str);

    for (x = y = 0; str[x] != '\0'; x++) {
        unsigned char c = str[x];
        c = tolower(c);
        if (isalnum(c) || c == ' ') {
            str[y++] = c;
        }
    }
    str[y] = '\0';

    printf("\nModified Text:\n%s\n", str);
}

// Part A: count letter frequencies
void count_letters(const char *str) {
    int letter_count['z' - 'a' + 1] = { 0 };

    for (int i = 0; str[i] != '\0'; i++) {
        unsigned char c = str[i];
        if (c >= 'a' && c <= 'z') {
            letter_count[c - 'a'] += 1;
        } else
        if (c >= 'A' && c <= 'Z') {
            letter_count[c - 'A'] += 1;
        }
    }

    printf("\nLetter\tCount"
           "\n------\t-----\n");
    for (int c = 'a'; c <= 'z'; c++) {
        printf("%c\t%d\n", c, letter_count[c - 'a']);
    }
}

// Part B: count word lengths frequencies
void count_word_lengths(const char *str) {
    int length_count[MAX_LENGTH + 1] = { 0 };

    for (int i = 0, len = 0;; i++) {
        unsigned char c = str[i];
        // counting words as sequences of letters or digits
        if (isalnum(c)) {
            len++;
        } else {
            // a separator or the null terminator ends the current word
            if (len > 0 && len <= MAX_LENGTH) {
                length_count[len] += 1;
            }
            len = 0;
        }
        if (c == '\0')
            break;
    }

    printf("\nWord Length\tOccurrences"
           "\n-----------\t-----------\n");
    for (int len = 0; len <= MAX_LENGTH; len++) {
        if (length_count[len]) {
            printf("%-11d\t%d\n", len, length_count[len]);
        }
    }
}

int main() {
    char myString[] = "The quick Brown ? Fox ? jumps over the Lazy Dog and the !##! LAZY DOG is still sleeping";

    // Uncomment if modifying the string is required
    //clean_string(myString);

    count_letters(myString);
    count_word_lengths(myString);
    return 0;
}

Output:

Letter  Count
------  -----
a       3
b       1
c       1
d       3
e       6
f       1
g       3
h       3
i       4
j       1
k       1
l       5
m       1
n       3
o       5
p       2
q       1
r       2
s       4
t       4
u       2
v       1
w       1
x       1
y       2
z       2

Word Length     Occurrences
-----------     -----------
2               1
3               7
4               3
5               4
8               1

Use strtok_r() and simplify the counting.
Its sibling strtok() is not thread-safe; this is discussed in detail in "Why is strtok() Considered Unsafe?".

Also, strtok_r() chops the input string by inserting \0 characters into it. If you want to keep the original string intact, you have to make a copy and pass that copy to strtok_r() .

There is another catch: strtok_r() is not part of the C standard, but POSIX-2008 lists it. GNU glibc implements it, but to access this function we need to #define _POSIX_C_SOURCE before any includes in our source files.

There are also strdup() and strndup() , which duplicate an input string and allocate the memory for you. You have to free() that memory when you are done using it. strndup() was added in POSIX-2008, so we define _POSIX_C_SOURCE as 200809L in our sources to use it.

For fresh code it is generally better to target recent standards: POSIX 200809L with at least the 2011 C standard is recommended.

#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define MAX_STR_LEN     1024
#define MAX_WORD_LEN    128
#define WORD_DELIMS     " \n\t"

int is_word (const char* str, const size_t slen) {
    int word = 0;
    for (size_t ci = 0; ci < slen;)
        if (isalnum ((unsigned char) str[ci++])) { // cast: <ctype.h> needs unsigned char values
            word = 1;
            break;
        }
    return word;
}

void get_word_stat (const char* str, int word_stat[]) {
    char *copy = strndup (str, MAX_STR_LEN); // limiting copy
    if (!copy) { // copying failed
        printf ("Error duplicating input string\n");
        exit (1);
    }
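    char *rmdStr = NULL; // save pointer used by strtok_r() between calls
    // the first call gets the string itself, later calls get NULL
    for (char *token = strtok_r (copy, WORD_DELIMS, &rmdStr); token; token = strtok_r (NULL, WORD_DELIMS, &rmdStr)) {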
        size_t token_len = strlen (token);
        if (token_len > (MAX_WORD_LEN - 1)) {
            printf ("Error: Increase MAX_WORD_LEN(%d) to handle words of length %zu\n", MAX_WORD_LEN, token_len);
            exit (2);
        }
        if (is_word (token, token_len))
            ++word_stat[token_len];
        else
            printf ("[%s] not a word\n", token);
    }
    free (copy);
}

int main () {
    char str [MAX_STR_LEN] = "The quick Brown ? Fox ? jumps over the Lazy Dog and the !##! LAZY DOG is still sleeping";
    printf ("Original Text: [%s]\n", str);

    int word_stat[MAX_WORD_LEN] = {0};
    get_word_stat (str, word_stat);

    printf ("\nWordLength   Occurrences\n");
    for (int si = 1; si < MAX_WORD_LEN; ++si) {
        if (word_stat[si])
            printf ("%d\t\t%d\n", si, word_stat[si]);
    }
    return 0;
}

Whenever you are interested in the frequency with which something occurs, you want to use a Frequency Array containing the number of elements necessary to handle the entire range of possible occurrences. You want to track the frequency of word lengths, so you need an array that is sized to track the longest word. (The longest word in the non-medical unabridged dictionary is 29 characters, the longest medical word is 45 characters.)

So here a simple array of integers with 30 elements will do, since index 29 must be valid (unless you want to consider medical words, then use 46). If you want to consider nonsense words, then size appropriately, e.g. "Supercalifragilisticexpialidocious" , 34 characters. Choose the type based on a reasonably anticipated maximum number of occurrences. Using a signed int limits the occurrences to INT_MAX ( 2147483647 ). Using unsigned doubles the limit, and uint64_t gives a full 64-bit range.

How it works

How do you use a simple array to track the occurrences of word lengths? Simple, declare an array of sufficient size and initialize all elements to zero . Now all you do is read a word, use, e.g. size_t len = strlen(word); to get the length, check that len is a valid index, and then increment yourarray[len] += 1; .

Say the word has 10 characters, you will add one to yourarray[10] . So the array index corresponds to the word length . When you have taken the length of all words and incremented the corresponding array elements, to get your results, you just loop over your array and output the value (number of occurrences) at each index (word length). If you have had two words that were 10 characters each, then yourarray[10] will contain 2 (and so on for every other index that corresponds to a different word length).

Consideration When Choosing How to Separate Words

When selecting a method to split a string of space-separated words into individual words, you need to know whether your original string is mutable . For example, if you choose to separate words with strtok() , it will modify the original string. In your case, since your words are stored in an array of char , that is fine, but what if you had a string-literal like:

  char *mystring =  "The quick Brown ? Fox ? jumps over the Lazy Dog ";

In that case, passing mystring to strtok() would likely SEGFAULT when strtok() attempts to modify the region of read-only memory holding mystring (modifying a string-literal is undefined behavior; this ignores the non-standard treatment of string-literals by Microsoft).

You can of course make a copy of mystring to put the string-literal in mutable memory and then call strtok() on the copy. Or, you can use a method that does not modify mystring (like using sscanf() and an offset to parse the words, or using alternating calls to strcspn() and strspn() to locate and skip whitespace, or simply using a start and end pointer to work down the string, bracketing words and copying characters between the pointers). Entirely up to you.

For example, using sscanf() with an offset to work down the string, updating the offset from the beginning with the number of characters consumed during each read you could do:

  char *mystring =  "The quick Brown ? Fox ? jumps over the Lazy Dog "
                    "and the !##! LAZY DOG is still sleeping",
       *p = mystring,         /* pointer to mystring to parse */
       buf[MAXLEN] = "";      /* temporary buffer to hold each word */
  int nchar = 0,              /* characters consumed by sscanf */
      offset = 0;             /* offset from beginning of mystring */
  
  /* loop over each word in mystring using sscanf and offset */
  while (sscanf (p + offset, "%31s%n", buf, &nchar) == 1) {  /* width 31 assumes MAXLEN is 32 */
    size_t len = strlen (buf);    /* length of word */
    
    offset += nchar;              /* update offset with nchar */
    
    /* do other stuff here */
  }

Testing if a Word is Alphanumeric

You can loop over each character calling the isalnum() macro from ctype.h on each character. Or, you can let strspn() do it for you given a list of characters that your words can contain. For example for digits and alpha-characters only, you can use a simple constant, and then call strspn() in your loop to determine if the word is made up only of the characters you will accept in a word, eg

#define ACCEPT "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
...
    /* use strspn to test that word is valid (alphanum) or get next word */
    if (strspn (buf, ACCEPT) != len) {
      fprintf (stderr, "  error: rejecting \"%s\"\n", buf); /* optional */
      continue;
    }
    ...

Neither way is more-right than the other, it's really a matter of convenience and readability. Using a library provided function also provides a bit of confidence that it is written in a manner that will allow the compiler to fully optimize the compiled code.

A Short Example

Putting the thoughts above together in a short example that will parse the words in mystring using sscanf() and then track the occurrences of all alphanum words (up to 31-characters, and outputting any word rejected) using a simple array of integers to hold the frequency of length, you could do:

#include <stdio.h>
#include <string.h>

#define MAXLEN      32    /* if you need a constant, #define one (or more) */
#define ACCEPT "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

int main (void) {
  
  char *mystring =  "The quick Brown ? Fox ? jumps over the Lazy Dog "
                    "and the !##! LAZY DOG is still sleeping",
       *p = mystring,         /* pointer to mystring to parse */
       buf[MAXLEN] = "";      /* temporary buffer to hold each word */
  int nchar = 0,              /* characters consumed by sscanf */
      offset = 0,             /* offset from beginning of mystring */
      lenfreq[MAXLEN] = {0};  /* frequency array for word length */
  
  /* loop over each word in mystring using sscanf and offset */
  while (sscanf (p + offset, "%31s%n", buf, &nchar) == 1) {  /* width is MAXLEN - 1 */
    size_t len = strlen (buf);    /* length of word */
    
    offset += nchar;              /* update offset with nchar */
    
    /* use strspn to test that word is valid (alphanum) or get next word */
    if (strspn (buf, ACCEPT) != len) {
      fprintf (stderr, "  error: rejecting \"%s\"\n", buf); /* optional */
      continue;
    }
    
    lenfreq[len] += 1;      /* update frequency array of lengths */
  }
  
  /* output original string */
  printf ("\nOriginal Text:\n\n%s\n\n", mystring);
  
  /* output length frequency array */
  puts ("word length     Occurrences\n"
        "-----------     -----------");
  for (size_t i = 0; i < MAXLEN; i++) {
    if (lenfreq[i])
      printf ("%2zu%14s%d\n", i, " ", lenfreq[i]);
  }
}

Example Use/Output

Compiling and running the program would produce:

$ ./bin/wordlen-freq
  error: rejecting "?"
  error: rejecting "?"
  error: rejecting "!##!"

Original Text:

The quick Brown ? Fox ? jumps over the Lazy Dog and the !##! LAZY DOG is still sleeping

word length     Occurrences
-----------     -----------
 2              1
 3              7
 4              3
 5              4
 8              1

( note: you can output all lengths from 0 to 31 even if there were no occurrences by removing the print condition if (lenfreq[i]) -- up to you)

Look things over and let me know if you have questions.
