
Printing the most frequently occurring words in a given text file, unable to sort by frequency in C

I am working on an assignment that requires me to print the top 10 most occurring words in a given text file. My code is printing the words from the file, but it is not sorting them according to their frequency.

Some of my code is below. I use a hashtable to store each unique word and its frequency. I currently sort the words with the wordcmp function I wrote, which I pass to the built-in qsort function in main.

If anyone can guide me to fix my error, I'd be very grateful.

My current output:

the top 10 words (out of 10) are:
1 im
1 are
1 again
3 happy
2 hello
1 how
1 lets
1 you
1 try
1 this

Expected output (what I want):

The top 10 words (out of 10) are:
3 happy
2 hello
1 you
1 try
1 this
1 lets
1 im
1 how
1 are
1 again

Here is some of my code:

typedef struct word
{ 
  char *s;          /* the word */
  int count;        /* number of times word occurs */
  struct word* next;
}word;

struct hashtable
{
  word **table;
  int tablesize;
  int currentsize;
};
typedef struct hashtable hashtable;
int main(int argc, char *argv[])
{

    int top_words = 10;
    word *word = NULL;
    hashtable *hash = ht_create(5000);
    char *file_name;
    char *file_word;
    FILE *fp;
    struct word *present = NULL;

    fp = fopen (file_name, "r");
    if (fp == NULL)
    {
        fprintf (stderr,"%s: No such file or directory\n", file_name);
        fprintf(stderr,"The top %d words (out of 0) are:\n", top_words); 
        exit(-1);
    }

    continue_program:
    while ((file_word = getWord(fp)))
    {
        word = add(hash, file_word, 1);
    }
    fclose(fp);

    qsort((void*)hash->table, hash->currentsize, sizeof(word),(int (*)(const void *, const void *)) wordcmp);

    if(top_words > total_unique_words)
          top_words = total_unique_words;

    printf("the top %d words (out of %d) are:\n", top_words, total_unique_words);

    int iterations =0;
    for(i =0; i <= hash->tablesize && iterations< top_words; i++)
    {
          present = hash->table[i];
          if(present != NULL)
          {
              printf("     %4d %s\n", present->count, present->s);
              present = present->next;
              iterations++;
          }
    }
    freetable(hash);

 return 0;
}

int wordcmp (word *a, word *b) 
{
    if (a != NULL && b!= NULL) {

    if (a->count < b->count) 
    {
      return +1;     
    }
    else if (a->count > b->count) 
    {
        return -1; 
    }
    else if (a->count == b->count)
    {
      /*return strcmp(b->s, a->s);*/
      return 0;
    }
  }
  return 0;
}

/* Create a new hashtable. */
struct hashtable *ht_create( int size ) 
{
  int i;

  if( size < 1 ) 
    return NULL;

  hashtable *table = (hashtable *) malloc(sizeof(hashtable));
  table->table = (word **) malloc(sizeof(word *) * size);

  if(table != NULL)
  {
      table->currentsize = 0;
      table->tablesize = size;
  }

  for( i = 0; i < size; i++ ) 
  {
    table->table[i] = NULL;
  }

  return table; 
}

/* Adds a new node to the hash table*/
word * add(hashtable *h, char *key, int freq) 
{
    int index = hashcode(key) % h->tablesize;
    word *current = h->table[index];

    /* Search for duplicate value */
    while(current != NULL) {
        if(contains(h, key) == 1){
            current->count++;
            return current;
       }
         current = current->next;
     }

    /* Create new node if no duplicate is found */
    word *newnode = (struct word*)malloc(sizeof(struct word));
    if(newnode!=NULL){
          newnode->s =strdup(key);
          newnode-> count = freq;
          newnode-> next = NULL;
    }
    h->table[index] = newnode;
    h->currentsize = h->currentsize + 1;
    total_unique_words++;
    return newnode;
}

The primary problem you are facing is attempting to sort a hashtable that uses linked-list chaining of buckets. When a hash collision occurs, your table is not resized; the word causing the collision is stored at the same table[index], linked to the word already stored there. That is what add is meant to do. (As posted, add sets newnode->next to NULL and then overwrites table[index], so it would also need to link the new node into the existing chain.)

This can easily result in the contents of your hashtable looking like this:

table[ 0] = NULL
table[ 1] = foo
table[ 2] = NULL
table[ 3] = |some|->|words|->|that|->|collided|  /* chained bucket */
table[ 4] = other
table[ 5] = words
table[ 6] = NULL
table[ 7] = NULL
...

You cannot simply qsort table and hope to get the correct word frequencies. qsort has no way to know that "some" is just the first word in a linked list; all qsort gets is a pointer to "some" and sizeof(word).
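To make the limitation concrete, here is a minimal sketch. It is not the asker's code: the node type and the names by_count_desc and sort_heads_demo are made up for illustration. It sorts a three-slot table of bucket heads, much as the question tries to. Note that qsort hands the comparator pointers to the array elements, so for an array of node * the comparator actually receives node **. The word chained behind a bucket head is never compared, so the highest count stays hidden.

```c
#include <stdio.h>
#include <stdlib.h>

/* hypothetical node type, mirroring the shape of the question's struct word */
typedef struct node {
    const char *s;
    int count;
    struct node *next;
} node;

/* qsort passes pointers to the array ELEMENTS; the elements are node*,
 * so each argument is really a node** and must be dereferenced once.
 * NULL (empty) buckets are ordered last. */
static int by_count_desc (const void *a, const void *b)
{
    const node *x = *(const node * const *)a;
    const node *y = *(const node * const *)b;

    if (!x || !y)
        return (x == NULL) - (y == NULL);
    return (x->count < y->count) - (x->count > y->count);
}

/* build a 3-slot table where "happy" (count 3) is chained behind "im",
 * sort the bucket heads, and return the count that ends up in slot 0. */
int sort_heads_demo (void)
{
    node happy = { "happy", 3, NULL };
    node im    = { "im",    1, &happy };    /* collision: im -> happy */
    node hello = { "hello", 2, NULL };
    node *table[3] = { &im, NULL, &hello };

    qsort (table, 3, sizeof *table, by_count_desc);

    for (int i = 0; i < 3; i++)             /* print bucket heads only */
        if (table[i])
            printf ("%d %s\n", table[i]->count, table[i]->s);

    return table[0]->count;     /* count of the head sorted first */
}
```

Calling sort_heads_demo() from main lists only "hello" and "im"; "happy" never takes part in the sort because qsort only ever sees the three array slots.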

To make life much easier, simply forget the hashtable and use a dynamically allocated array of word structs. You can use a similar add that increments the number of occurrences for duplicates, and you avoid all problems with chained buckets. (And if each struct holds its word in a fixed-size array, cleanup is a single free() of the array and you are done.)

The following example takes 2 arguments: the first is the filename to read words from, and the (optional) second is an integer limiting the sorted output to that number of top words. The words_t struct uses fixed-size storage for word, limited to 32 chars (the longest word in the unabridged dictionary is 28 characters). You can change the way words are read to parse the input and ignore punctuation and plurals as desired. The following delimits words on all punctuation (except the hyphen), and discards the possessive form of words (e.g. it stores "Mike" when "Mike's" is encountered, discarding the "'s").

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <errno.h>

#define MAXC   32   /* max word length is 28-char, 29-char is sufficient */
#define MAXW  128   /* initial maximum number of words to allocate */

typedef struct {
    char word[MAXC];    /* struct holding individual words */
    size_t ninst;       /* and the number of times they occur */
} words_t;

/*  function prototypes */
void *addword (words_t *words, const char *word, size_t *wc, size_t *maxw);
void *xrealloc (void *ptr, size_t psz, size_t *nelem);

/* qsort compare function for words_t (alphabetical) */
int cmpwrds (const void *a, const void *b)
{
    return strcmp (((words_t *)a)->word, ((words_t *)b)->word);
}

/* qsort compare function for words_t (by occurrence - descending)
 * and alphabetical (ascending) if occurrences are equal)
 */
int cmpinst (const void *a, const void *b)
{
    int ndiff =  (((words_t *)a)->ninst < ((words_t *)b)->ninst) - 
                (((words_t *)a)->ninst > ((words_t *)b)->ninst);

    if (ndiff)
        return ndiff;

    return strcmp (((words_t *)a)->word, ((words_t *)b)->word);
}

int main (int argc, char **argv) {

    int c = 0, nc = 0, prev = ' ', total = 0;
    size_t maxw = MAXW, wc = 0, top = 0;
    char buf[MAXC] = "";
    words_t *words = NULL;
    /* read from argv[1] if given, otherwise from stdin */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!fp) {  /* validate file open for reading */
        fprintf (stderr, "error: file open failed '%s'.\n", argv[1]);
        return 1;
    }

    if (argc > 2) { /* if 2 args, convert argv[2] to number of top words */
        char *p = argv[2];
        errno = 0;  /* reset so a stale errno is not mistaken for failure */
        size_t tmp = strtoul (argv[2], &p, 0);
        if (p != argv[2] && !errno)
            top = tmp;
    }

    /* allocate/validate initial words */
    if (!(words = calloc (maxw, sizeof *words))) {
        perror ("calloc-words");
        return 1;
    }

    while ((c = fgetc(fp)) != EOF) {        /* read each character in file */
        if (c != '-' && (isspace (c) || ispunct (c))) { /* word-end found */
            if (!isspace (prev) && !ispunct (prev) &&   /* multiple ws/punct */
                !(prev == 's' && nc == 1)) {            /* exclude "'s" */
                buf[nc] = 0;                            /* nul-terminate */
                words = addword (words, buf, &wc, &maxw);   /* add word */
            }
            nc = 0;     /* always reset char count, so a skipped "'s" or a
                         * lone '-' does not leak into the next word */
        }
        else if (nc < MAXC - 1) {   /* add char to buf */
            buf[nc++] = c;
        }
        else {  /* chars exceed MAXC - 1; storage capability of struct */
            fprintf (stderr, "error: characters exceed %d.\n", MAXC);
            return 1;
        }
        prev = c;   /* save previous char */
    }
    if (nc > 0 && !isspace (prev) && !ispunct (prev)) { /* handle non-POSIX end */
        buf[nc] = 0;                                    /* nul-terminate */
        words = addword (words, buf, &wc, &maxw);
    }

    if (fp != stdin) fclose (fp);   /* close file if not stdin */

    qsort (words, wc, sizeof *words, cmpinst);  /* sort words by frequency */

    printf ("'%s' contained '%zu' words.\n\n",  /* output total No. words */
            fp == stdin ? "stdin" : argv[1], wc);

    /* output top words (or all words in descending order if top is not
     * given or exceeds the number of words stored) */
    for (size_t i = 0; i < (top != 0 && top < wc ? top : wc); i++) {
        printf ("  %-28s    %5zu\n", words[i].word, words[i].ninst);
        total += words[i].ninst;
    }
    printf ("%33s------\n%34s%5d\n", " ", "Total: ", total);

    free (words);

    return 0;
}

/** add word to words, updating pointer to word-count 'wc' and
 *  the maximum words allocated 'maxw' as needed. returns pointer
 *  to words (which must be assigned back in the caller).
 */
void *addword (words_t *words, const char *word, size_t *wc, size_t *maxw)
{
    size_t i;

    for (i = 0; i < *wc; i++)
        if (strcmp (words[i].word, word) == 0) {
            words[i].ninst++;
            return words;
        }

    if (*wc == *maxw)
        words = xrealloc (words, sizeof *words, maxw);

    strcpy (words[*wc].word, word);
    words[(*wc)++].ninst++;

    return words;
}

/** realloc 'ptr' of 'nelem' of 'psz' to 'nelem * 2' of 'psz'.
 *  returns pointer to reallocated block of memory with new
 *  memory initialized to 0/NULL. return must be assigned to
 *  original pointer in caller.
 */
void *xrealloc (void *ptr, size_t psz, size_t *nelem)
{   void *memptr = realloc ((char *)ptr, *nelem * 2 * psz);
    if (!memptr) {
        perror ("realloc(): virtual memory exhausted.");
        exit (EXIT_FAILURE);
    }   /* zero new memory (optional) */
    memset ((char *)memptr + *nelem * psz, 0, *nelem * psz);
    *nelem *= 2;
    return memptr;
}

(Note: the output is sorted in descending order of occurrence, and in alphabetical order when words have the same number of occurrences.)

Example Use/Output

$ ./bin/getchar_wordcnt_top dat/damages.txt 10
'dat/damages.txt' contained '109' words.

  the                                12
  a                                  10
  in                                  7
  of                                  7
  and                                 5
  anguish                             4
  injury                              4
  jury                                4
  mental                              4
  that                                4
                                 ------
                           Total:    61

Note: to keep your hashtable as the basis for storage, you would have to, at minimum, create an array of pointers to each word in your hashtable and then sort that array of pointers. Otherwise you would need to duplicate storage and copy the words to a new array to sort (a somewhat memory-inefficient approach). Creating a separate array of pointers to each word in your hashtable is about the only way to call qsort and avoid the chained-bucket problem.
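The array-of-pointers approach in the note above could look roughly like the sketch below. It is a sketch under assumptions, not a drop-in fix: the structs only mirror the shape of the ones in the question (with s made const because nothing here modifies it), and sorted_words and cmp_wordptr are hypothetical helper names. The helper walks every bucket and every chain, collects one pointer per node, and sorts the pointers by count (descending), then alphabetically.

```c
#include <stdlib.h>
#include <string.h>

/* same shape as the question's structs; s is const in this sketch */
typedef struct word {
    const char *s;
    int count;
    struct word *next;
} word;

typedef struct {
    word **table;
    int tablesize;
} hashtable;

/* comparator for an array of word*: qsort passes word** arguments.
 * descending by count, then ascending alphabetically. */
static int cmp_wordptr (const void *a, const void *b)
{
    const word *x = *(const word * const *)a;
    const word *y = *(const word * const *)b;
    int d = (x->count < y->count) - (x->count > y->count);

    return d ? d : strcmp (x->s, y->s);
}

/* hypothetical helper: returns a malloc'd, sorted array of *n pointers
 * to the nodes in h (caller frees the array only, not the nodes), or
 * NULL with *n == 0 if the table is empty or allocation fails. */
word **sorted_words (const hashtable *h, size_t *n)
{
    size_t total = 0, k = 0;
    word **arr;

    *n = 0;
    for (int i = 0; i < h->tablesize; i++)      /* count all nodes, */
        for (const word *p = h->table[i]; p; p = p->next)
            total++;                            /* chains included  */

    if (!total || !(arr = malloc (total * sizeof *arr)))
        return NULL;

    for (int i = 0; i < h->tablesize; i++)      /* collect pointers */
        for (word *p = h->table[i]; p; p = p->next)
            arr[k++] = p;

    qsort (arr, total, sizeof *arr, cmp_wordptr);
    *n = total;
    return arr;
}
```

A caller would print arr[i]->count and arr[i]->s for the first top_words entries and then free(arr). The hashtable itself is left untouched, so lookups keep working after the sort.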
