
C - Get random words from a text file

I have a text file that contains a list of words in a fixed order. I'm trying to write a function that returns an array of words from this file. I managed to retrieve the words in the same order as the file, like this:

char *readDict(char *fileName) {

    int i;

    char * lines[100];
    FILE *pf = fopen ("francais.txt", "r");

    if (pf == NULL) {
        printf("Unable to open the file");
    } else {

        for (i = 0; i < 100; i++) {

            lines[i] = malloc(128);

            fscanf(pf, "%s", lines[i]);

            printf("%d: %s\n", i, lines[i]);
        }


        fclose(pf);

        return *lines;
    }

    return "NULL";
}

My question is: how can I return an array of random words from the text file, rather than words in file order?

The file looks like this:

exemple1
exemple2
exemple3
exemple4

Reservoir sampling lets you select a fixed number of random elements from a stream of indeterminate size. Something like this could work (although untested):

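#include <limits.h>   /* LINE_MAX (POSIX) */
#include <stdio.h>
#include <stdlib.h>   /* calloc, free, random (POSIX) */
#include <string.h>   /* strdup (POSIX) */

/* Returns an array of count lines sampled from the file, or NULL on failure.
   Entries stay NULL if the file has fewer than count lines. */
char **reservoir_sample(const char *filename, int count) {
    FILE *file;
    char **lines;
    char buf[LINE_MAX];
    int i, n;

    file = fopen(filename, "r");
    if (file == NULL) {
        return NULL;
    }
    lines = calloc(count, sizeof(char *));
    if (lines == NULL) {
        fclose(file);
        return NULL;
    }
    for (n = 1; fgets(buf, LINE_MAX, file); n++) {
        if (n <= count) {
            /* Fill the reservoir with the first count lines */
            lines[n - 1] = strdup(buf);
        } else {
            /* Keep line n with probability count / n */
            i = random() % n;
            if (i < count) {
                free(lines[i]);
                lines[i] = strdup(buf);
            }
        }
    }
    fclose(file);

    return lines;
}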

This is "Algorithm R":

  • Read the first count lines into the sample array.
  • For each subsequent line, replace a random element of the sample array with probability count / n, where n is the line number.
  • At the end, the sample contains a set of random lines. (The order is not uniformly random, but you can fix that with a shuffle.)
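
The shuffle mentioned in the last step could be sketched as a Fisher-Yates pass over the returned array. This is only an illustration: shuffle_lines is a hypothetical helper name, it uses the standard rand() rather than the POSIX random() used above, and rand() % (i + 1) carries a slight modulo bias.

```c
#include <stdlib.h>

/* Hypothetical helper (not part of the answer above): Fisher-Yates shuffle
   of an array of string pointers. rand() % (i + 1) has a slight modulo
   bias, which is usually acceptable for a word list. */
void shuffle_lines(char **lines, size_t count) {
    if (count < 2) {
        return;  /* nothing to shuffle */
    }
    for (size_t i = count - 1; i > 0; i--) {
        /* Pick a position in [0, i] and swap it with position i */
        size_t j = (size_t)rand() % (i + 1);
        char *tmp = lines[i];
        lines[i] = lines[j];
        lines[j] = tmp;
    }
}
```

Calling shuffle_lines(lines, count) on the result of reservoir_sample() would reorder the sample in place; the caller is assumed to have seeded the generator with srand() beforehand.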

If each line of the file contains one word, one possibility is to open the file and count the number of lines first. Then rewind() the file stream and select a random number, sel, between 1 and the number of lines in the file. Next, call fgets() in a loop to read sel lines into a buffer, one at a time. The last word read can be copied into an array that stores the results. Rewind and repeat for each word desired.

Here is a program that uses the /usr/share/dict/words file that is typical on Linux systems. Note that if the number of lines in the file is greater than RAND_MAX (the largest number that can be returned by rand()), words with greater line numbers can never be selected. RAND_MAX can be as small as 32767. In the GNU C Library, RAND_MAX is 2147483647.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define MAX_WORD   100
#define NUM_WORDS  10

int main(void)
{
    /* Open words file */
    FILE *fp = fopen("/usr/share/dict/words", "r");

    if (fp == NULL) {
        perror("Unable to locate word list");
        exit(EXIT_FAILURE);
    }

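    /* Count words in file */
    char word[MAX_WORD];
    long wc = 0;
    while (fgets(word, sizeof word, fp) != NULL) {
        ++wc;
    }
    if (wc == 0) {              /* avoid rand() % 0 below */
        fprintf(stderr, "Word list is empty\n");
        exit(EXIT_FAILURE);
    }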

    /* Store random words in array */
    char randwords[NUM_WORDS][MAX_WORD];
    srand((unsigned) time(NULL));
    for (size_t i = 0; i < NUM_WORDS; i++) {
        rewind(fp);
        int sel = rand() % wc + 1;
        for (int j = 0; j < sel; j++) {
            if (fgets(word, sizeof word, fp) == NULL) {
                perror("Error in fgets()");
            }
        }
        strcpy(randwords[i], word);
    }

    if (fclose(fp) != 0) {
        perror("Unable to close file");
    }

    /* Display results */
    for (size_t i = 0; i < NUM_WORDS; i++) {
        printf("%s", randwords[i]);
    }

    return 0;
}

Program output:

biology's
lists
revamping
slitter
loftiness's
concur
solemnity's
memories
winch's
boosting
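
On the RAND_MAX limitation noted earlier: one possible workaround, sketched here under assumptions rather than taken from the program above, is to combine several rand() calls into a wider value before reducing it to a line number. wide_rand and pick_line are hypothetical helper names, and the result still has a small modulo bias.

```c
#include <stdlib.h>

/* Hypothetical helper: combine three rand() calls into a 45-bit value.
   Only the low 15 bits of each call are used, because RAND_MAX is
   guaranteed to be at least 32767. */
unsigned long long wide_rand(void) {
    unsigned long long r = 0;
    for (int i = 0; i < 3; i++) {
        r = (r << 15) | ((unsigned long long)rand() & 0x7FFFu);
    }
    return r;
}

/* Hypothetical helper: pick a 1-based line number in [1, line_count].
   Assumes line_count > 0; a small modulo bias remains. */
long pick_line(long line_count) {
    return (long)(wide_rand() % (unsigned long long)line_count) + 1;
}
```

In the program above, int sel = rand() % wc + 1; could then become long sel = pick_line(wc);, with the inner loop counter widened to long as well.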

If blank lines in the input are a concern, the selection loop can test for them and reset to select another word when they occur:

/* Store random words in array */
char randwords[NUM_WORDS][MAX_WORD];
srand((unsigned) time(NULL));
for (size_t i = 0; i < NUM_WORDS; i++) {
    rewind(fp);
    int sel = rand() % wc + 1;
    for (int j = 0; j < sel; j++) {
        if (fgets(word, sizeof word, fp) == NULL) {
            perror("Error in fgets()");
        }
    }
    if (word[0] == '\n') {      // if line is blank
        --i;                    // reset counter
        continue;               // and select another one
    }

    strcpy(randwords[i], word);
}

Note that if a file contains only blank lines, the program would loop forever with the above modification; it may be safer to count the number of blank lines selected in a row and give up once some reasonable threshold is reached. Better yet, verify during the initial line count that at least one line of the input file is not blank:

/* Count words in file */
char word[MAX_WORD];
long wc = 0;
long nonblanks = 0;
while (fgets(word, sizeof word, fp) != NULL) {
    ++wc;
    if (word[0] != '\n') {
        ++nonblanks;
    }
}
if (nonblanks == 0) {
    fprintf(stderr, "Input file contains only blank lines\n");
    exit(EXIT_FAILURE);
}
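
The other option mentioned above, capping the number of consecutive blank selections, might look roughly like the following. This is a sketch only: pick_nonblank and MAX_RETRIES are hypothetical names, the threshold of 50 is an arbitrary assumption, and wc is taken to be the line count from the counting pass.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_WORD    100
#define MAX_RETRIES 50   /* assumed threshold; tune as needed */

/* Hypothetical helper: copy a random non-blank line of fp into out, which
   must hold at least MAX_WORD bytes. wc is the number of lines in the file.
   Returns 0 on success, or -1 if a read fails or MAX_RETRIES consecutive
   selections were blank. */
int pick_nonblank(FILE *fp, long wc, char *out) {
    char word[MAX_WORD];

    if (wc <= 0) {
        return -1;
    }
    for (int tries = 0; tries < MAX_RETRIES; tries++) {
        rewind(fp);
        long sel = rand() % wc + 1;
        for (long j = 0; j < sel; j++) {
            if (fgets(word, sizeof word, fp) == NULL) {
                return -1;
            }
        }
        if (word[0] != '\n') {   /* non-blank line: keep it */
            strcpy(out, word);
            return 0;
        }
    }
    return -1;                   /* too many blank selections in a row */
}
```

The selection loop would then call pick_nonblank(fp, wc, randwords[i]) for each i and stop with an error message if it returns -1, instead of decrementing the loop counter.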
