Reading a file line-by-line into an array of strings in C

I'm trying to read the following file line by line into an array of strings, where each line is an element of the array:

AATGC
ATGCC
GCCGT
CGTAC
GTACG
TACGT
ACGTA
CGTAC
GTACG
TACGA
ACGAA

My code is as follows:

void **get_genome(char *filename) {
    FILE *file = fopen(filename, "r");
    int c;
    int line_count = 0;
    int line_length = 0;
    for (c = getc(file); c != EOF; c = getc(file)) {
        if (c == '\n') line_count++;
        else line_length++;
    }
    line_length /= line_count;
    rewind(file);

    char **genome = calloc(line_length * line_count, sizeof(char));
    for (int i = 0; i < line_count; i++) {
        genome[i] = calloc(line_length, sizeof(char));
        fscanf(file, "%s\n", genome[i]);
    }

    printf("%d lines of %d length\n", line_count, line_length);

    for (int i = 0; i < line_count; i++)
        printf("%s\n", genome[i]);
}

However, for some reason I get garbage output for the first 2 elements of the array. The following is my output:

`NP��
�NP��
GCCGT
CGTAC
GTACG
TACGT
ACGTA
CGTAC
GTACG
TACGA
ACGAA

You seem to assume that all lines have the same length. If that is the case, you still have some problems:

  • the memory for the row pointers is allocated incorrectly; it should be

     char **genome = calloc(line_count, sizeof(char *));

    or, better and less error-prone:

     char **genome = calloc(line_count, sizeof(*genome));

  • the memory for each row should be one byte longer, to make room for the null terminator.

  • \n in the fscanf() format string matches any sequence of whitespace characters. It is redundant, as %s skips leading whitespace anyway.

  • it is safer to count items separated by whitespace, to avoid miscounting the items if the file contains any extra blank characters.

  • you do not close file.

  • you do not return genome at the end of the function.

  • you do not check for errors.

Here is a modified version:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

char **get_genome(const char *filename) {
    FILE *file = fopen(filename, "r");
    if (file == NULL)
        return NULL;
    int line_count = 0;    // number of newlines seen so far
    int item_count = 0;
    int item_length = -1;  // length of the first item, -1 until known
    int length = 0;        // length of the item being scanned
    int c;
    while ((c = getc(file)) != EOF) {
        if (isspace(c)) {
            if (length > 0) {  // end of an item; repeated whitespace is ignored
                item_count++;
                if (item_length < 0) {
                    item_length = length;
                } else if (item_length != length) {
                    printf("inconsistent item length on line %d\n", line_count + 1);
                    fclose(file);
                    return NULL;
                }
                length = 0;
            }
            if (c == '\n')
                line_count++;
        } else {
            length++;
        }
    }
    if (length) {
        printf("line %d truncated\n", line_count + 1);
        fclose(file);
        return NULL;
    }
    rewind(file);

    char **genome = calloc(item_count, sizeof(*genome));
    if (genome == NULL) {
        printf("out of memory\n");
        fclose(file);
        return NULL;
    }
    for (int i = 0; i < item_count; i++) {
        genome[i] = calloc(item_length + 1, sizeof(*genome[i]));
        if (genome[i] == NULL) {
            while (i-- > 0) {  // free the rows allocated so far
                free(genome[i]);
            }
            free(genome);
            printf("out of memory\n");
            fclose(file);
            return NULL;
        }
        fscanf(file, "%s", genome[i]);
    }
    fclose(file);

    printf("%d items of %d length on %d lines\n",
           item_count, item_length, line_count);

    for (int i = 0; i < item_count; i++)
        printf("%s\n", genome[i]);

    return genome;
}

The allocation

char **genome = calloc(line_length * line_count, sizeof(char));

must be

char **genome = calloc(line_count, sizeof(char*));

or, more 'secure',

char **genome = calloc(line_count, sizeof(*genome));

in case you change the type of genome.

Otherwise the allocated block is not long enough on a 64-bit system, because line_length is 5 rather than 8 (the size of a pointer), so you write outside of it, which is undefined behavior.
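
To make the size mismatch concrete, here is a minimal sketch of the arithmetic for the sample file (11 lines of 5 characters). The helper names bytes_allocated and bytes_needed are made up for illustration and are not part of the original code.

```c
#include <stddef.h>

/* Bytes reserved by the original call
 * calloc(line_length * line_count, sizeof(char)). */
size_t bytes_allocated(size_t line_count, size_t line_length) {
    return line_length * line_count * sizeof(char);
}

/* Bytes actually needed to store one char pointer per line. */
size_t bytes_needed(size_t line_count) {
    return line_count * sizeof(char *);
}
```

For the sample file, bytes_allocated(11, 5) is 55, while bytes_needed(11) is 11 * sizeof(char *), which is 88 when pointers are 8 bytes wide. The pointers that do not fit are stored past the end of the block.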

You also need to return genome at the end of the function.

It is also possible to skip counting the lines and to use realloc to grow the array while reading the file.
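
A rough sketch of that realloc approach, under a few assumptions of mine: the function name read_lines is hypothetical, lines are assumed to be at most 255 characters long (longer ones would be split), and the array is doubled each time it fills up rather than grown one element at a time.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reads every line of `file` into a growing array.
 * Stores the number of lines in *count and returns the array.
 * Returns NULL on allocation failure (everything is freed first);
 * an empty file also yields NULL, with *count set to 0. */
char **read_lines(FILE *file, size_t *count) {
    char buffer[256];
    char **lines = NULL;
    size_t used = 0, capacity = 0;

    *count = 0;
    while (fgets(buffer, sizeof buffer, file)) {
        buffer[strcspn(buffer, "\r\n")] = '\0';  /* drop the line ending */

        if (used == capacity) {                   /* array full: double it */
            size_t new_capacity = capacity ? capacity * 2 : 8;
            char **tmp = realloc(lines, new_capacity * sizeof(*lines));
            if (tmp == NULL)
                goto fail;
            lines = tmp;
            capacity = new_capacity;
        }

        lines[used] = malloc(strlen(buffer) + 1); /* +1 for the terminator */
        if (lines[used] == NULL)
            goto fail;
        strcpy(lines[used], buffer);
        used++;
    }
    *count = used;
    return lines;

fail:
    while (used > 0)
        free(lines[--used]);
    free(lines);
    return NULL;
}
```

Because each row is sized from strlen(buffer), this version does not depend on all lines having the same length. The caller is responsible for freeing every row and then the array itself.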

As far as I can see, the lines all have the same length. Your function should tell the caller how many lines have been read. There is no need to read the file twice. There is no need for calloc (which is a more expensive function). Always check the result of the memory allocation functions.

Here is a slightly different version of the function:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char **get_genome(char *filename, size_t *line_count) {
    FILE *file = fopen(filename, "r");
    int c;
    size_t line_length = 0;
    char **genome = NULL, **tmp;

    *line_count = 0;
    if(file)
    {
        /* measure the first line; all lines are assumed to have this length */
        while(1)
        {
            c = getc(file);
            if( c == EOF || c == '\n') break;
            line_length++;
        }
        rewind(file);

        while(1)
        {
            /* room for the characters, the newline and the terminator */
            char *line = malloc(line_length + 2);
            if(!line) break;
            if(!fgets(line, line_length + 2, file))
            {
                free(line);
                break;
            }
            line[strcspn(line, "\n")] = 0;  /* strip the trailing newline */
            tmp = realloc(genome, (*line_count + 1) * sizeof(*genome));
            if(!tmp)
            {
                free(line);  /* keep the lines read so far */
                break;
            }
            genome = tmp;
            genome[*line_count] = line;
            *line_count += 1;
        }
        fclose(file);
    }
    return genome;
}
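
Since every line and the pointer array are allocated separately, the caller has to release them in the same way. A possible cleanup helper might look like this; free_genome is a name I made up, not something from the code above.

```c
#include <stdlib.h>

/* Hypothetical helper: releases an array returned by get_genome().
 * `line_count` is the value the function stored through its second argument. */
void free_genome(char **genome, size_t line_count) {
    if (genome == NULL)
        return;
    for (size_t i = 0; i < line_count; i++)
        free(genome[i]);  /* each line was allocated separately */
    free(genome);         /* then the array of pointers itself */
}
```

A caller would keep the count it received from get_genome(filename, &line_count) and pass it to free_genome(genome, line_count) once it is done with the lines.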
