C parsing a huge int array from a string

I'm trying to solve a simple problem. There is an input file equal.in. It consists of two lines: the first contains N, the count of numbers on the next line. The second line contains N numbers separated by single spaces. N is not greater than 3 * 10^5, and each number is not greater than 10^9.

I'm trying to solve this problem in C. I already did it in Python in a minute or so, but I'm struggling in C. I've written a function read_file, which should return a pointer to an array of long numbers and also set the value of the size variable. The program runs smoothly while N is less than 10^4; when it's above that, the array is filled with zeroes except for the first element. What am I doing wrong?

#include <stdio.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <limits.h>

#define MAX_NUMBERS 300000
#define MAX_VALUE 1000000000

long* read_file(char*, long*);
int number_width(int);

int main() {
    long i, size;
    long* numbers = read_file("equal.in", &size);
    printf("Size: %d\n", size);
    printf("Array: ");
    for (i = 0; i < size; i++) {
        printf("%d ", numbers[i]);
    }
    printf("\n");
    free(numbers);
    return 0;
}

long* read_file(char* filename, long* size) {
    FILE* in_file = fopen(filename, "r");
    long l1_width = number_width(MAX_NUMBERS);
    char* line1 = malloc(sizeof(char) * l1_width);
    fgets(line1, l1_width, in_file);
    char *ptr;
    *size = strtol(line1, &ptr, 10);
    free(line1);
    long* numbers = malloc(sizeof(long) * *size);
    long l2_width = (number_width(MAX_VALUE) + 1) * *size;
    char* line2 = malloc(sizeof(char) * l2_width);
    fgets(line2, l2_width, in_file);
    char* token = strtok(line2, " ");
    int i = 0;
    while (token != NULL) {
        *(numbers+i) = strtol(token, &ptr, 10);
        token = strtok(NULL, " ");
        i++;
    }
    free(line2);
    fclose(in_file);
    return numbers;
}

int number_width(int n) {
    if (n < 0) n = (n == INT_MIN) ? INT_MAX : -n;
    if (n < 10) return 1;
    if (n < 100) return 2;
    if (n < 1000) return 3;
    if (n < 10000) return 4;
    if (n < 100000) return 5;
    if (n < 1000000) return 6;
    if (n < 10000000) return 7;
    if (n < 100000000) return 8;
    if (n < 1000000000) return 9;
    return 10;
}

I've deleted the (in my opinion) unnecessary code. Everything else seems to work fine. The problem only happens with a large number on the first line. In case it matters, I wrote a simple Python script to generate equal.in according to the rules, so the contents of the file are OK.

There is an input file equal.in . It consists of two lines: first one contains the number N of numbers in next line. Second line contains N numbers separated by single space. N is not greater than 3 * 10^5, each number is not greater than 10^9.

This makes life easy. The values fit inside a 32-bit int, and 300K numbers means that you will only need about 1.2 MB of data. You can allocate that on the stack, with a fixed size array or a variable length array. If you insist, you can allocate it with malloc() and free it later.

Unless you need to validate that the data meets the input specification, you can do the job quite simply:

#include <stdio.h>

int main(int argc, char **argv)
{
    const char *filename = "equal.in";
    if (argc == 2)
        filename = argv[1];

    FILE *fp = fopen(filename, "r");
    if (fp == 0)
    {
        fprintf(stderr, "%s: failed to open file %s for reading\n", argv[0], filename);
        return 1;
    }

    int num;
    if (fscanf(fp, "%d", &num) != 1)
    {
        fprintf(stderr, "%s: first line of file %s does not start with a number\n", argv[0], filename);
        return 1;
    }

    int data[num];
    for (int i = 0; i < num; i++)
    {
        if (fscanf(fp, "%d", &data[i]) != 1)
        {
            fprintf(stderr, "%s: failed to read entry number %d from file %s\n", argv[0], i+1, filename);
            return 1;
        }
    }

    int length = 0;
    const char *pad = "";
    for (int i = 0; i < num; i++)
    {
        length += printf("%s%d", pad, data[i]);
        pad = " ";
        if (length > 70)
        {
            putchar('\n');
            pad = "";
            length = 0;
        }
    }
    if (length > 0)
        putchar('\n');
    return 0;
}

This uses a VLA. If your compiler doesn't support VLAs, you can go with either the pessimistic approach:

enum { MAX_ENTRIES = 300000 };
int data[MAX_ENTRIES];

or the dynamic approach:

int *data = malloc(num * sizeof(*data));
if (data == 0)
{
    fprintf(stderr, "%s: failed to allocate %zu bytes of memory\n",
            argv[0], num * sizeof(*data));
    return 1;
}

…

free(data);
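
The dynamic approach can be wrapped in a small function. The following is only a sketch: the name read_ints and its interface are hypothetical and not part of the code above, and it assumes the same file layout (a count followed by that many integers).

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not from the original answer): read a count N
 * followed by N integers from fp into a malloc()'d array.
 * Returns the array and stores N in *count, or returns NULL on any failure.
 * The caller must free() the result.
 */
int *read_ints(FILE *fp, int *count)
{
    int num;
    if (fscanf(fp, "%d", &num) != 1 || num <= 0)
        return NULL;

    int *data = malloc((size_t)num * sizeof(*data));
    if (data == NULL)
        return NULL;

    for (int i = 0; i < num; i++)
    {
        if (fscanf(fp, "%d", &data[i]) != 1)
        {
            free(data);     /* avoid leaking on a short or malformed file */
            return NULL;
        }
    }

    *count = num;
    return data;
}
```

Rejecting a non-positive count before calling malloc() is a choice made for this sketch; the code above does not validate the count.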

I created a test file with 279295 entries using a home-grown random number generator:

$ random 10000 300000 | tee equal.in
279295
$ random -n $(<equal.in) 0 999999999 | tr '\n' ' ' >> equal.in
$ echo >> equal.in
$ wc equal.in
       2  279296 2761868 equal.in
$

The first line generates a number in the range 10000 to 300000 and writes it to both equal.in and the terminal. The second line generates that many numbers (-n $(<equal.in); there's a Bash-ism there) in the range 0 to 999,999,999, writing one number per line. The tr command maps the newlines to blanks; the final echo adds a newline to the end of the file. The wc output reports two lines and 279296 'words', meaning numbers, in the file: the count on the first line plus the 279295 values.
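
The random command is a home-grown tool and is not shown here. A rough stand-in can be sketched in C, assuming that rand() is good enough for test data; the function name write_test_data and its parameters are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for the 'random' tool: write a count line followed
 * by n space-separated values in the range 0..999999999 and a final newline.
 * Returns 0 on success, -1 if a write fails.
 */
int write_test_data(FILE *out, int n, unsigned seed)
{
    srand(seed);
    if (fprintf(out, "%d\n", n) < 0)
        return -1;

    for (int i = 0; i < n; i++)
    {
        /* Combine two rand() calls because RAND_MAX may be as small as 32767. */
        unsigned long r = ((unsigned long)rand() << 15) ^ (unsigned long)rand();
        if (fprintf(out, "%s%lu", (i == 0) ? "" : " ", r % 1000000000UL) < 0)
            return -1;
    }
    return (fputc('\n', out) == EOF) ? -1 : 0;
}
```

The values are not uniformly distributed (the modulo introduces a slight bias), which should not matter for exercising the reader.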

I then ran the program, and got the output:

670206318 31176149 386272687 414856040 825173318 954016935 485458470 922293242
795866483 253363938 844512159 323292038 103572404 373917916 142021104 264196634
957800900 482861146 26824834 849885087 789023653 432837903 583262643 117607701
397156307 281517645 721527177 397482085 226290913 94898730 493928208 935264986
408834056 561990394 846038059 431925002 487972136 227567249 578463338 840243525
…
974659784 53079688 549147388 154574314 804309064 164345737 378554521 729437495
504219874 234692365 141938083 85093023 95609608 860865295 742893260 69909938
48374552 461946331 407898852 575861228 335672877 983186286 679276932 946629117
247591685 299343487 335924507 161837591 435945210 340851167 747313445 454000003
837746407 249404999 860823559 923922564 150303869 762266074 739320218

The longest line in the output was 79 characters (excluding the newline). With the output redirected to /dev/null, it took about 0.124 seconds to read and print that data. With the output going to the screen, it took about 0.214s.

Note the error reporting, including the program name in the error reports, which are written to standard error. I avoided using exit() but would normally need to do so. In my own code, I would replace those 4-line blocks of error reporting code with single line calls to the functions from stderr.c and stderr.h available from GitHub. I'd probably also validate the argument list, objecting to more than one argument rather than simply ignoring all the arguments. Like this:

#include <stdio.h>
#include "stderr.h"

int main(int argc, char **argv)
{
    err_setarg0(argv[0]);

    const char *filename = "equal.in";
    if (argc == 2)
        filename = argv[1];
    else if (argc > 2)
        err_usage("[file]");

    FILE *fp = fopen(filename, "r");
    if (fp == 0)
        err_error("failed to open file %s for reading\n", filename);

    int num;
    if (fscanf(fp, "%d", &num) != 1)
        err_error("first line of file %s does not start with a number\n", filename);

    int data[num];
    for (int i = 0; i < num; i++)
    {
        if (fscanf(fp, "%d", &data[i]) != 1)
            err_error("failed to read entry number %d from file %s\n", i+1, filename);
    }

    int length = 0;
    const char *pad = "";
    for (int i = 0; i < num; i++)
    {
        length += printf("%s%d", pad, data[i]);
        pad = " ";
        if (length > 70)
        {
            putchar('\n');
            pad = "";
            length = 0;
        }
    }
    if (length > 0)
        putchar('\n');
    return 0;
}

Note that reporting the file name can help the user identify which file to look at. That means you should almost never use a string literal for the file name passed to fopen() because you would have to repeat it in the error reporting, which isn't good.

The problem that you are observing is:

The program runs smoothly while the number N is less than 10^4, when it's above that, the array is filled with zeroes except the first element.

It is occurring because of this:

long l1_width = number_width(MAX_NUMBERS);
char* line1 = malloc(sizeof(char) * l1_width);
fgets(line1, l1_width, in_file);

Here l1_width will always be 6 because MAX_NUMBERS is 300000.

fgets:

Syntax: char * fgets ( char * str, int num, FILE * stream );

Description: Reads characters from stream and stores them as a C string into str until (num-1) characters have been read or either a newline or the end-of-file is reached, whichever happens first.

Now, consider the scenario where the first line of your file equal.in contains a number less than 10^4, i.e. a number of at most four digits. In that case the program works fine.

So, assume the first line of equal.in contains 1000.

The total number of characters in the first line of equal.in is then 5 (1000 + newline).

The program works fine here because line1 is allocated 6 characters and fgets reads at most num - 1 (i.e. 5) characters.

So, while reading 1000 as the first line, fgets hits the newline character.

But when the number is greater than or equal to 10^4, fgets reaches the num - 1 limit before hitting the newline character, because l1_width is 6.

The point to note here is that the remaining characters of the first line of equal.in, including its newline character, have not yet been read.

In the next call to fgets():

fgets(line2, l2_width, in_file);

It reads the remaining characters of the first line of equal.in. Because l2_width is a big number, fgets() stops at the newline that ends the first line, so line2 contains only the remainder of the first line.

So the second line of equal.in is never read when the number on the first line is greater than or equal to 10^4.
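
The behaviour described above can be reproduced in isolation. This is an illustrative sketch only: the function two_reads and its fixed 64-byte buffers are assumptions made for the demonstration, not code from the question.

```c
#include <stdio.h>
#include <string.h>

/*
 * Illustration only: call fgets() twice on fp, the first time with the
 * limit width1 (as the question does for line 1), and copy what each call
 * returned into first and second (each must hold at least 64 bytes).
 */
void two_reads(FILE *fp, int width1, char *first, char *second)
{
    char buf[64];

    first[0] = second[0] = '\0';
    if (width1 < 2)
        return;                         /* nothing useful can be read */
    if (width1 > (int)sizeof(buf))
        width1 = (int)sizeof(buf);

    if (fgets(buf, width1, fp) != NULL)
        strcpy(first, buf);             /* at most width1 - 1 characters */
    if (fgets(buf, (int)sizeof(buf), fp) != NULL)
        strcpy(second, buf);            /* whatever is left up to the next newline */
}
```

With a file that starts with the line 10000 and width1 set to 6, first receives "10000" without the newline and second receives only the leftover newline, so the line of numbers is still unread. With a first line of 1000, first receives "1000\n" and second receives the line of numbers.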

To fix this problem, first you need to return the width + 1 from the number_width() function, so that the width leaves room for the newline character.

It should be:

int number_width(int n) {
    if (n < 0) n = (n == INT_MIN) ? INT_MAX : -n;
    if (n < 10) return 2;
    if (n < 100) return 3;
    if (n < 1000) return 4;
    if (n < 10000) return 5;
    if (n < 100000) return 6;
    if (n < 1000000) return 7;
    if (n < 10000000) return 8;
    if (n < 100000000) return 9;
    if (n < 1000000000) return 10;
    return 11;
}

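Second, you need to allocate l1_width + 1 characters for line1, the extra one being for the terminating null character, and pass that same size to fgets so that it can also consume the newline:

char* line1 = malloc(sizeof(char) * (l1_width + 1));
fgets(line1, l1_width + 1, in_file);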
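
Separately from the buffer sizes, the strtok() loop in the question stores into numbers without checking the index against size. A bounded alternative can be sketched with strtol() alone, using its end pointer to step through the string; the name parse_longs and its interface are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse up to max whitespace-separated integers from
 * the string s into out. Returns the number of values stored.
 * Parsing stops at the first token that is not a number.
 */
size_t parse_longs(const char *s, long *out, size_t max)
{
    size_t n = 0;
    while (n < max)
    {
        char *end;
        errno = 0;
        long value = strtol(s, &end, 10);
        if (end == s || errno == ERANGE)   /* no digits found, or out of range */
            break;
        out[n++] = value;
        s = end;                           /* continue after the parsed number */
    }
    return n;
}
```

In the question's read_file() this could be called as parse_longs(line2, numbers, (size_t)*size), and comparing the return value with *size would show whether the whole second line was actually read.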

I can see that Jonathan has already given a better way to process the equal.in file. This answer is just to let you know where the problem occurs in your code, so that you can watch out for such things in future.
