简体   繁体   中英

How to read columns from a text file and save to separate arrays in C?

In a practice exercise to familiarize myself with pointers, I wrote a short program in C, able to read text from a file. I would like to stick to ANSI C.

The program does its job perfectly, however I want to proceed to read columns from a text file and save to separate arrays. Similar questions have been asked, with replies using strtok , or fgets or sscanf , but when should I use one instead of the other?

Here is my commented code:

#include <stdio.h>
#include <stdlib.h>

char *read_file(char *FILE_INPUT);     /*function to read file*/

int main(int argc, char **argv) {
    char *string; // Pointer to a char 

    string = read_file("file.txt");
    if (string) {
        // Writes the string pointed to by string to the stream pointed to by stdout, and appends a new-line character to the output.
        puts(string);
        // Causes space pointed to by string to be deallocated
        free(string);
    }
    return 0;
}

//Returns a pointer to a char,
char *read_file(char *FILE_INPUT) {
    char *buffer = NULL;
    int string_size, read_size;
    FILE *input_stream = fopen(FILE_INPUT, "r");

    //Check if file exists
    if (input_stream == NULL) {
        perror (FILE_INPUT);
    }
    else if (input_stream) {
        // Seek the last byte of the file. Offset is 0 for a text file.
        fseek(input_stream, 0, SEEK_END);
        // Finds out the position of file pointer in the file with respect to starting of the file
        // We get an idea of string_size since ftell returns the last value of the file pos
        string_size = ftell(input_stream);
        // sets the file position indicator for the stream to the start of the file
        rewind(input_stream);

        // Allocate a string that can hold it all
        // malloc returns a pointer to a char, +1 to hold the NULL character
        // (char*) is the cast return type, this is extra, used for humans
        buffer = (char*)malloc(sizeof(char) * (string_size + 1));

        // Read it all in one operation, returns the number of elements successfully read,
        // Reads into buffer, up to string_size whose size is specified by sizeof(char), from the input_stream !
        read_size = fgets(buffer, sizeof(char), string_size, input_stream);

        // fread doesn't set it so put a \0 in the last position
        // and buffer is now officially a string
        buffer[string_size] = '\0';

        //string_size determined by ftell should be equal to read_size from fread
        if (string_size != read_size) {
            // Something went wrong, throw away the memory and set
            // the buffer to NULL
            free(buffer);
            buffer = NULL;
        }

        // Always remember to close the file.
        fclose(input_stream);
    }

    return buffer;
}

How can I read all columns from a text file of this format, into an array? Number of columns is fixed, but number of rows can vary.

C 08902019 1020 50 Test1
A 08902666 1040 30 Test2
B 08902768 1060 80 Test3
.
B 08902768 1060 800 Test3000
.
.

On further research, I found that fread is used to allow a program to read and write large blocks of data in a single step, so reading columns separately may not be what fread is intended to do. Thus my program implementation for this kind of job is wrong.

Should I use getc , strtok , sscanf or getline to read such a text file? I am trying to stick to good programming principles and allocate memory dynamically.


EDIT:

By correct I am mean (but not limited to) using good c programming techniques and dynamic memory allocation.

My first thought was to replace fread with fgets . Update, I am getting somewhere thanks to your help.

    // Allocate a string that can hold it all
    // malloc returns a pointer to a char, +1 to hold the NULL    character
    // (char*) is the cast return type, this is extra, used for humans
    buffer = (char*)malloc(sizeof(char) * (string_size + 1));

    while (fgets(buffer, sizeof(char) * (string_size + 1), input_stream), input_stream)) {
        printf("%s", buffer);     
    }

for the above text file prints:

C 08902019 1020 50 Test1

A 08902666 1040 30 Test2

B 08902768 1060 80 Test3

B 08902768 1060 800 Test3000

I also managed to remove the newline character from fgets() input using:

strtok(buffer, "\n"); 

Similar examples here , here and here

How can I proceed to save the columns to separate arrays?

"Best Practices" is somewhat subjective, but "fully validated, logical and readable" should always be the goal.

For reading a fixed number of fields (in your case choosing cols 1, 2, 5 as string values of unknown length) and cols 3, 4 as simple int values), you can read an unknown number of rows from a file simply by allocating storage for some reasonably anticipated number of rows of data, keeping track of how many rows are filled, and then reallocating storage, as required, when you reach the limit of the storage you have allocated.

An efficient way of handling the reallocation is to reallocate by some reasonable number of additional blocks of memory when reallocation is required (rather than making calls to realloc for every additional row of data). You can either add a fixed number of new blocks, multiply what you have by 3/2 or 2 or some other sane scheme that meets your needs. I generally just double the storage each time the allocation limit is reached.

Since you have a fixed number of fields of unknown size, you can make things easy by simply separating the five-fields with sscanf and validating that 5 conversions took place by checking the sscanf return. If you were reading an unknown number of fields, then you would just use the same reallocation scheme to handle the column-wise expansion discussed above for reading an unknown number of rows.

(there is no requirement that any row have the same number of fields in that case, but you can force a check by setting a variable containing the number of fields read with the first row, and then validating that all subsequent rows have that same number...)

As discussed in the comments, reading a line of data with a line-oriented input function like fgets or POSIX getline and then parsing the data by either tokenizing with strtok , or in this case with a fixed number of fields simply parsing with sscanf is a solid approach. It provides the benefit of allowing independent validation of (1) the read of data from the file; and (2) the parse of data into the needed values. (while less flexible, for some data sets, you can do this in a single step with fscanf , but that also injects the scanf problems for user input of what remains unread in the input buffer depending on the conversion-specifiers used...)

The easiest way to approach storage of your 5-fields is to declare a simple struct . Since the number of characters for each of the character fields is unknown, the struct members for each of these fields will be a character pointer, and the remaining fields int , eg

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ARRSZ   2   /* use 8 or more, set to 2 here to force realloc */
#define MAXC 1024

typedef struct {
    char *col1, *col2, *col5;
    int col3, col4;
} mydata_t;

Now you can start your allocation for handling an unknown number of these by allocating for some reasonably anticipated amount (I would generally use 8 or 16 with a doubling scheme as that will grow reasonably fast), but we have chosen 2 here with #define ARRSZ 2 to make sure we force one reallocation when handling your 3-line data file. Note also we are setting a maximum number of characters per-line of #define MAXC 1024 for your data ( don't skimp on buffer size )

To get started, all we need to do is declare a buffer to hold each line, and a few variables to track the currently allocated number of structs, a line counter (to output accurate error messages) and a counter for the number of rows of data we have filled. Then when (rows_filled == allocated_array_size) you realloc , eg

int main (int argc, char **argv) {

    char buf[MAXC];
    size_t arrsz = ARRSZ, line = 0, row = 0;
    mydata_t *data = NULL;
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }

    /* allocate an 'arrsz' initial number of struct */
    if (!(data = malloc (arrsz * sizeof *data))) {
        perror ("malloc-data");
        return 1;
    }

    while (fgets (buf, MAXC, fp)) {         /* read each line from file */
        char c1[MAXC], c2[MAXC], c5[MAXC];  /* temp strings for c1,2,5 */
        int c3, c4;                         /* temp ints for c3,4 */
        size_t len = strlen (buf);          /* length for validation */

        line++;     /* increment line count */

        /* validate line fit in buffer */
        if (len && buf[len-1] != '\n' && len == MAXC - 1) {
            fprintf (stderr, "error: line %zu exceeds MAXC chars.\n", line);
            return 1;
        }

        if (row == arrsz) { /* check if all pointers used */
            void *tmp = realloc (data, arrsz * 2 * sizeof *data);
            if (!tmp) {     /* validate realloc succeeded */
                perror ("realloc-data");
                break;      /* break, don't exit, data still valid */
            }
            data = tmp;     /* assign realloc'ed block to data */
            arrsz *= 2;     /* update arrsz to reflect new allocation */
        }

( note: when calling realloc , you never realloc the pointer itself , eg data = realloc (data, new_size); If realloc fails (and it does), it returns NULL which would overwrite your original pointer causing a memory leak. Always realloc with a temporary pointer, validate, then assign the new block of memory to your original pointer)

What remains is just splitting the line into our values, handling any errors in line format, adding our field values to our array of struct, increment our row/line counts and repeating until we run out of data to read, eg

        /* parse buf into fields, handle error on invalid format of line */
        if (sscanf (buf, "%1023s %1023s %d %d %1023s", 
                    c1, c2, &c3, &c4, c5) != 5) {
            fprintf (stderr, "error: invalid format line %zu\n", line);
            continue;   /* get next line */
        }

        /* allocate copy strings, assign allocated blocks to pointers */
        if (!(data[row].col1 = mystrdup (c1))) { /* validate copy of c1 */
            fprintf (stderr, "error: malloc-c1 line %zu\n", line);
            break;      /* same reason to break not exit */
        }
        if (!(data[row].col2 = mystrdup (c2))) { /* validate copy of c2 */
            fprintf (stderr, "error: malloc-c1 line %zu\n", line);
            break;      /* same reason to break not exit */
        }
        data[row].col3 = c3;    /* assign integer values */
        data[row].col4 = c4;
        if (!(data[row].col5 = mystrdup (c5))) { /* validate copy of c5 */
            fprintf (stderr, "error: malloc-c1 line %zu\n", line);
            break;      /* same reason to break not exit */
        }
        row++;      /* increment number of row pointers used */
    }
    if (fp != stdin)    /* close file if not stdin */
        fclose (fp);

    puts ("values stored in struct\n");
    for (size_t i = 0; i < row; i++)
        printf ("%-4s %-10s %4d %4d %s\n", data[i].col1, data[i].col2, 
                data[i].col3, data[i].col4, data[i].col5);

    freemydata (data, row);

    return 0;
}

And we are done (except for the memory use/error check)

Note about there are two helper functions to allocate for each string and copy each string to its allocated block of memory and assigning the starting address for that block to our pointer in our struct. mystrdup() You can use strdup() if you have it, I simply included the function to show you how to manually handle the malloc and copy. Note: how the copy is done with memcpy instead of strcpy -- Why? You already scanned forward in the string to find '\\0' when you found the length with strlen -- no need to have strcpy repeat that process again -- just use memcpy .

/* simple implementation of strdup - in the event you don't have it */
char *mystrdup (const char *s)
{
    if (!s)     /* validate s not NULL */
        return NULL;

    size_t len = strlen (s);            /* get length */
    char *sdup = malloc (len + 1);      /* allocate length + 1 */

    if (!sdup)          /* validate */
        return NULL;

    return memcpy (sdup, s, len + 1);   /* pointer to copied string */ 
}

Last helper function is freemydata() which just calls free() on each allocated block to insure you free all the memory you have allocated. It also keeps you code tidy. (you can do the same for the realloc block to move that to it's own function as well)

/* simple function to free all data when done */
void freemydata (mydata_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {    /* free allocated strings */
        free (data[i].col1);
        free (data[i].col2);
        free (data[i].col5);
    }
    free (data);    /* free structs */
}

Putting all the pieces together would give you:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ARRSZ   2   /* use 8 or more, set to 2 here to force realloc */
#define MAXC 1024

typedef struct {
    char *col1, *col2, *col5;
    int col3, col4;
} mydata_t;

/* simple implementation of strdup - in the event you don't have it */
char *mystrdup (const char *s)
{
    if (!s)     /* validate s not NULL */
        return NULL;

    size_t len = strlen (s);            /* get length */
    char *sdup = malloc (len + 1);      /* allocate length + 1 */

    if (!sdup)          /* validate */
        return NULL;

    return memcpy (sdup, s, len + 1);   /* pointer to copied string */ 
}

/* simple function to free all data when done */
void freemydata (mydata_t *data, size_t n)
{
    for (size_t i = 0; i < n; i++) {    /* free allocated strings */
        free (data[i].col1);
        free (data[i].col2);
        free (data[i].col5);
    }
    free (data);    /* free structs */
}

int main (int argc, char **argv) {

    char buf[MAXC];
    size_t arrsz = ARRSZ, line = 0, row = 0;
    mydata_t *data = NULL;
    /* use filename provided as 1st argument (stdin by default) */
    FILE *fp = argc > 1 ? fopen (argv[1], "r") : stdin;

    if (!fp) {  /* validate file open for reading */
        perror ("file open failed");
        return 1;
    }

    /* allocate an 'arrsz' initial number of struct */
    if (!(data = malloc (arrsz * sizeof *data))) {
        perror ("malloc-data");
        return 1;
    }

    while (fgets (buf, MAXC, fp)) {         /* read each line from file */
        char c1[MAXC], c2[MAXC], c5[MAXC];  /* temp strings for c1,2,5 */
        int c3, c4;                         /* temp ints for c3,4 */
        size_t len = strlen (buf);          /* length for validation */

        line++;     /* increment line count */

        /* validate line fit in buffer */
        if (len && buf[len-1] != '\n' && len == MAXC - 1) {
            fprintf (stderr, "error: line %zu exceeds MAXC chars.\n", line);
            return 1;
        }

        if (row == arrsz) { /* check if all pointers used */
            void *tmp = realloc (data, arrsz * 2 * sizeof *data);
            if (!tmp) {     /* validate realloc succeeded */
                perror ("realloc-data");
                break;      /* break, don't exit, data still valid */
            }
            data = tmp;     /* assign realloc'ed block to data */
            arrsz *= 2;     /* update arrsz to reflect new allocation */
        }

        /* parse buf into fields, handle error on invalid format of line */
        if (sscanf (buf, "%1023s %1023s %d %d %1023s", 
                    c1, c2, &c3, &c4, c5) != 5) {
            fprintf (stderr, "error: invalid format line %zu\n", line);
            continue;   /* get next line */
        }

        /* allocate copy strings, assign allocated blocks to pointers */
        if (!(data[row].col1 = mystrdup (c1))) { /* validate copy of c1 */
            fprintf (stderr, "error: malloc-c1 line %zu\n", line);
            break;      /* same reason to break not exit */
        }
        if (!(data[row].col2 = mystrdup (c2))) { /* validate copy of c2 */
            fprintf (stderr, "error: malloc-c1 line %zu\n", line);
            break;      /* same reason to break not exit */
        }
        data[row].col3 = c3;    /* assign integer values */
        data[row].col4 = c4;
        if (!(data[row].col5 = mystrdup (c5))) { /* validate copy of c5 */
            fprintf (stderr, "error: malloc-c1 line %zu\n", line);
            break;      /* same reason to break not exit */
        }
        row++;      /* increment number of row pointers used */
    }
    if (fp != stdin)    /* close file if not stdin */
        fclose (fp);

    puts ("values stored in struct\n");
    for (size_t i = 0; i < row; i++)
        printf ("%-4s %-10s %4d %4d %s\n", data[i].col1, data[i].col2, 
                data[i].col3, data[i].col4, data[i].col5);

    freemydata (data, row);

    return 0;
}

Now test.

Example Input File

$ cat dat/fivefields.txt
C 08902019 1020 50 Test1
A 08902666 1040 30 Test2
B 08902768 1060 80 Test3

Example Use/Output

$ ./bin/fgets_fields <dat/fivefields.txt
values stored in struct

C    08902019   1020   50 Test1
A    08902666   1040   30 Test2
B    08902768   1060   80 Test3

Memory Use/Error Check

In any code you write that dynamically allocates memory, you have 2 responsibilities regarding any block of memory allocated: (1) always preserve a pointer to the starting address for the block of memory so, (2) it can be freed when it is no longer needed.

It is imperative that you use a memory error checking program to insure you do not attempt to access memory or write beyond/outside the bounds of your allocated block, attempt to read or base a conditional jump on an uninitialized value, and finally, to confirm that you free all the memory you have allocated.

For Linux valgrind is the normal choice. There are similar memory checkers for every platform. They are all simple to use, just run your program through it.

$ valgrind ./bin/fgets_fields <dat/fivefields.txt
==1721== Memcheck, a memory error detector
==1721== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==1721== Using Valgrind-3.12.0 and LibVEX; rerun with -h for copyright info
==1721== Command: ./bin/fgets_fields
==1721==
values stored in struct

C    08902019   1020   50 Test1
A    08902666   1040   30 Test2
B    08902768   1060   80 Test3
==1721==
==1721== HEAP SUMMARY:
==1721==     in use at exit: 0 bytes in 0 blocks
==1721==   total heap usage: 11 allocs, 11 frees, 243 bytes allocated
==1721==
==1721== All heap blocks were freed -- no leaks are possible
==1721==
==1721== For counts of detected and suppressed errors, rerun with: -v
==1721== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Always confirm that you have freed all memory you have allocated and that there are no memory errors.

Look things over and let me know if you have further questions.

I want to proceed to read only certain columns of this text file.

You can do this with any input function: getc , fgets , sscanf , getline ... but you must first define exactly what you mean by certain columns .

  • columns can be defined as separated by a specific character such as , , ; or TAB, in which case strtok() is definitely not the right choice because it treats all sequences of separating characters as a single separator: hence a,,b would be seen as having only 2 columns.
  • if they are instead separated by whitespace, any sequence of spaces or tabs, strtok , strpbrk or strspn / strcspn might come in handy.

In any case, you can read the file line by line with fgets but you might have a problem with very long lines. getline is a solution, but it might not be available on all systems.

If you know what is column separator and how many columns you have you use getline with column separator and then with line separator.

Here is getline :

http://man7.org/linux/man-pages/man3/getline.3.html

It is very good because it allocates space for you, no need to know how many bytes is your column or line.

Or you just use getline as in code example in link to read whole line then you "parse" and extract columns as you wish....

If you paste exactly how you want to run program with input you show I can try write fast C program for good answer. Now it is just comment-style answer with too many words for comment :-(

Or is it somehow you cannot use library?

Although while waiting for better question I will note that you can use awk to read columns from text file but probably this is not what you want? Because what are you trying to do really?

根据数据和胆量,您可以使用scanf或使用yacc / lex创建的解析器。

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM