
Reading from a CSV file and separating the fields to store in a struct in C

I am trying to read from a CSV file and store each field in a member of a struct. I am using fgets and strtok to separate the fields. However, I cannot handle a field that itself contains a comma.

typedef struct {
    char name[20+1];
    char surname[20+1];
    char uniqueId[10+1];
    char address[150+1];
} employee_t;

void readFile(FILE *fp, employee_t *employees[]){
    int i=0;
    char buffer[205];
    char *tmp;
    
    while (fgets(buffer,205,fp) != NULL) {
        employee_t *new = (employee_t *)malloc(sizeof(*new));

        tmp = strtok(buffer,",");
        strcpy(new->name,tmp);

        tmp = strtok(NULL,",");   // NULL continues tokenizing the same buffer
        strcpy(new->surname,tmp);

        tmp = strtok(NULL,",");
        strcpy(new->uniqueId,tmp);

        tmp = strtok(NULL,",");   // still splits on the comma inside the quotes
        strcpy(new->address,tmp);

        employees[i++] = new;     // do not free(new) here: employees[] still points to it
    }
}

The inputs are as follows:

Jim,Hunter,9239234245,"8/1 Hill Street, New Hampshire"
Jay,Rooney,92364434245,"122 McKay Street, Old Town"
Ray,Bundy,923912345,NOT SPECIFIED

I tried printing the tokens with this code and I get this:

Jim 
Hunter 
9239234245
"8/1 Hill Street
 New Hampshire"

I am not sure how to handle the address field, since some addresses contain a comma. I tried reading character by character, but I am not sure how to insert the strings into the struct using a single loop. Can someone suggest how to fix this?

strcspn can be used to find either the closing double quote of a quoted field or, outside quotes, the next comma or double quote.
The original string is not modified, so string literals can be used.
The position of the double quotes is not significant. They can be in any field.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main( void) {

    char *string[] = {
        "Jim,Hunter,9239234245,\"8/1 Hill Street, New Hampshire\""
        , "Jay,Rooney,92364434245,\"122 McKay Street, Old Town\""
        , "Ray,Bundy,923912345,NOT SPECIFIED"
        , "Ray,Bundy,\" double quote here\",NOT SPECIFIED"
    };

    for ( int each = 0; each < 4; ++each) {
        char *token = string[each];
        char *p = string[each];

        while ( *p) {
            if ( '\"' == *p) {//at a double quote
                p += strcspn ( p + 1, "\"");//advance to next double quote
                p += 2;//to include the opening and closing double quotes
            }
            else {
                p += strcspn ( p, ",\"");//advance to a comma or double quote
            }
            int span = ( int)( p - token);
            if ( span) {
                printf ( "token:%.*s\n", span, token);//print span characters

                //copy to another array
            }
            if ( *p) {//not at terminating zero
                ++p;//do not skip consecutive delimiters

                token = p;//start of next token
            }
        }
    }
    return 0;
}

EDIT: copy to variables
A counter can be used to keep track of fields as they are processed.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZENAME 21
#define SIZEID 11
#define SIZEADDR 151

typedef struct {
    char name[SIZENAME];
    char surname[SIZENAME];
    char uniqueId[SIZEID];
    char address[SIZEADDR];
} employee_t;

int main( void) {

    char *string[] = {
        "Jim,Hunter,9239234245,\"8/1 Hill Street, New Hampshire\""
        , "Jay,Rooney,92364434245,\"122 McKay Street, Old Town\""
        , "Ray,Bundy,923912345,NOT SPECIFIED"
        , "Ray,Bundy,\"quote\",NOT SPECIFIED"
    };
    employee_t *employees = malloc ( sizeof *employees * 4);
    if ( ! employees) {
        fprintf ( stderr, "problem malloc\n");
        return 1;
    }

    for ( int each = 0; each < 4; ++each) {
        char *token = string[each];
        char *p = string[each];
        int field = 0;

        while ( *p) {
            if ( '\"' == *p) {
                p += strcspn ( p + 1, "\"");//advance to the closing double quote
                p += 2;//to include the opening and closing double quotes
            }
            else {
                p += strcspn ( p, ",\"");//advance to a delimiter
            }
            int span = ( int)( p - token);
            if ( span) {
                ++field;
                if ( 1 == field) {
                    if ( span < SIZENAME) {
                        strncpy ( employees[each].name, token, span);
                        employees[each].name[span] = 0;
                        printf ( "copied:%s\n", employees[each].name);//print the copied field
                    }
                }
                if ( 2 == field) {
                    if ( span < SIZENAME) {
                        strncpy ( employees[each].surname, token, span);
                        employees[each].surname[span] = 0;
                        printf ( "copied:%s\n", employees[each].surname);//print the copied field
                    }
                }
                if ( 3 == field) {
                    if ( span < SIZEID) {
                        strncpy ( employees[each].uniqueId, token, span);
                        employees[each].uniqueId[span] = 0;
                        printf ( "copied:%s\n", employees[each].uniqueId);//print the copied field
                    }
                }
                if ( 4 == field) {
                    if ( span < SIZEADDR) {
                        strncpy ( employees[each].address, token, span);
                        employees[each].address[span] = 0;
                        printf ( "copied:%s\n", employees[each].address);//print the copied field
                    }
                }
            }
            if ( *p) {//not at terminating zero
                ++p;//do not skip consecutive delimiters

                token = p;//start of next token
            }
        }
    }
    free ( employees);
    return 0;
}
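
Note that the code above stores a quoted field with its double quotes included. If the stored address should not carry them, a small helper could strip one enclosing pair while copying. This is only a sketch: copy_field is a hypothetical name, not part of the code above, and it assumes quotes only ever wrap a whole field.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (an assumption, not from the answer above):
 * copy a token of `span` characters into dst, dropping one pair of
 * enclosing double quotes if present.
 * Returns 0 on success, -1 if the result would not fit in dst. */
int copy_field(char *dst, size_t dstsize, const char *token, size_t span)
{
    if (span >= 2 && token[0] == '"' && token[span - 1] == '"') {
        ++token;        /* skip the opening quote */
        span -= 2;      /* drop both quotes from the length */
    }
    if (span >= dstsize) {
        return -1;      /* too long: leave dst untouched */
    }
    memcpy(dst, token, span);
    dst[span] = '\0';
    return 0;
}
```

In the loop above, a call such as copy_field ( employees[each].address, SIZEADDR, token, span) could stand in for the strncpy and the manual terminator. The return value also reports a field that is too long, for example the 11-digit id in the second input line, which does not fit in uniqueId.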

In my view, this kind of problem calls for a "proper" tokenizer, perhaps based on a finite state machine (FSM). In this case you'd scan the input string character by character, assigning each character to a class. The tokenizer would start in a particular state and, according to the class of the character read, it might stay in the same state, or move to a new state. That is, the state transitions are controlled by the combination of the current state and the character under consideration.

For example, if you read a double-quote in the starting state, you transition to the "in a quoted string" state. In that state, the comma would not cause a transition to a new state -- it would just get added to the token you're building. In any other state, the comma would have a particular significance, as denoting the end of a token. You'd have to figure out when you needed to swallow additional whitespace between tokens, whether there was some "escape" that allowed a double-quote to be used in some other token, whether you could escape the end-of-line to make longer lines, and so on.
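
To make the idea concrete, here is a minimal sketch of such a state machine in C. Everything in it is an assumption for illustration: the names csv_split, FIELD_MAX and the three states are made up, a doubled double-quote ("") inside a quoted field is treated as the escape for a literal quote, whitespace is kept as-is, and a line break always ends the record.

```c
#include <stddef.h>

#define FIELD_MAX 152   /* assumed per-field capacity, terminator included */

/* Illustrative states; the names are not from the answer. */
enum state { UNQUOTED, QUOTED, QUOTE_SEEN };

/* Split one CSV line into fields[0..maxfields-1].
 * Returns the number of fields, or -1 on an unterminated quote,
 * an oversized field, or more fields than maxfields. */
int csv_split(const char *line, char fields[][FIELD_MAX], int maxfields)
{
    enum state st = UNQUOTED;
    int nfields = 0;
    size_t len = 0;

    if (maxfields < 1) return -1;

    for (const char *p = line; ; ++p) {
        char c = *p;
        int eol = (c == '\0' || c == '\n' || c == '\r');
        int append = 0, end_field = 0;

        switch (st) {
        case UNQUOTED:
            if (c == '"')               st = QUOTED;   /* opening quote, not stored */
            else if (c == ',' || eol)   end_field = 1;
            else                        append = 1;
            break;
        case QUOTED:
            if (eol)                    return -1;     /* unterminated quote */
            else if (c == '"')          st = QUOTE_SEEN;
            else                        append = 1;    /* a comma is plain data here */
            break;
        case QUOTE_SEEN:                               /* just saw " inside quotes */
            if (c == '"')             { st = QUOTED; append = 1; }   /* "" is a literal " */
            else if (c == ',' || eol) { st = UNQUOTED; end_field = 1; }
            else                      { st = UNQUOTED; append = 1; }
            break;
        }

        if (append) {
            if (len + 1 >= FIELD_MAX) return -1;       /* field too long */
            fields[nfields][len++] = c;
        }
        if (end_field) {
            fields[nfields][len] = '\0';
            ++nfields;
            len = 0;
            if (eol) return nfields;
            if (nfields == maxfields) return -1;       /* too many fields */
        }
    }
}
```

For the question's data, one would declare char f[4][FIELD_MAX]; and call csv_split ( buffer, f, 4) on each line read by fgets, then copy f[0] to f[3] into the struct after checking each length against the destination size. Unlike the strtok approach, empty fields are kept, so "a,,b" yields three fields.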

The important point is that, if you implement this as an FSM (or another, real tokenizer), you can actually consider all these things and implement them as you need. If you use ad-hoc applications of strtok() and string searching, you can't -- not in an elegant, maintainable way, anyway.

And if, one day, you end up needing to do the whole job using wide characters, that's easy -- just convert the input into wide characters and iterate it one wide character (not byte) at a time.
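
As a rough illustration of that point, the same kind of scan over wchar_t might look like the sketch below. It assumes the line has already been converted to a wide string (for example with mbstowcs under a suitable locale), and wcsv_count_fields is a made-up name. It only counts fields, using a two-state "inside quotes or not" machine.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical example: count the comma-separated fields in a wide
 * string, ignoring commas that appear between double quotes. */
size_t wcsv_count_fields(const wchar_t *line)
{
    int in_quotes = 0;      /* the whole state of this tiny machine */
    size_t count = 1;       /* a line always has at least one field */

    for (const wchar_t *p = line; *p != L'\0'; ++p) {
        if (*p == L'"') {
            in_quotes = !in_quotes;     /* enter or leave a quoted field */
        } else if (*p == L',' && !in_quotes) {
            ++count;                    /* a separator outside quotes */
        }
    }
    return count;
}
```

The loop advances one wchar_t at a time, so a multi-byte character in the original input is never split in the middle.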

It's easy to document the behaviour of an FSM parser using a state transition diagram -- easier, at least, than trying to explain it by documenting code in text.

My experience is that the first time somebody implements an FSM tokenizer, it's horrible. After that, it's easy. And you can use the same technique to parse input of much greater complexity when you know the method.
