
Reading from a CSV file and separating the fields to store in a struct in C

I am trying to read from a CSV file and store each field in a member of a struct. I am using fgets and strtok to separate the fields. However, I cannot handle a field that itself contains a comma.

typedef struct {
    char name[20+1];
    char surname[20+1];
    char uniqueId[10+1];
    char address[150+1];
} employee_t;

void readFile(FILE *fp, employee_t *employees[]){
    int i=0;
    char buffer[205];
    char *tmp;
    
    while (fgets(buffer,205,fp) != NULL) {
        employee_t *new = (employee_t *)malloc(sizeof(*new));
        
        tmp = strtok(buffer,",");   // first call: pass the buffer
        strcpy(new->name,tmp);
        
        tmp = strtok(NULL,",");     // later calls: pass NULL to continue in the same buffer
        strcpy(new->surname,tmp);
        
        tmp = strtok(NULL,",");
        strcpy(new->uniqueId,tmp);

        tmp = strtok(NULL,",");     // splits a quoted address at its inner comma
        strcpy(new->address,tmp);

        employees[i++] = new;       // the caller owns the record and frees it later
    }
}

The inputs are as follows:

Jim,Hunter,9239234245,"8/1 Hill Street, New Hampshire"
Jay,Rooney,92364434245,"122 McKay Street, Old Town"
Ray,Bundy,923912345,NOT SPECIFIED

I tried printing the tokens with this code and I get this:

Jim 
Hunter 
9239234245
"8/1 Hill Street
 New Hampshire"

I am not sure how to handle the address field, since some addresses contain a comma. I tried reading character by character, but I am not sure how to insert the strings into the struct using a single loop. Can someone give me some ideas on how to fix this?

strcspn can be used to find either the next double quote or the next comma or double quote.
The original string is not modified, so string literals can be used.
The position of the double quotes is not significant. They can be in any field.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main( void) {

    char *string[] = {
        "Jim,Hunter,9239234245,\"8/1 Hill Street, New Hampshire\""
        , "Jay,Rooney,92364434245,\"122 McKay Street, Old Town\""
        , "Ray,Bundy,923912345,NOT SPECIFIED"
        , "Ray,Bundy,\" double quote here\",NOT SPECIFIED"
    };

    for ( int each = 0; each < 4; ++each) {
        char *token = string[each];
        char *p = string[each];

        while ( *p) {
            if ( '\"' == *p) {//at a double quote
                ++p;//step past the opening double quote
                p += strcspn ( p, "\"");//advance to the closing double quote
                if ( *p) {//found one, so the quote is terminated
                    ++p;//include the closing double quote in the token
                }
            }
            else {
                p += strcspn ( p, ",\"");//advance to a comma or double quote
            }
            int span = ( int)( p - token);
            if ( span) {
                printf ( "token:%.*s\n", span, token);//print span characters

                //copy to another array
            }
            if ( *p) {//not at terminating zero
                ++p;//do not skip consecutive delimiters

                token = p;//start of next token
            }
        }
    }
    return 0;
}

EDIT: copy to variables
A counter can be used to keep track of fields as they are processed.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZENAME 21
#define SIZEID 11
#define SIZEADDR 151

typedef struct {
    char name[SIZENAME];
    char surname[SIZENAME];
    char uniqueId[SIZEID];
    char address[SIZEADDR];
} employee_t;

int main( void) {

    char *string[] = {
        "Jim,Hunter,9239234245,\"8/1 Hill Street, New Hampshire\""
        , "Jay,Rooney,92364434245,\"122 McKay Street, Old Town\""
        , "Ray,Bundy,923912345,NOT SPECIFIED"
        , "Ray,Bundy,\"quote\",NOT SPECIFIED"
    };
    employee_t *employees = malloc ( sizeof *employees * 4);
    if ( ! employees) {
        fprintf ( stderr, "problem malloc\n");
        return 1;
    }

    for ( int each = 0; each < 4; ++each) {
        char *token = string[each];
        char *p = string[each];
        int field = 0;

        while ( *p) {
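            if ( '\"' == *p) {//at a double quote
                ++p;//step past the opening double quote
                p += strcspn ( p, "\"");//advance to the closing double quote
                if ( *p) {//found one, so the quote is terminated
                    ++p;//include the closing double quote in the token
                }
            }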
            else {
                p += strcspn ( p, ",\"");//advance to a delimiter
            }
            int span = ( int)( p - token);
            if ( span) {
                ++field;
                if ( 1 == field) {
                    if ( span < SIZENAME) {
                        strncpy ( employees[each].name, token, span);
                        employees[each].name[span] = 0;
                        printf ( "copied:%s\n", employees[each].name);//print span characters
                    }
                }
                if ( 2 == field) {
                    if ( span < SIZENAME) {
                        strncpy ( employees[each].surname, token, span);
                        employees[each].surname[span] = 0;
                        printf ( "copied:%s\n", employees[each].surname);//print span characters
                    }
                }
                if ( 3 == field) {
                    if ( span < SIZEID) {
                        strncpy ( employees[each].uniqueId, token, span);
                        employees[each].uniqueId[span] = 0;
                        printf ( "copied:%s\n", employees[each].uniqueId);//print span characters
                    }
                }
                if ( 4 == field) {
                    if ( span < SIZEADDR) {
                        strncpy ( employees[each].address, token, span);
                        employees[each].address[span] = 0;
                        printf ( "copied:%s\n", employees[each].address);//print span characters
                    }
                }
            }
            if ( *p) {//not at terminating zero
                ++p;//do not skip consecutive delimiters

                token = p;//start of next token
            }
        }
    }
    free ( employees);
    return 0;
}
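
Note that the code above stores a quoted field with its double quotes still attached, so the address is saved as "8/1 Hill Street, New Hampshire" with the quotes included. If that is not wanted, a small helper could drop them while copying. The following is only a sketch: copy_field is a hypothetical name that does not appear in the answer, and it assumes at most one pair of enclosing quotes per field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of the original answer): copy `span`
 * characters starting at `token` into `dst`, which holds `size` bytes,
 * dropping one pair of enclosing double quotes if present.
 * Returns 1 on success, 0 if the field does not fit. */
int copy_field(char *dst, size_t size, const char *token, size_t span)
{
    assert(dst != NULL && token != NULL && size > 0);

    /* strip the quotes only when the field both starts and ends with one */
    if (span >= 2 && token[0] == '"' && token[span - 1] == '"') {
        ++token;
        span -= 2;
    }
    if (span >= size) {
        dst[0] = '\0';          /* leave a valid empty string on failure */
        return 0;
    }
    memcpy(dst, token, span);
    dst[span] = '\0';           /* memcpy does not add the terminator */
    return 1;
}
```

In the loop above, each strncpy plus terminator pair could then be replaced by a call such as copy_field ( employees[each].address, SIZEADDR, token, ( size_t)span).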

In my view, this kind of problem calls for a "proper" tokenizer, perhaps based on a finite state machine (FSM). In this case you'd scan the input string character by character, assigning each character to a class. The tokenizer would start in a particular state and, according to the class of the character read, it might stay in the same state or move to a new state. That is, the state transitions are controlled by the combination of the current state and the character under consideration.

For example, if you read a double quote in the starting state, you transition to the "in a quoted string" state. In that state, a comma would not cause a transition to a new state; it would just get added to the token you're building. In any other state, the comma would have a particular significance, denoting the end of a token. You'd have to figure out when you needed to swallow additional whitespace between tokens, whether there was some "escape" that allowed a double quote to be used inside a token, whether you could escape the end-of-line to make longer lines, and so on.
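
To make the idea concrete, here is a minimal sketch of such a tokenizer for a single line. It is one possible reading of the description, not a definitive implementation. The names split_csv_line, MAX_FIELDS and FIELD_SIZE are made up for illustration, and the sketch assumes that a doubled double quote ("") inside a quoted field stands for one literal quote, that whitespace is kept as-is, and that there is no line continuation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_FIELDS 8      /* illustrative limits, not from the question */
#define FIELD_SIZE 160

enum state { FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED };

/* Split one CSV line into fields[0..n-1] and return n, or -1 if there are
 * too many fields, a field is too long, or a quote is never closed. */
int split_csv_line(const char *line, char fields[MAX_FIELDS][FIELD_SIZE])
{
    assert(line != NULL && fields != NULL);
    enum state st = FIELD_START;
    int n = 0;          /* index of the field being built */
    size_t len = 0;     /* characters stored in fields[n] so far */

    for (const char *p = line; ; ++p) {
        char c = *p;
        int at_end = (c == '\0' || c == '\n' || c == '\r');

        if (st == QUOTED) {
            if (c == '\0') return -1;             /* unterminated quote */
            if (c == '"') { st = QUOTE_IN_QUOTED; continue; }
            /* anything else, commas included, is ordinary text here */
        } else if (st == QUOTE_IN_QUOTED && c == '"') {
            st = QUOTED;                          /* "" is a literal quote */
        } else if (c == ',' || at_end) {
            fields[n][len] = '\0';                /* the field is complete */
            ++n;
            len = 0;
            if (at_end) return n;
            if (n == MAX_FIELDS) return -1;
            st = FIELD_START;
            continue;
        } else if (st == FIELD_START && c == '"') {
            st = QUOTED;                          /* opening quote, not stored */
            continue;
        } else {
            st = UNQUOTED;
        }
        if (len + 1 >= FIELD_SIZE) return -1;     /* keep room for the '\0' */
        fields[n][len++] = c;
    }
}
```

Called on a line read with fgets, the function fills one row of the array per field, with the enclosing quotes already removed, so each row can be copied into the matching struct member after a length check.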

The important point is that, if you implement this as an FSM (or another, real tokenizer), you actually can consider all these things and implement them as you need. If you use ad-hoc applications of strtok() and string searching, you can't, at least not in an elegant, maintainable way.

And if, one day, you end up needing to do the whole job using wide characters, that's easy: just convert the input into wide characters and iterate over it one wide character (not byte) at a time.
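
As a rough illustration of that conversion step, the sketch below uses the standard mbstowcs function and then walks the result one wchar_t at a time. The function name count_fields_wide and the WIDE_MAX limit are assumptions made for this example. For non-ASCII input the caller would also need to select a suitable locale first, for instance with setlocale ( LC_ALL, ""), which the sketch leaves out.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

#define WIDE_MAX 256    /* illustrative buffer size */

/* Count the comma-separated fields of a multibyte line after converting
 * it to wide characters. Returns -1 if the conversion fails, the line is
 * too long for the buffer, or a double quote is left open. */
int count_fields_wide(const char *line)
{
    assert(line != NULL);
    wchar_t wide[WIDE_MAX];
    size_t n = mbstowcs(wide, line, WIDE_MAX);

    /* (size_t)-1 signals an invalid sequence; n == WIDE_MAX means the
     * buffer filled up before the end of the input was reached */
    if (n == (size_t)-1 || n == WIDE_MAX) return -1;

    int in_quotes = 0;
    int fields = 1;
    for (size_t i = 0; i < n; ++i) {        /* one wide character per step */
        if (wide[i] == L'"') {
            in_quotes = !in_quotes;
        } else if (wide[i] == L',' && !in_quotes) {
            ++fields;
        }
    }
    return in_quotes ? -1 : fields;
}
```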

It's easy to document the behaviour of an FSM parser using a state transition diagram; easier, at least, than trying to explain it by documenting code in text.

My experience is that the first time somebody implements an FSM tokenizer, it's horrible. After that, it's easy. And once you know the method, you can use the same technique to parse input of much greater complexity.
