
fscanf with whitespaces as separators - what format should I use?

I have a text file whose lines look like this:

[7 chars string][whitespace][5 chars string][whitespace][integer]

I want to use fscanf() to read all of these into memory, but I'm confused about which format string I should use.

Here's an example of such line:

hello   box   94324

Notice the padding whitespace inside each field, in addition to the separating whitespace.

Edit: I know about the recommendation to use fgets() first, I cannot use it here.

Edit: here's my code

typedef struct Product {
    char* id;   //Product ID number. This is the key of the search tree.
    char* productName;  //Name of the product.
    int currentQuantity;    //How many items are there in stock, currently. 
} Product;

int main()
{
    FILE *initial_inventory_file = NULL;
    Product product = { NULL, NULL, 0 };

    //open file 
    initial_inventory_file = fopen(INITIAL_INVENTORY_FILE_NAME, "r");

    product.id = malloc(sizeof(char) * 10); //- Product ID: 9 digits exactly. (10 for null character)
    product.productName = malloc(sizeof(char) * 11); //- Product name: 10 chars exactly.

    //go through each line in inital inventory
    while (fscanf(initial_inventory_file, "%9c %10c %i", product.id, product.productName, &product.currentQuantity) != EOF)
    {
        printf("%9c %10c %i\n", product.id, product.productName, product.currentQuantity);
    }

    //cleanup...
    ...
}

Here's an example file (the fields are actually 9 chars, 10 chars, and an int):

022456789 box-large  1234
023356789 cart-small 1234
023456789 box        1234
985477321 dog food   2
987644421 cat food   5555
987654320 snaks      4444
987654321 crate      9999
987654322 pillows    44

Assuming your input file is well-formed, this is the most straightforward version:

char str1[8] = {0};
char str2[6] = {0};
int  val;
...
int result = fscanf( input, "%7s %5s %d", str1, str2, &val );

If result is equal to 3, you successfully read all three inputs. If it's less than 3 but not EOF, then you had a matching failure on one or more of your inputs. If it's EOF, you've either hit the end of the file or there was an input error; use feof( input ) to test for end of file at that point.
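The three outcomes described above can be told apart in code. This is only a minimal sketch and not part of the original answer; the enum scan_status and the function scan_fields are hypothetical names introduced here for illustration.

```c
#include <stdio.h>

/* Hypothetical status codes, introduced only for this sketch. */
enum scan_status { SCAN_OK, SCAN_MISMATCH, SCAN_END, SCAN_IO_ERROR };

/* Reads one record and reports which of the cases described above occurred. */
enum scan_status scan_fields(FILE *input, char str1[8], char str2[6], int *val)
{
    int result = fscanf(input, "%7s %5s %d", str1, str2, val);

    if (result == 3)
        return SCAN_OK;        /* all three inputs were read */
    if (result != EOF)
        return SCAN_MISMATCH;  /* 0, 1 or 2 inputs were read */

    /* EOF: either end of file or an input error. */
    return feof(input) ? SCAN_END : SCAN_IO_ERROR;
}
```

A caller would typically loop while scan_fields() returns SCAN_OK and then inspect the final status to decide whether the file simply ended or something went wrong.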

If you can't guarantee your input file is well-formed (which most of us can't), you're better off reading in the entire line as text and parsing it yourself. You said you can't use fgets, but there's a way to do it with fscanf:

char buffer[128]; // or whatever size you think would be appropriate to read a line at a time

/**
 * " %127[^\n]" tells scanf to skip over leading whitespace, then read
 * up to 127 characters or until it sees a newline character, whichever
 * comes first; the newline character is left in the input stream.
 */
if ( fscanf( input, " %127[^\n]", buffer ) == 1 )
{
  // process buffer
}

You can then parse the input buffer using sscanf:

int result = sscanf( buffer, "%7s %5s %d", str1, str2, &val );
if ( result == 3 )
{
  // process inputs
}
else
{
  // handle input error
}

or by some other method.
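Put together, the two steps (read a line with fscanf, then parse it with sscanf) might look like the sketch below. The function name count_records is hypothetical, and the sketch assumes that no line is longer than 127 characters; a longer line would be split across several reads.

```c
#include <stdio.h>

/* Reads `input` line by line. Returns the number of well-formed records
   and stores the number of rejected lines in *bad. */
int count_records(FILE *input, int *bad)
{
    char buffer[128];
    char str1[8], str2[6];
    int val;
    int good = 0;

    *bad = 0;
    /* The leading space skips the newline left over from the previous
       line (and any blank lines) before the next line is read. */
    while (fscanf(input, " %127[^\n]", buffer) == 1) {
        if (sscanf(buffer, "%7s %5s %d", str1, str2, &val) == 3)
            good++;      /* process inputs */
        else
            (*bad)++;    /* reject this line only; the next one is unaffected */
    }
    return good;
}
```

Because each line is parsed in isolation, a bad record cannot spill over into the one after it.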

EDIT

Edge cases to watch out for:

  1. Missing one or more inputs per line
  2. Malformed input (such as non-numeric text in the integer field)
  3. More than one set of inputs per line
  4. Strings that are longer than 7 or 5 characters
  5. Value too large to store in an int
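One possible way to catch all five cases once a line is in memory is sketched below. The function parse_record is a hypothetical name, and the approach (%n to record positions, strtol for the range check) is just one option among several, not something taken from the original answer.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if `line` holds exactly one well-formed record, 0 otherwise. */
int parse_record(const char *line, char str1[8], char str2[6], int *val)
{
    char num[32];
    int end1 = 0, end2 = 0, end3 = 0;
    char *stop;
    long n;

    /* %n stores the number of characters consumed so far; it does not
       count towards the return value. */
    if (sscanf(line, "%7s%n %5s%n %31s %n",
               str1, &end1, str2, &end2, num, &end3) != 3)
        return 0;                       /* case 1: missing input */

    /* Each string must be followed by whitespace; otherwise the width
       limit cut it short. */
    if (!isspace((unsigned char)line[end1]) ||
        !isspace((unsigned char)line[end2]))
        return 0;                       /* case 4: string too long */

    if (line[end3] != '\0')
        return 0;                       /* case 3: extra text on the line */

    errno = 0;
    n = strtol(num, &stop, 10);
    if (stop == num || *stop != '\0')
        return 0;                       /* case 2: not a number */
    if (errno == ERANGE || n < INT_MIN || n > INT_MAX)
        return 0;                       /* case 5: does not fit in an int */

    *val = (int)n;
    return 1;
}
```

For instance, "hello   box   94324" is accepted, while "hello box 12r4", "hello box 1 extra" and "helloworld box 1" are all rejected.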

EDIT 2

The reason most of us don't recommend fscanf is that it sometimes makes error detection and recovery difficult. For example, suppose you have the input records

foo     bar    123r4
blurga  blah   5678

and you read it with fscanf( input, "%7s %5s %d", str1, str2, &val );. fscanf will read 123 and assign it to val, leaving r4 in the input stream. On the next call, r4 will get assigned to str1, the first five characters of blurga (blurg) will get assigned to str2, and you'll get a matching failure on the leftover a. Every later record is then read out of step. Ideally you'd like to reject the whole first record, but by the time you know there's a problem it's too late.

If you read it as a string first, you can parse and check each field, and if any of them are bad, you can reject the whole thing.
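To watch fscanf lose its place in the input, the calls can be traced one by one. The sketch below is only an illustration; struct attempt and trace_fscanf are hypothetical names.

```c
#include <stdio.h>

/* What a single fscanf call returned and stored. */
struct attempt {
    int  result;
    char str1[8];
    char str2[6];
    int  val;
};

/* Calls fscanf repeatedly (at most `max` times), recording every call
   until it returns EOF. Returns the number of calls recorded. */
int trace_fscanf(FILE *input, struct attempt *out, int max)
{
    int n = 0;

    while (n < max) {
        struct attempt a = { 0, "", "", 0 };

        a.result = fscanf(input, "%7s %5s %d", a.str1, a.str2, &a.val);
        if (a.result == EOF)
            break;
        out[n++] = a;
    }
    return n;
}
```

Run on the two records above, the first call stores foo, bar and 123. The second call returns 2 after storing r4 and blurg (%5s stops after five characters). A third call then returns 3 with a, blah and 5678, which looks like a valid record but is not one that exists in the file.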

The issue in your code with the "%9c ..." format is that %9c does not write the string-terminating null character. So your string is probably filled with garbage and not terminated at all, which leads to undefined behaviour when printing it out with printf. (The printf call also needs %s rather than %c, because it is passed pointers to strings, not single characters.)

If you set the complete content of the strings to 0 before the first scan, it should work as intended. To achieve this, you can use calloc instead of malloc; this will initialise the memory with 0.

Note that the code also has to consume the newline character at the end of each line, because %c, unlike %s or %i, does not skip leading whitespace; otherwise the newline would be stored as the first character of the next ID. This is solved by an additional fscanf(f,"%*c") statement (the * indicates that the value is consumed but not stored in a variable). This works only if there is no other whitespace between the last digit and the newline character:

int main()
{
    FILE *initial_inventory_file = NULL;
    Product product = { NULL, NULL, 0 };

    //open file
    initial_inventory_file = fopen(INITIAL_INVENTORY_FILE_NAME, "r");
    if (initial_inventory_file == NULL)
        return 1; //could not open the inventory file

    //calloc takes (element count, element size) and zero-fills the memory
    product.id = calloc(10, sizeof(char)); //- Product ID: 9 digits exactly. (10 for null character)
    product.productName = calloc(11, sizeof(char)); //- Product name: 10 chars exactly. (11 for null character)
    if (product.id == NULL || product.productName == NULL)
        return 1; //out of memory

    //go through each line in initial inventory
    while (fscanf(initial_inventory_file, "%9c %10c %i", product.id, product.productName, &product.currentQuantity) == 3)
    {
        printf("%9s %10s %i\n", product.id, product.productName, product.currentQuantity);
        fscanf(initial_inventory_file,"%*c"); //consume the newline left after the integer
    }

    //cleanup...
}
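A variation that tolerates trailing whitespace is to start the format with a space, which skips any whitespace (including the previous line's newline) before the ID, and to terminate the strings explicitly instead of relying on zeroed memory. This is a sketch under assumptions: read_product is a hypothetical helper, and it uses %d so that a quantity with a leading zero is not read as octal, which is what %i would do.

```c
#include <stdio.h>

/* Reads one "9-char id, 10-char name, quantity" line.
   Returns 1 on success, 0 at end of file or on malformed input. */
int read_product(FILE *f, char id[10], char name[11], int *quantity)
{
    /* The leading space consumes leftover whitespace from the previous
       line, so no separate "%*c" call is needed. */
    if (fscanf(f, " %9c %10c %d", id, name, quantity) != 3)
        return 0;

    /* %c never writes a terminator, so add one by hand. */
    id[9] = '\0';
    name[10] = '\0';
    return 1;
}
```

Note that the name keeps its padding: for the line "023456789 box        1234" the name is "box" followed by seven spaces.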

Let's assume the input is

<LWS>* <first> <LWS>+ <second> <LWS>+ <integer>

where <LWS> is any whitespace character, including newlines; <first> has one to seven non-whitespace characters; <second> has one to five non-whitespace characters; <integer> is an optionally signed decimal integer, as converted by %d (use %i instead if the field may also be written in hexadecimal with a 0x or 0X prefix, or in octal with a leading 0); * indicates zero or more of the preceding element; and + indicates one or more of the preceding element.

Let's say you have a structure,

struct record {
    char first[8];  /* 7 characters + end-of-string '\0' */
    char second[6]; /* 5 characters + end-of-string '\0' */
    int  number;
};

then you can read the next record from the stream in into a structure supplied by the caller using, for example,

#include <stdlib.h>
#include <stdio.h>

/* Read a record from stream 'in' into *'rec'.
   Returns: 0 if success
           -1 if invalid parameters
           -2 if read error
           -3 if non-conforming format
           -4 if bug in function
           +1 if end of stream (and no data read)
*/
int read_record(FILE *in, struct record *rec)
{
    int rc;

    /* Invalid parameters? */
    if (!in || !rec)
        return -1;

    /* Try scanning the record. */
    rc = fscanf(in, " %7s %5s %d", rec->first, rec->second, &(rec->number));

    /* All three fields converted correctly? */
    if (rc == 3)
        return 0; /* Success! */

    /* Only partially converted? */
    if (rc > 0)
        return -3;

    /* Read error? */
    if (ferror(in))
        return -2;

    /* End of input encountered? */
    if (feof(in))
        return +1;

    /* Must be a bug somewhere above. */
    return -4;
}

The conversion specifier %7s converts up to seven non-whitespace characters, and %5s up to five; the array (or char pointer) must have room for an additional end-of-string nul byte, '\0', which the scanf() family of functions adds automatically.

If you do not specify the length limit and use plain %s, the input can overrun the destination buffer. This is a common cause of buffer overflow bugs.
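One way to keep the width in the format string in step with the buffer size is to build the format from a macro. This is only a sketch of a common preprocessor idiom; FIRST_LEN, STR, STR_ and read_first are hypothetical names.

```c
#include <stdio.h>

#define FIRST_LEN 7            /* maximum characters, excluding the '\0' */
#define STR_(x) #x             /* turns its argument into a string literal */
#define STR(x)  STR_(x)        /* expands the macro first, then stringifies */

/* Reads one whitespace-delimited word of at most FIRST_LEN characters.
   Returns the sscanf result: 1 on success, EOF if there was no word. */
int read_first(const char *text, char out[FIRST_LEN + 1])
{
    /* Adjacent string literals are joined, so this is "%7s". */
    return sscanf(text, "%" STR(FIRST_LEN) "s", out);
}
```

With this, an over-long word such as abcdefghijkl is cut to abcdefg instead of overflowing the array, and changing FIRST_LEN updates both the array size and the format.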

The return value from the scanf() family of functions is the number of successful conversions (possibly 0), or EOF if an input failure (end of stream or a read error) occurs before the first conversion. Above, we need three conversions to fully scan a record. If we scan just 1 or 2, we have a partial record. Otherwise, we check if a stream error occurred, by checking ferror(). (Note that you want to check ferror() before feof(), because an error condition may also set feof().) If not, we check if the scanning function encountered end-of-stream before anything was converted, using feof().

If none of the above cases were met, then the scanning function returned zero or a negative value with neither ferror() nor feof() returning true. Because the scanning pattern starts with (whitespace and) a %s conversion, which can fail only by running out of input, it should never return zero. The only other nonpositive return value from the scanf() family of functions is EOF, which should cause feof() to return true. So, if none of the above cases were met, there must be a bug in the code, triggered by some odd corner case in the input.
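The difference between a return value of 0 and EOF is easiest to see with sscanf on small strings. The helpers try_int and try_word below are hypothetical and exist only to expose the return value.

```c
#include <stdio.h>

/* Returns what sscanf reports when asked for one integer. */
int try_int(const char *text)
{
    int value;
    return sscanf(text, "%d", &value);
}

/* Returns what sscanf reports when asked for one word of up to 7 characters. */
int try_word(const char *text)
{
    char word[8];
    return sscanf(text, " %7s", word);
}
```

Here try_int("abc") returns 0, because input is available but does not match, whereas try_int("") returns EOF, because the input ran out before anything was converted. try_word() never returns 0: any non-whitespace character satisfies %s, so it returns either 1 or, for empty or all-whitespace input, EOF.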

A program that reads structures from some stream into a dynamically allocated buffer typically implements the following pseudocode:

Set ptr = NULL  # Dynamically allocated array
Set num = 0     # Number of entries in array
Set max = 0     # Number of entries allocated for in array

Loop:

    If (num >= max):
        Calculate new max; num + 1 or larger
        Reallocate ptr
        If reallocation failed:
            Report out of memory
            Abort program
        End if
    End if

    rc = read_record(stream, ptr + num)
    If rc == 1:
        Break out of loop
    Else if rc != 0:
        Report error (based on rc)
        Abort program
    End if

    Set num = num + 1   # Record stored; advance to the next free slot
End Loop
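In C, the loop above might be written roughly as follows. This is a sketch with a few assumptions: read_all is a hypothetical name, read_record is repeated in condensed form so that the block stands alone, the array doubles in size starting from 16 entries, and failures are reported by returning -1 rather than by aborting the program.

```c
#include <stdio.h>
#include <stdlib.h>

struct record {
    char first[8];
    char second[6];
    int  number;
};

/* Condensed read_record(): 0 on success, +1 at end of stream,
   negative on error. */
static int read_record(FILE *in, struct record *rec)
{
    int rc = fscanf(in, " %7s %5s %d", rec->first, rec->second, &rec->number);

    if (rc == 3)    return 0;
    if (rc > 0)     return -3;
    if (ferror(in)) return -2;
    if (feof(in))   return +1;
    return -4;
}

/* Reads every record from `in` into a dynamically allocated array.
   Returns the number of records and stores the array in *out (the caller
   frees it), or returns -1 on a read error or when out of memory. */
long read_all(FILE *in, struct record **out)
{
    struct record *ptr = NULL;  /* dynamically allocated array */
    size_t num = 0;             /* entries in use */
    size_t max = 0;             /* entries allocated */

    for (;;) {
        if (num >= max) {
            size_t new_max = max ? 2 * max : 16;
            struct record *tmp = realloc(ptr, new_max * sizeof *tmp);

            if (!tmp) {
                free(ptr);
                return -1;
            }
            ptr = tmp;
            max = new_max;
        }

        int rc = read_record(in, ptr + num);
        if (rc == 1)
            break;              /* end of stream */
        if (rc != 0) {
            free(ptr);
            return -1;          /* malformed record or read error */
        }
        num++;
    }

    *out = ptr;
    return (long)num;
}
```

Growing the array geometrically keeps the number of realloc() calls small even for large files.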

Have you tried the format specifiers?

char seven[8] = {0};
char five[6] = {0};
int myInt = 0;

// loop here
fscanf(fp, "%s %s %d", seven, five, &myInt);
// save to structure / do whatever you want

If you're sure that the formatting is consistent and the strings always have a fixed length, you could also iterate over the input character by character (using something like fgetc()) and process it manually. The example above can overflow its buffers, which is undefined behaviour and often shows up as a segmentation fault, if a string in the file exceeds 7 or 5 characters; adding field widths, as in "%7s %5s %d", avoids that.

EDIT Manual Scanning Loop:

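char seven[8] = {0};
char five[6] = {0};
int myInt = 0;

// loop this part
// NOTE: fgetc() returns EOF at end of file or on a read error; real code should check for it
for (int i = 0; i < 7; i++) {
    seven[i] = fgetc(fp);
}
// consume the separating space; the call is kept outside assert() because
// assert() expands to nothing when NDEBUG is defined, which would drop the read
int sep = fgetc(fp);
assert(sep == ' ');
for (int i = 0; i < 5; i++) {
    five[i] = fgetc(fp);
}
sep = fgetc(fp); // consume space
assert(sep == ' ');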
fscanf(fp, "%d", &myInt);
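A version of this manual loop that checks for end of file and rejects short lines could look like the sketch below. The names read_field and read_fixed are hypothetical, and the layout assumed is exactly 7 characters, a space, 5 characters, a space, and an integer.

```c
#include <stdio.h>

/* Copies exactly `width` characters into dst and terminates the string.
   Returns 1 on success, 0 if the line or the file ends too early. */
static int read_field(FILE *fp, char *dst, int width)
{
    for (int i = 0; i < width; i++) {
        int c = fgetc(fp);          /* int, so that EOF can be detected */

        if (c == EOF || c == '\n')
            return 0;
        dst[i] = (char)c;
    }
    dst[width] = '\0';
    return 1;
}

/* Reads one fixed-width record. Returns 1 on success, 0 otherwise. */
int read_fixed(FILE *fp, char seven[8], char five[6], int *value)
{
    int c;

    if (!read_field(fp, seven, 7)) return 0;
    if (fgetc(fp) != ' ')          return 0;   /* separating space */
    if (!read_field(fp, five, 5))  return 0;
    if (fgetc(fp) != ' ')          return 0;   /* separating space */
    if (fscanf(fp, "%d", value) != 1) return 0;

    /* Discard the rest of the line so the next call starts at column 0. */
    while ((c = fgetc(fp)) != EOF && c != '\n') {
        /* nothing to do */
    }
    return 1;
}
```

Unlike %s, this keeps the padding inside each field, so the example line yields "hello" followed by two spaces and "box" followed by two spaces.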


 