
Read a file line by line in C, mostly with syscalls

I'm trying to read and parse a file line by line. I only want to use simple syscalls (read, open, close, ...) and not fgets or getc, because I want to learn the fundamentals. (I looked at some answers to similar questions, but they all use fgets and the like.)

Here's what I have at the moment: a function I wrote that stores up to 1024 chars from a file in a buffer.

int main(void) {
    const char *filename = "file.txt";
    int fd = open(filename, O_RDONLY);
    char *buffer = malloc(sizeof (char) * 1024); 

    read(fd, buffer, 1024);        
    printf("%s", buffer);
    close(fd);
    free(buffer);    
}

How does one stop at a '\n', for instance? I know that once I know where to stop, I can use lseek with the right offset to continue reading the file from where I stopped.

I do not wish to store the whole file in my buffer and then parse it. I want to add a line in my buffer, then parse that line and realloc my buffer and keep on reading the file.

I was thinking of something like this but I feel like it's badly optimized and not sure where to add the lseek afterwards:

char *line = malloc(sizeof (char) * 1024);
read(fd, buffer, 1);
int i = 0;
    while(*buffer != '\n' && *buffer != '\0'){
        line[i] = *buffer;
        ++i;
        *buffer++;
        read(fd, buffer, 1); //Assuming i < 1024 and *buffer != NULL
    }


  /* lseek somewhere after; probably should use 2 nested loops:
   ** one loop that runs until the file is completely read,
   ** another loop inside that checks whether the end of the line is reached.
   ** At the end of the second loop, lseek to where we left off.
   */

Thanks :)

EDIT: Changed the title for clarification.

If you are going to use read to read a line at a time (what fgets and getline are intended to do), you must keep track of the offset within the file after you locate each '\n'. It is then just a matter of reading a line at a time, beginning the next read at the offset that follows the current line.

I understand wanting to be able to use the low-level functions as well as fgets and getline . What you find is that you basically end up re-coding (in a less efficient way) what is already done in fgets and getline . But it is certainly good learning. Here is a short example:

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define BUFSZ 128

ssize_t readline (char *buf, size_t sz, char *fn, off_t *offset);

int main (int argc, char **argv) {

    if (argc < 2) return 1;

    char line[BUFSZ] = {0};
    off_t offset = 0;
    ssize_t len = 0;
    size_t i = 0;

    /* using open/read, read each line in file into 'line' */
    while ((len = readline (line, BUFSZ, argv[1], &offset)) != -1)
        printf (" line[%2zu] : %s (%zd chars)\n", i++, line, len);

    return 0;
}

/* read up to 'sz - 1' bytes from file 'fn' beginning at file
   'offset', storing all chars in 'buf', where 'buf' is terminated
   at the first newline found. On success, returns the number of
   characters in the line, -1 on error or EOF with 0 chars read.
 */
ssize_t readline (char *buf, size_t sz, char *fn, off_t *offset)
{
    if (sz < 2) return -1;  /* need room for 1 char + nul */

    int fd = open (fn, O_RDONLY);
    if (fd == -1) {
        fprintf (stderr, "%s() error: file open failed '%s'.\n",
                __func__, fn);
        return -1;
    }

    ssize_t nchr = 0;
    ssize_t idx = 0;
    char *p = NULL;

    /* position fd & read line, leaving one byte free in 'buf'
       so the nul written below never lands past its end */
    if ((nchr = lseek (fd, *offset, SEEK_SET)) != -1)
        nchr = read (fd, buf, sz - 1);
    close (fd);

    if (nchr == -1) {   /* read error   */
        fprintf (stderr, "%s() error: read failure in '%s'.\n",
                __func__, fn);
        return nchr;
    }

    /* end of file - no chars read
    (not an error, but return -1 )*/
    if (nchr == 0) return -1;

    p = buf;    /* check each char */
    while (idx < nchr && *p != '\n') p++, idx++;
    *p = 0;

    if (idx == nchr) {  /* newline not found  */
        *offset += nchr;

        /* check file missing newline at end */
        return nchr < (ssize_t)sz ? nchr : 0;
    }

    *offset += idx + 1;

    return idx;
}

Example Input

The following datafiles are identical except the second contains a blank line between each line of text.

$ cat dat/captnjack.txt
This is a tale
Of Captain Jack Sparrow
A Pirate So Brave
On the Seven Seas.

$ cat dat/captnjack2.txt
This is a tale

Of Captain Jack Sparrow

A Pirate So Brave

On the Seven Seas.

Output

$ ./bin/readfile dat/captnjack.txt
 line[ 0] : This is a tale (14 chars)
 line[ 1] : Of Captain Jack Sparrow (23 chars)
 line[ 2] : A Pirate So Brave (17 chars)
 line[ 3] : On the Seven Seas. (18 chars)

$ ./bin/readfile dat/captnjack2.txt
 line[ 0] : This is a tale (14 chars)
 line[ 1] :  (0 chars)
 line[ 2] : Of Captain Jack Sparrow (23 chars)
 line[ 3] :  (0 chars)
 line[ 4] : A Pirate So Brave (17 chars)
 line[ 5] :  (0 chars)
 line[ 6] : On the Seven Seas. (18 chars)

You are essentially implementing your own version of fgets. fgets avoids character-by-character reads, even on non-seekable streams, thanks to an internal buffer associated with the FILE data structure.

Internally, fgets uses a function to fill that buffer using "raw" input-output routines. After that, fgets goes through the buffer character by character to determine the location of '\n', if any. Finally, fgets copies the content from the internal buffer into the user-supplied buffer and null-terminates the result (it stores at most one character fewer than the size it is given, so there is always room for the terminator).

In order to re-create this logic you would need to define your own FILE -like struct with a pointer to buffer and a pointer indicating the current location inside the buffer. After that you would need to define your own version of fopen , which initializes the buffer and returns it to the caller. You would also need to write your own version of fclose to free up the buffer. Once all of this is in place, you can implement your fgets by following the logic outlined above.
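
As a rough illustration of that outline, here is a minimal sketch of such a reader. The names struct line_reader, lr_init, lr_open, lr_close and lr_gets are made up for this example (they are not part of any library), and the 4096-byte buffer size is an arbitrary choice. Unlike fgets, this version drops the '\n' and returns a length, and it does not retry a read interrupted by a signal (EINTR).

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

#define LR_BUFSZ 4096   /* arbitrary size chosen for this sketch */

/* Hypothetical FILE-like struct: a descriptor plus a refillable buffer. */
struct line_reader {
    int    fd;
    size_t pos;             /* index of the next unread byte in buf */
    size_t end;             /* number of valid bytes in buf         */
    char   buf[LR_BUFSZ];
};

/* Attach an already open descriptor and start with an empty buffer. */
void lr_init(struct line_reader *lr, int fd)
{
    lr->fd = fd;
    lr->pos = lr->end = 0;
}

/* Counterpart of fopen: returns 0 on success, -1 if open(2) fails. */
int lr_open(struct line_reader *lr, const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;
    lr_init(lr, fd);
    return 0;
}

/* Counterpart of fclose. */
int lr_close(struct line_reader *lr)
{
    return close(lr->fd);
}

/* Counterpart of fgets: store the next line in 'out' without its
   '\n', always nul-terminated. Returns the number of bytes stored,
   or -1 on a read error or on EOF with nothing left. A line longer
   than sz - 1 bytes is handed back in several pieces. */
ssize_t lr_gets(struct line_reader *lr, char *out, size_t sz)
{
    size_t n = 0;

    if (sz == 0) return -1;
    while (n + 1 < sz) {
        if (lr->pos == lr->end) {          /* buffer drained: refill */
            ssize_t got = read(lr->fd, lr->buf, sizeof lr->buf);
            if (got < 0) return -1;
            if (got == 0) {                /* end of file */
                if (n == 0) return -1;
                break;
            }
            lr->pos = 0;
            lr->end = (size_t)got;
        }
        char c = lr->buf[lr->pos++];
        if (c == '\n') break;              /* newline consumed, not stored */
        out[n++] = c;
    }
    out[n] = '\0';
    return (ssize_t)n;
}
```

Because the unread bytes stay in buf between calls, no lseek is needed, so the same code also works on pipes. A caller would call lr_open(&lr, "file.txt"), loop on lr_gets(&lr, line, sizeof line) until it returns -1, and finish with lr_close(&lr).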

char *buffer = malloc(sizeof (char) * 1024); 
read(fd, buffer, 1024);        
printf("%s", buffer);

There are several errors in the above code.

First, malloc is not a syscall (and neither is perror(3)). Also, sizeof(char) is 1 by definition. If you want to use only syscalls (listed in syscalls(2)), you'll need mmap(2), and you should request virtual memory in multiples of the page size (see getpagesize(2) or sysconf(3)), which is often (but not always) 4 kilobytes. If you can use malloc, you should check for its failure, and you had better zero the buffer you obtain, so at least:

const int bufsiz = 1024;
char *buffer = malloc(bufsiz);
if (!buffer) { perror("malloc"); exit(EXIT_FAILURE); }
memset(buffer, 0, bufsiz);
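
For the syscall-only route, an anonymous mapping can stand in for malloc. The following is only a sketch: page_alloc and page_free are hypothetical helper names, it assumes a system where MAP_ANONYMOUS is available (Linux, the BSDs, macOS), and note that sysconf itself is a library function rather than a syscall. Anonymous mappings are zero-filled by the kernel, so no memset is needed.

```c
#define _DEFAULT_SOURCE     /* for MAP_ANONYMOUS under strict -std modes */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map at least 'want' bytes of zero-filled
   memory, rounded up to a whole number of pages. On success returns
   the mapping and stores its real size in '*got'; returns NULL on
   failure. Overflow of the rounding is not checked in this sketch. */
void *page_alloc(size_t want, size_t *got)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || want == 0) return NULL;

    size_t psz = (size_t)page;
    size_t len = (want + psz - 1) / psz * psz;   /* round up to pages */

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return NULL;

    *got = len;
    return p;
}

/* Release a mapping obtained from page_alloc; 'len' must be the
   size that page_alloc stored in '*got'. */
int page_free(void *p, size_t len)
{
    return munmap(p, len);
}
```

Asking for 1024 bytes this way still consumes a full page, which is one reason malloc exists: it carves many small allocations out of a few large mappings.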

Then, and more importantly, read(2) returns a count that you should always check (at least for failure):

ssize_t rdcnt = read(fd, buffer, bufsiz);
if (rdcnt<0) { perror("read"); exit(EXIT_FAILURE); }

If rdcnt is positive, you'll generally advance some pointer by rdcnt bytes. A zero count means end-of-file.
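
To make that concrete, here is a small sketch of a loop that uses the count on every iteration. read_upto is a hypothetical helper name; the retry on EINTR is an addition of this example, since read(2) may also return early when a signal arrives.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: call read(2) repeatedly until 'cap' bytes
   are stored, end-of-file is reached, or an error occurs.
   Returns the total number of bytes stored, or -1 on error. */
ssize_t read_upto(int fd, char *buf, size_t cap)
{
    size_t total = 0;

    while (total < cap) {
        ssize_t rdcnt = read(fd, buf + total, cap - total);
        if (rdcnt < 0) {
            if (errno == EINTR) continue;   /* interrupted: try again */
            return -1;                      /* real error */
        }
        if (rdcnt == 0) break;              /* end-of-file */
        total += (size_t)rdcnt;             /* advance by what we got */
    }
    return (ssize_t)total;
}
```

A short count is not an error: read may return fewer bytes than requested, particularly on pipes and terminals, which is why the loop keeps going until the buffer is full or the count is zero.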

Finally, your printf comes from <stdio.h>; you might use write(2) instead. If you do use printf, remember that it is buffered: either end the format string with a \n or call fflush(3).

If you use printf, be sure the string ends with a zero byte. One possibility is to pass bufsiz-1 to read; since the buffer was zeroed beforehand, a terminating zero byte is guaranteed.
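
If you go the write(2) route instead, the terminating zero byte stops mattering, because write takes an explicit length. A minimal sketch, where write_all and copy_fd are hypothetical helper names and the 1024-byte buffer simply mirrors the size used in the question:

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write exactly 'len' bytes, looping because
   write(2) may accept fewer bytes than requested.
   Returns 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t w = write(fd, buf, len);
        if (w < 0) {
            if (errno == EINTR) continue;   /* interrupted: try again */
            return -1;
        }
        buf += w;
        len -= (size_t)w;
    }
    return 0;
}

/* Hypothetical helper: copy everything from 'in' to 'out' in
   1024-byte chunks. Returns the number of bytes copied, or -1
   on error. */
ssize_t copy_fd(int in, int out)
{
    char buf[1024];
    ssize_t total = 0;
    ssize_t rdcnt;

    while ((rdcnt = read(in, buf, sizeof buf)) != 0) {
        if (rdcnt < 0) {
            if (errno == EINTR) continue;
            return -1;
        }
        /* Only the rdcnt bytes just read are written: no terminator. */
        if (write_all(out, buf, (size_t)rdcnt) == -1) return -1;
        total += rdcnt;
    }
    return total;
}
```

Calling copy_fd(fd, STDOUT_FILENO) would print the whole file with no stdio buffering involved, so there is nothing to flush.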

BTW, you could study the source code of a free software implementation of the C standard library, such as musl-libc or GNU libc.

Don't forget to compile with all warnings and debug info (gcc -Wall -Wextra -g), and to use the debugger (gdb), and perhaps valgrind and strace(1).
