简体   繁体   中英

to read huge data from file and parse the date in efficient way. How to improve the performance for huge data?

I am reading huge data from a file as:

//abc.txt

10  12  14  15  129

-12 14 -18  -900 -1234

145 12 13
12

32 68 51 76 -59 -025 

- - - - etc

fun(char *p, int x, int y, int z) {

}

I have tried to use atoi , strtok , but they are real time consuming when the array is too huge and the sscanf is also very slow.

How can I improve the performance for huge data?

I am using strtok for parsing. I am looking for fast method to parse each line.

I am reading as each line and then parsing each line as:

 char * ptr;
 ptr = strtok (str," ");
 while (ptr != NULL)
 {
    int value1 = atoi(ptr) ;
    ptr = strtok (NULL, " ");
 }
  • Is there any fast way to parse the string into int ?
  • Is there any alternate approach which would be faster then above code? I am using atoi to convert char * to int .
  • Can I use other fast method to convert char * to int ?

To convert an ASCII string to an integer value, you cannot get much faster than what atoi is doing, but you may be able to speed it up by implementing a conversion function that you use inline. The version below increments the pointer past the digits scanned, so it doesn't match atoi semantics, but it should help improve parser efficiency, illustrated below. (Error checking is obviously lacking, so add it if you need it.)

static inline int my_parsing_atoi(const char *&s) {
    if (s) {
        bool neg = false;
        int val = 0;
        if (*s == '-') { neg = true; ++s; }
        for (;isdigit(*s);++s) val = 10*val + (*s - '0');
        return neg ? -val : val;
    }
    return 0;
}

const char *p = input_line;
if (p) {
    p += strspn(p, " ");
    while (*p) {
        int value1 = my_parsing_atoi(p);
        p += strspn(p, " ");
    }
}

Make sure you have profiled your code properly so that you know that your routine is compute bound and not I/O bound. Most of the time, you will be I/O bound, and the suggestions below are ways to mitigate it.

If you are using the C or C++ file reading routines, such as fread or fstream , you should be getting buffered reads which should already be pretty efficient, but you can try to use underlying OS calls, such as POSIX read , to read the files in larger blocks at a time to speed up file reading efficiency. To be really fancy, you can perform an asynchronous read of your file while you are processing it, either by using threads, or by using aio_read . You can even use mmap , and that will remove some data copying overhead, but if the file is extremely large, you will need to manage the map so that you munmap the portions of the file that have been scanned and mmap in the new portion to be scanned.

I benchmarked my parse routine above and the OP's routine using code that looked like this:

clock_t before_real;
clock_t after_real;
struct tms before;
struct tms after;
std::vector<char *> numbers;
make_numbers(numbers);
before_real = times(&before);
for (int i = 0; i < numbers.size(); ++i) {
    parse(numbers[i]);
}
after_real = times(&after);
std::cout << "user: " << after.tms_utime - before.tms_utime
          << std::endl;
std::cout << "real: " << after_real - before_real
          << std::endl;

The difference between real and user is that real is wall clock time, while user is actual time spent by the OS running the process (so context switches are not counted against the running time).

My tests had my routine running almost twice as fast as the OP's routine (compiled with g++ -O3 on a 64 bit Linux system).

You are looking in the wrong place. It isn't the parsing that is the issue, unless you are doing something truly bizarre. On a modern N Ghz CPU the cycle needed per line are tiny. What kills performance is physical I/O. Spinning stuff tends to run at 10s / sec.

I also doubt that the issue is the physical read of the file as this will be efficiently cached in the file system cache.

No, as samy.vilar hints, the issue is almost certainly a virtual memory one:

...the array is too huge...

Use the system monitor/psinfo/top to look at your app. Almost certainly it is growing a large working set as it builds up an in-memory array and your OS is paging this to disk.

So forget reading as an issue. Your real issue is how to manipulate huge data sets in memory. The approaches here are various:

  • Don't. Batch up the data and manipulate batches.
  • Use space-efficient storage (eg compact elements).
  • Allocate more memory resources.

There are many discussions around this on SO.

If your file is truly huge, then the IO is what's killing you, and not the parsing. Every time you read a line, you're executing a system call, which can be quite expensive.

A more efficient alternative may be to use Memory-Mapped File IO . If you're working on a POSIX system such as Linux, you can use the mmap command which loads the file all at once and returns a pointer to its location in memory. The memory manager then takes care of reading and swapping the file in/out as you access data through that pointer.

This would look something like this

#include <sys/mman.h>
int fd = open( 'abc.txt' , O_RDONLY );
char *ptr = mmap( NULL , length , PROT_READ , MAP_PRIVATE , fd , 0 );

but I would strongly advise you to read the man page and find the best options for yourself.

  1. If your file contains int numbers, you can use operator>> , but this is c++ only solution. Something like :

     std::fstream f("abc.txt"); int value = 0; f >> value 
  2. If you convert your file to contain binary numbers representation, you will have more options to improve performances. Not only it avoid parsing numbers from string into type, but also you have another options to access your data (like for example using mmap ).

First of all, a general recommendation is to always use profiling to check that it actually is the translation that is slow, and not something else, such as reading the file physically from disk.

You may be able to improve the performance by writing your own, minimal number-parsing function. strtok modifies the string, so it will not be optimally fast, and if you know that all the numbers are decimal integers, and you don't need any error checking, you can simplify the translation a bit.

Some code without strtok that may speed up the processing of a line, if it actually is the translation and not (for example) I/O that is the problem.

void handle_one_number(int number) {
    // ....
}

void handle_numbers_in_buffer(char *buffer) {
    while (1) {
        while (*buffer != '\0' && isspace(*buffer))
            ++buffer;
        if (*buffer == '\0')
            return;
        int negative = 0;
        if (*buffer == '-') {
            negative = 1;
            ++buffer;
        }
        int number = 0;
        while (isdigit(*buffer)) {
            number = number * 10 + *buffer - '0';
            ++buffer;
        }
        if (negative)
            number = -number;
        handle_one_number(number);
    }
}

I actually went and ran some benchmarks. I had expected the I/O to be dominant, but it turns out that (with the usual caveat about "on my particular system, with my particular compiler") the parsing of numbers takes quite a lot of time.

By changing from the strtok version to my code above I managed to improve the time for the translation of 100 million numbers (with the text already in memory) from 5.2 seconds to around 1.1 second. When reading from a slow disk (Caviar Green) I measured an improvement from 5.9 seconds to 3.5 seconds. When reading from an SSD, I measured an improvement from 5.8 to 1.8 seconds.

I also tried reading the file directly, using while (fscanf(f, "%d", ....) == 1) .... , but that turned out to be much slower (10 seconds), probably since fscanf is thread-safe, and more calls require more locking.

(GCC 4.5.2 on Ubuntu 11.04 with -O2 optimization, several executions of each version, flushing disk caches between runs, i7 processor.)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM