
To read huge data from a file and parse the data in an efficient way. How to improve the performance for huge data?

I am reading huge data from a file as:

//abc.txt

10  12  14  15  129

-12 14 -18  -900 -1234

145 12 13
12

32 68 51 76 -59 -025 

- - - - etc

void fun(char *p, int x, int y, int z) {

}

I have tried to use atoi and strtok, but they are really time-consuming when the array is too huge, and sscanf is also very slow.

How can I improve the performance for huge data?

I am using strtok for parsing. I am looking for a fast method to parse each line.

I am reading each line and then parsing it as:

 char *ptr;
 ptr = strtok(str, " ");        /* str holds the current line */
 while (ptr != NULL)
 {
    int value1 = atoi(ptr);     /* convert each token to int */
    ptr = strtok(NULL, " ");
 }
  • Is there any fast way to parse the string into an int?
  • Is there any alternate approach that would be faster than the code above? I am using atoi to convert char * to int.
  • Can I use another fast method to convert char * to int?

To convert an ASCII string to an integer value, you cannot get much faster than what atoi is doing, but you may be able to speed things up by implementing a conversion function that you use inline. The version below increments the pointer past the digits scanned, so it doesn't match atoi semantics, but it should help improve parser efficiency, as illustrated below. (Error checking is obviously lacking, so add it if you need it.)

#include <cctype>    // isdigit
#include <cstring>   // strspn (used below)

// Parses one decimal integer and advances s past the digits consumed
// (unlike atoi, which leaves its argument untouched).
static inline int my_parsing_atoi(const char *&s) {
    if (s) {
        bool neg = false;
        int val = 0;
        if (*s == '-') { neg = true; ++s; }
        for (; isdigit((unsigned char)*s); ++s) val = 10*val + (*s - '0');
        return neg ? -val : val;
    }
    return 0;
}

const char *p = input_line;          // input_line: the line to be parsed
if (p) {
    p += strspn(p, " ");             // skip leading spaces
    while (*p) {
        int value1 = my_parsing_atoi(p);
        // ... use value1 ...
        p += strspn(p, " ");         // skip the spaces after the number
    }
}

Make sure you have profiled your code properly, so that you know that your routine is compute bound and not I/O bound. Most of the time you will be I/O bound, and the suggestions below are ways to mitigate that.

If you are using the C or C++ file reading routines, such as fread or fstream, you should be getting buffered reads, which should already be pretty efficient, but you can try using underlying OS calls, such as POSIX read, to read the file in larger blocks at a time and speed up file reading. To be really fancy, you can perform an asynchronous read of the file while you are processing it, either by using threads or by using aio_read. You can even use mmap, which will remove some data-copying overhead, but if the file is extremely large you will need to manage the mapping, so that you munmap the portions of the file that have been scanned and mmap the new portion to be scanned.
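As an illustration of the "larger blocks" idea, here is a minimal sketch using POSIX read; the block size and the process_block hook are assumptions made for the example, not part of the original answer:

#include <fcntl.h>
#include <unistd.h>

enum { BLOCK_SIZE = 1 << 20 };   /* 1 MiB per read() call; tune as needed */

/* hypothetical hook: parse the numbers found in buf[0..len) */
static void process_block(const char *buf, ssize_t len) { (void)buf; (void)len; }

static void read_in_blocks(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    static char buf[BLOCK_SIZE];
    ssize_t n;
    while ((n = read(fd, buf, BLOCK_SIZE)) > 0)
        process_block(buf, n);   /* note: a number may straddle two blocks */
    close(fd);
}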

I benchmarked my parse routine above and the OP's routine using code that looked like this:

#include <sys/times.h>   // times(), struct tms
#include <cstddef>
#include <vector>
#include <iostream>

clock_t before_real;
clock_t after_real;
struct tms before;
struct tms after;
std::vector<char *> numbers;
make_numbers(numbers);               // fills in the test data
before_real = times(&before);
for (std::size_t i = 0; i < numbers.size(); ++i) {
    parse(numbers[i]);               // the routine being benchmarked
}
after_real = times(&after);
std::cout << "user: " << after.tms_utime - before.tms_utime
          << std::endl;
std::cout << "real: " << after_real - before_real
          << std::endl;

The difference between real and user is that real is wall-clock time, while user is the CPU time actually spent running the process (so time when the process is switched out is not counted against the running time).

My tests had my routine running almost twice as fast as the OP's routine (compiled with g++ -O3 on a 64-bit Linux system).

You are looking in the wrong place. It isn't the parsing that is the issue, unless you are doing something truly bizarre. On a modern N-GHz CPU the cycles needed per line are tiny. What kills performance is physical I/O: spinning disks deliver data on the order of tens of MB per second.

I also doubt that the issue is the physical read of the file, as this will be efficiently cached in the file system cache.

No, as samy.vilar hints, the issue is almost certainly a virtual memory one:

...the array is too huge...

Use the system monitor/psinfo/top to look at your app. Almost certainly it is growing a large working set as it builds up an in-memory array, and your OS is paging that to disk.

So forget reading as an issue. Your real issue is how to manipulate huge data sets in memory. The approaches here are various:

  • Don't. Batch up the data and manipulate batches (see the sketch after this list).
  • Use space-efficient storage (e.g. compact elements).
  • Allocate more memory resources.
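As a sketch of the first approach, assuming the computation can be done batch by batch; process_batch and the batch size are illustrative names, not something this answer specifies:

#include <cstddef>
#include <fstream>
#include <vector>

// hypothetical hook: whatever computation you need, applied to one batch
static void process_batch(std::vector<int> &batch) { (void)batch; }

static void process_in_batches(const char *path) {
    const std::size_t BATCH = 1000000;   // at most one million ints in memory
    std::ifstream f(path);
    std::vector<int> batch;
    batch.reserve(BATCH);
    int v;
    while (f >> v) {
        batch.push_back(v);
        if (batch.size() == BATCH) { process_batch(batch); batch.clear(); }
    }
    if (!batch.empty()) process_batch(batch);   // final partial batch
}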

There are many discussions around this on SO.

If your file is truly huge, then the I/O is what's killing you, not the parsing. Every time you read a line, you're executing a system call, which can be quite expensive.

A more efficient alternative may be to use memory-mapped file I/O. If you're working on a POSIX system such as Linux, you can use the mmap call, which maps the file all at once and returns a pointer to its location in memory. The memory manager then takes care of reading and swapping the file in and out as you access the data through that pointer.

This would look something like this:

#include <sys/mman.h>
#include <fcntl.h>

int fd = open("abc.txt", O_RDONLY);
char *ptr = (char *) mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0);  /* length: the file size, e.g. from fstat() */

but I would strongly advise you to read the man page and find the best options for yourself.
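As a rough sketch of what scanning the mapped region might look like (error handling omitted; the digit loop is illustrative and assumes well-formed input, and is not part of the original answer):

#include <stddef.h>
#include <ctype.h>
#include <sys/mman.h>

/* ptr and length as obtained above; parse the mapped bytes in place */
void scan_mapped(char *ptr, size_t length) {
    size_t i = 0;
    while (i < length) {
        /* skip anything that cannot start a number */
        while (i < length && ptr[i] != '-' && !isdigit((unsigned char)ptr[i]))
            ++i;
        if (i >= length) break;
        int neg = 0;
        if (ptr[i] == '-') { neg = 1; ++i; }
        long val = 0;
        while (i < length && isdigit((unsigned char)ptr[i]))
            val = 10 * val + (ptr[i++] - '0');
        /* ... use neg ? -val : val ... */
    }
    munmap(ptr, length);   /* release the mapping when done */
}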

  1. If your file contains int numbers, you can use operator>>, but this is a C++-only solution (see the sketch after this list). Something like:

     std::fstream f("abc.txt"); int value = 0; f >> value;
  2. If you convert your file to contain a binary representation of the numbers, you will have more options to improve performance. Not only does it avoid parsing the numbers from strings, it also gives you other options for accessing your data (for example, using mmap).
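Expanding on the first option, here is a minimal, self-contained sketch that reads every int in the file with operator>>; the running sum is just a placeholder computation:

#include <fstream>
#include <iostream>

int main() {
    std::ifstream f("abc.txt");
    int value = 0;
    long long sum = 0;        // placeholder: do something with each value
    while (f >> value)        // operator>> skips whitespace and handles '-'
        sum += value;
    std::cout << "sum: " << sum << '\n';
}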

First of all, a general recommendation is to always use profiling to check that it actually is the translation that is slow, and not something else, such as physically reading the file from disk.

You may be able to improve performance by writing your own minimal number-parsing function. strtok modifies the string, so it will not be optimally fast, and if you know that all the numbers are decimal integers and you don't need any error checking, you can simplify the translation a bit.

Here is some code without strtok that may speed up the processing of a line, if it actually is the translation and not (for example) I/O that is the problem:

#include <ctype.h>

void handle_one_number(int number) {
    // ....
}

// Scans buffer for optionally negative decimal integers and calls
// handle_one_number() for each one. Assumes well-formed input
// (digits, '-' and whitespace only), with no error checking.
void handle_numbers_in_buffer(char *buffer) {
    while (1) {
        while (*buffer != '\0' && isspace((unsigned char)*buffer))
            ++buffer;
        if (*buffer == '\0')
            return;
        int negative = 0;
        if (*buffer == '-') {
            negative = 1;
            ++buffer;
        }
        int number = 0;
        while (isdigit((unsigned char)*buffer)) {
            number = number * 10 + *buffer - '0';
            ++buffer;
        }
        if (negative)
            number = -number;
        handle_one_number(number);
    }
}
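A possible driver for this routine, appended after the function above, reading the file line by line with fgets; the 4096-byte buffer is an assumption about the maximum line length, not something the answer specifies:

#include <stdio.h>

int main(void) {
    FILE *f = fopen("abc.txt", "r");
    if (!f) return 1;
    char line[4096];
    while (fgets(line, sizeof line, f))
        handle_numbers_in_buffer(line);   /* parse every number on the line */
    fclose(f);
    return 0;
}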

I actually went and ran some benchmarks. I had expected the I/O to be dominant, but it turns out that (with the usual caveat about "on my particular system, with my particular compiler") the parsing of the numbers takes quite a lot of the time.

By changing from the strtok version to my code above, I managed to improve the time for the translation of 100 million numbers (with the text already in memory) from 5.2 seconds to around 1.1 seconds. When reading from a slow disk (Caviar Green) I measured an improvement from 5.9 seconds to 3.5 seconds. When reading from an SSD, I measured an improvement from 5.8 to 1.8 seconds.

I also tried reading the file directly, using while (fscanf(f, "%d", ....) == 1) ...., but that turned out to be much slower (10 seconds), probably because fscanf is thread-safe and more calls require more locking.

(GCC 4.5.2 on Ubuntu 11.04 with -O2 optimization, several executions of each version, flushing disk caches between runs, i7 processor.)
