
Is there a faster way than fscanf for reading large data?

I am struggling with a large amount of structured data. I have a file that contains names and door numbers. I am using fscanf to read the names and numbers, then storing them in smaller files using fprintf.

while ( fscanf(file, "%s %d", people[i].name, &people[i].doorNum) > 0 ) {
      /* note: name is a char*, so it must already point to storage
         large enough for the longest token */
      ...
}

people is a struct array:

typedef struct {
    char* name;
    int doorNum;
} person;

The file I'm trying to read is 15 GB. My goal is to read it and split it into 1 GB files. It works correctly, but it takes more than ten minutes. How can I improve the reading and writing?

You don't tell us what you mean by "splitting".

It could be that reading the fields as a string of characters and an integer is useless (maybe a single string, or two separate strings, is enough).

Write your own scanning function with built-in knowledge of the pattern to be matched; this will certainly be more efficient. Even writing your own conversion to integer should be better.

fscanf() has a lot of features which are not likely to be used at the same time, which makes it slower. I suggest that you code your own function using fread(). As your function will have only one specific task, it should be faster.
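As a rough illustration of that idea, a single-purpose scanner might pull the name and the number straight out of a buffer filled by fread(). This is only a sketch; parse_line() is a made-up helper, not part of any library:

```c
#include <stddef.h>

/* Sketch: scan one "name doorNum" line from a buffer filled by fread().
 * Returns a pointer to the start of the next line. */
static const char *parse_line(const char *p, const char *end,
                              char *name, size_t name_cap, int *doorNum)
{
    size_t n = 0;
    while (p < end && *p != ' ' && *p != '\n' && n + 1 < name_cap)
        name[n++] = *p++;                 /* copy the name token */
    name[n] = '\0';

    while (p < end && *p == ' ')          /* skip the separator */
        p++;

    int value = 0;
    while (p < end && *p >= '0' && *p <= '9')
        value = value * 10 + (*p++ - '0');
    *doorNum = value;

    while (p < end && *p != '\n')         /* skip anything left on the line */
        p++;
    return (p < end) ? p + 1 : end;
}
```

A real version would also have to handle a record that straddles two fread() buffers, e.g. by moving the trailing partial line to the front of the buffer before the next read.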

All files contain binary data. Some formats are efficient, and some aren't.

For example, to store the number 0x1234 you could store it as the 2-byte sequence 0x34, 0x12 so that it can be reconstructed with a small number of simple/fast operations (e.g. value = buffer[pos] | (buffer[pos+1] << 8);). This would be relatively efficient.
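That reconstruction can be wrapped in a pair of tiny helpers. A sketch, assuming a 16-bit little-endian field (the helper names are made up):

```c
#include <stdint.h>

/* Sketch: store/load a 16-bit value as the 2-byte little-endian
 * sequence described above (0x1234 becomes 0x34, 0x12). */
static void put_u16le(uint8_t *buf, uint16_t value)
{
    buf[0] = (uint8_t)(value & 0xFF);   /* low byte first */
    buf[1] = (uint8_t)(value >> 8);     /* then high byte */
}

static uint16_t get_u16le(const uint8_t *buf)
{
    return (uint16_t)(buf[0] | (buf[1] << 8));
}
```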

Alternatively, you could store it as the 5-byte sequence 0x34, 0x36, 0x36, 0x30, 0x00 where each byte represents an ASCII character of the decimal string "4660" (with a zero terminator at the end); then you would have to scan the bytes and convert them from decimal to integer using an expensive loop like this:

    while( (c = buffer[pos++]) != 0) {
        if( (c < '0') || (c > '9') ) {
            // Error condition(!)
        }
        value = value * 10 + c - '0';
    }

Then you could make it worse by wrapping it in "convenience" (e.g. fscanf()), where the code has to scan a format string just to figure out that it needs to do something like that expensive loop.

Basically, if you care about performance and/or efficiency (including file size) you need to stop using "plain text" and design a file format to suit the data; especially when you're looking at huge 15 GB files.

EDIT: Added everything below!

If you're stuck with "plain text", then you can get a little more performance by doing more of the parsing yourself (e.g. using functions like atoi(), etc.). The next step beyond that is to use your own (more specialised) routines instead of functions like atoi().
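For instance, a specialised replacement for atoi() can skip the whitespace, sign, and error handling the library version pays for, and report where the number ended so the caller can keep scanning. A sketch (parse_uint() is a made-up name):

```c
/* Sketch: unsigned decimal parser with none of atoi()'s generality.
 * No whitespace skipping, no sign handling, no overflow check. */
static unsigned parse_uint(const char *p, const char **endp)
{
    unsigned value = 0;
    while (*p >= '0' && *p <= '9')
        value = value * 10u + (unsigned)(*p++ - '0');
    if (endp)
        *endp = p;   /* first non-digit character */
    return value;
}
```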

The next step beyond that is to use a deterministic finite state machine. The general idea might go something like:

    // Assumes state constants are multiples of 0x100 (e.g. START_OF_LINE = 0x100,
    // GETTING_NAME = 0x200, GETTING_NUMBER = 0x300) so they can be OR'd with a byte.
    switch( state | buffer[pos++] ) {
        case START_OF_LINE | 'A':
        case START_OF_LINE | 'B':
        case START_OF_LINE | 'C':
            string_start = pos - 1;
            string_length = 1;
            state = GETTING_NAME;
            break;
        case GETTING_NAME | 'A':
        case GETTING_NAME | 'B':
        case GETTING_NAME | 'C':
            string_length++;
            break;
        case GETTING_NAME | ' ':
            number = 0;
            state = GETTING_NUMBER;
            break;
        case GETTING_NUMBER | '0':
            number = number * 10;
            break;
        case GETTING_NUMBER | '1':
            number = number * 10 + 1;
            break;
        case GETTING_NUMBER | '2':
            number = number * 10 + 2;
            break;
        case GETTING_NUMBER | '\n':
            create_structure(&buffer[string_start], string_length, number);
            line++;
            state = START_OF_LINE;
            break;
        default:
            // Invalid character
            printf("Parse error at line %u!\n", line);
            break;
    }

Hopefully the compiler takes the huge switch() that you end up with and optimises it into a fast jump table. Of course constructing something like this by hand is painful and error-prone; you'd probably be able to find a "parser generator" that does it for you (based on rules).

The next step beyond that is multi-threading. For example, you can have a thread that scans through the file searching for '\n' characters, and when it finds one it hands the line off to a worker thread (where the worker thread can use any of the methods above to parse the line). In that way you can have multiple CPUs all parsing in parallel.
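One hedged sketch of the idea (an assumed design, not from the answer): split the buffer into one chunk per thread, snapping each boundary forward past the next '\n' so no line is split between workers, then let each thread parse its own chunk. count_lines() stands in for a real parser:

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 4

struct chunk { const char *start; const char *end; long lines; };

/* Stand-in for a real per-chunk parser: just counts complete lines. */
static void *count_lines(void *arg)
{
    struct chunk *c = arg;
    for (const char *p = c->start; p < c->end; p++)
        if (*p == '\n')
            c->lines++;
    return NULL;
}

static long parse_parallel(const char *buf, size_t len)
{
    struct chunk chunks[NUM_THREADS];
    pthread_t tids[NUM_THREADS];
    const char *pos = buf, *end = buf + len;

    for (int i = 0; i < NUM_THREADS; i++) {
        const char *stop = buf + (len * (size_t)(i + 1)) / NUM_THREADS;
        if (stop < pos)                       /* earlier chunk already passed here */
            stop = pos;
        while (stop < end && *stop != '\n')   /* snap forward past the next */
            stop++;                           /* line boundary              */
        if (stop < end)
            stop++;
        chunks[i] = (struct chunk){ pos, stop, 0 };
        pos = stop;
        pthread_create(&tids[i], NULL, count_lines, &chunks[i]);
    }

    long total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(tids[i], NULL);
        total += chunks[i].lines;
    }
    return total;
}
```

Compile with -lpthread. A hand-off queue per worker (as the paragraph describes) is the more flexible design; fixed chunks are simply the shortest thing to show.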

In addition to all of that, you want to be loading data from disk while you're parsing it. For example, while you're processing the first MiB of data you want to be loading the second MiB in parallel; you don't want to load 1 MiB, parse it, load the next MiB, parse it, and so on. To do this you need to use something like (e.g.) the POSIX asynchronous IO functions; or alternatively (on a 64-bit OS that supports pre-fetching) memory-mapped files.
