
Very Slow Data Processing

Consider the following code, which loads a dataset of records into a buffer and creates a Record object for each record. A record consists of one or more columns, and the number of columns is discovered at run time. In this particular example, however, I have set the number of columns to 3.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef unsigned int uint;

typedef struct
{
        uint *data;

} Record;

Record *createNewRecord (short num_cols);

int main(int argc, char *argv[])
{
        time_t start_time, end_time;
        int num_cols = 3;
        char *relation;
        FILE *stream;
        int offset;

        char *filename = "file.txt";
        stream = fopen(filename, "r");
        fseek(stream, 0, SEEK_END);
        long fsize = ftell(stream);
        fseek(stream, 0, SEEK_SET);

        if(!(relation = (char*) malloc(sizeof(char) * (fsize + 1))))
                printf("Could not allocate buffer\n");

        fread(relation, sizeof(char), fsize, stream);
        relation[fsize] = '\0';
        fclose(stream);

        char *start_ptr = relation;
        char *end_ptr = (relation + fsize);

        while (start_ptr < end_ptr)
        {
                Record *new_record = createNewRecord(num_cols);

                for(short i = 0; i < num_cols; i++)
                {
                        sscanf(start_ptr, " %u %n",
                        &(new_record->data[i]), &offset);

                        start_ptr += offset;
                }
        }

        return 0;
}

Record *createNewRecord (short num_cols)
{
        Record *r;

        if(!(r       = (Record *) malloc(sizeof(Record)))    ||
           !(r->data = (uint *) malloc(sizeof(uint) * num_cols)))
        {
                printf("Failed to create a new record\n");
        }

        return r;
}

This code is highly inefficient. My dataset contains around 31 million records (~1 GB) and this code processes only ~200 records per minute. I load the dataset into a buffer because I'll later have multiple threads process the records in this buffer, and I want to avoid file accesses. Moreover, I have 48 GB of RAM, so holding the dataset in memory should not be a problem. Any ideas on how to speed things up?

SOLUTION: the sscanf function was actually extremely slow and inefficient. When I switched to strtoul, the job finished in less than a minute. Malloc-ing ~3 million structs of type Record took only a few seconds.

I'm confident that some non-numeric data is lurking in the file.

int offset;
...
sscanf(start_ptr, " %u %n", &(new_record->data[i]), &offset);
start_ptr += offset;

Notice that if the file begins with non-numeric input, offset is never set, and if it happened to hold the value 0, start_ptr += offset; would never advance the pointer.

If non-numeric data appears later in the file, such as "3x", offset gets the value 1 after the "3". The next call then fails at "x" and never updates offset, so the while loop proceeds slowly on a stale value.

It is best to check the results of fread(), ftell() and sscanf() for unexpected return values and act accordingly.
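As a rough illustration of that advice, the file-loading part could check every call before using its result. This is only a sketch: load_file is a hypothetical helper name that does not appear in the question, and opening in "rb" mode (instead of the question's "r") is an assumption made so that the size reported by ftell() matches what fread() returns.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file into a NUL-terminated buffer.
 * Returns NULL on any failure; on success stores the byte count in
 * *out_size, and the caller must free() the buffer. */
static char *load_file(const char *path, size_t *out_size)
{
    FILE *stream = fopen(path, "rb");
    if (stream == NULL)
        return NULL;

    char *buffer = NULL;
    long fsize = -1L;

    /* fseek returns non-zero on failure; ftell returns -1L on failure. */
    if (fseek(stream, 0L, SEEK_END) == 0)
        fsize = ftell(stream);

    if (fsize >= 0 && fseek(stream, 0L, SEEK_SET) == 0)
        buffer = malloc((size_t)fsize + 1);

    if (buffer != NULL) {
        /* fread may return fewer bytes than requested. */
        size_t got = fread(buffer, 1, (size_t)fsize, stream);
        if (got != (size_t)fsize) {
            free(buffer);
            buffer = NULL;
        } else {
            buffer[got] = '\0';
            *out_size = got;
        }
    }

    fclose(stream);
    return buffer;
}
```

A caller would test the returned pointer against NULL and stop instead of parsing an unallocated or half-filled buffer.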

Further: long fsize may be too small a type for the file size. Look into using fgetpos() and fsetpos().

Note: to save processing time, consider using strtoul(), as it is certainly faster than sscanf(" %u %n"). Again, check for errant results.
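A minimal sketch of a strtoul()-based parser with those checks follows. The name parse_uints and its return convention are made up for illustration. One commonly cited reason for the speed difference is that some C library implementations of sscanf() measure the length of the whole remaining string on every call, whereas strtoul() only reads the characters of the number itself. Treat that as a likely explanation, not something measured here.

```c
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper: parse up to max_values unsigned ints from a
 * NUL-terminated buffer into out. Returns the number parsed, or -1 if
 * non-numeric text or an out-of-range value is found. */
static long parse_uints(const char *text, unsigned int *out, long max_values)
{
    const char *cursor = text;
    long count = 0;

    while (count < max_values) {
        /* Skip whitespace ourselves so end-of-input is easy to detect. */
        while (isspace((unsigned char)*cursor))
            cursor++;
        if (*cursor == '\0')
            break;

        /* Require a digit: strtoul alone would also accept a sign. */
        if (!isdigit((unsigned char)*cursor))
            return -1;

        char *end;
        errno = 0;
        unsigned long value = strtoul(cursor, &end, 10);

        /* Reject values that do not fit in an unsigned int. */
        if (errno == ERANGE || value > UINT_MAX)
            return -1;

        out[count++] = (unsigned int)value;
        cursor = end;   /* continue right after the number */
    }
    return count;
}
```

With input such as "7 3x 9" the function stops at the "x" and reports an error instead of looping on a stale offset.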

BTW: If the code needs to use sscanf(), use sscanf("%u%n"); it is a tad faster and, for your code, has the same functionality.
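If sscanf() stays, the return value can be tested before offset is trusted. The sketch below assumes a hypothetical helper named parse_uint_checked. It relies on the documented behavior that sscanf() returns the number of successful conversions and that %n does not count toward that number.

```c
#include <stdio.h>

/* Hypothetical helper: parse one unsigned int at *cursor.
 * Returns 1 and advances *cursor on success; returns 0 and leaves
 * *cursor untouched if no number could be converted. */
static int parse_uint_checked(const char **cursor, unsigned int *out)
{
    int offset = 0;

    /* Anything other than 1 means %u did not match (or input ended),
     * in which case %n was never reached and offset is not valid. */
    if (sscanf(*cursor, "%u%n", out, &offset) != 1)
        return 0;

    *cursor += offset;
    return 1;
}
```

The caller can then break out of the loop on a 0 return, so a stray "x" ends parsing with an error instead of being stepped over with an old offset.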

I'm not an optimization professional, but I think some tips should help.

First of all, I suggest you define filename and num_cols as macros, since I don't see you changing their values in the code and they tend to be faster as literals.

Second, using a struct to store only one member is generally not recommended, but if you want to use it with functions you should only pass pointers. I see you're calling malloc once for the struct and again for its only member, and I suppose that is why it is so slow: you're using twice the memory you need. This might not be the case with some compilers, however. Practically, a struct with only one member is pointless. If you want to make sure that the integer you get (in your case) is specifically a record, you can typedef it.
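One way to act on this is to drop the per-record allocations and keep all values in a single contiguous block, with record i starting at index i * num_cols. This is only a sketch. RecordTable, table_init, table_record and table_free are hypothetical names, and it assumes the number of records is known (or estimated) before allocation.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout: every record lives in one contiguous array. */
typedef struct {
    unsigned int *values;   /* num_records * num_cols entries */
    size_t num_records;
    size_t num_cols;
} RecordTable;

/* Returns 1 on success, 0 on invalid sizes or allocation failure. */
static int table_init(RecordTable *t, size_t num_records, size_t num_cols)
{
    t->values = NULL;
    t->num_records = 0;
    t->num_cols = 0;

    /* Reject empty tables and guard the multiplication against overflow. */
    if (num_records == 0 || num_cols == 0 || num_records > SIZE_MAX / num_cols)
        return 0;

    /* One zero-initialized allocation instead of two per record. */
    t->values = calloc(num_records * num_cols, sizeof *t->values);
    if (t->values == NULL)
        return 0;

    t->num_records = num_records;
    t->num_cols = num_cols;
    return 1;
}

/* Pointer to the first column of the record at the given index. */
static unsigned int *table_record(const RecordTable *t, size_t index)
{
    return t->values + index * t->num_cols;
}

static void table_free(RecordTable *t)
{
    free(t->values);
    t->values = NULL;
}
```

Worker threads could then be handed disjoint index ranges of the same table without any further allocation.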

You should also make end_ptr and fsize const for some optimization.

Now, as for functionality, have a look at memory-mapped I/O.
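A minimal sketch of what that could look like on a POSIX system, using open(), fstat() and mmap(). The helper names map_file and next_uint are hypothetical. Because a mapping is not NUL-terminated, the sketch parses digits against an explicit end pointer instead of calling strtoul(), and it does not check for overflow.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a file read-only. Returns NULL on failure
 * (including an empty file, which cannot be mapped). */
static const char *map_file(const char *path, size_t *out_size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    void *addr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (addr == MAP_FAILED)
        return NULL;

    *out_size = (size_t)st.st_size;
    return addr;
}

/* Parse one unsigned int from [p, end). Returns the position after the
 * number, or NULL at end of data or on a non-numeric byte. */
static const char *next_uint(const char *p, const char *end, unsigned int *out)
{
    while (p < end && (*p == ' ' || *p == '\n' || *p == '\t' || *p == '\r'))
        p++;
    if (p == end || *p < '0' || *p > '9')
        return NULL;

    unsigned int value = 0;
    while (p < end && *p >= '0' && *p <= '9')
        value = value * 10u + (unsigned int)(*p++ - '0');

    *out = value;
    return p;
}
```

When the data is no longer needed, the mapping is released with munmap() using the same address and size.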
