C++ - Why is reading lines from a file with a tokenizer so slow?

I am trying to read 200,000 records from a file and then use a tokenizer to parse each string and remove the quotes around each part. But the running time is very high compared to plainly reading the strings: it took 25 seconds just to read these records (about 0.000125 seconds per record????). Is there a problem with my code, or, if not, is there a faster way to do this?

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <boost/tokenizer.hpp>

int main()
{
    int counter = 0;
    std::string getcontent;
    std::vector<std::string> line;
    std::vector< std::vector<std::string> > lines;

    boost::escaped_list_separator<char> sep( '\\', '*', '"' ) ;
    boost::tokenizer<> tok(getcontent);

    std::ifstream openfile ("test.txt");

    if(openfile.is_open())
    {
        while(!openfile.eof())
        {
            getline(openfile,getcontent);

            // THIS LINE TAKES A LOT OF TIME
            boost::tokenizer<> tok(getcontent); 

            for (boost::tokenizer<>::iterator beg=tok.begin(); beg!=tok.end(); ++beg){
                line.push_back(*beg);
            }

            lines.push_back(line);
            line.clear();
            counter++;
        }
        openfile.close();
    }
    else std::cout << "No such file" << std::endl;

    return 0;
}

At least if I'm reading this correctly, I'd take a rather more C-like approach. Instead of reading a line, breaking it into tokens, and stripping out the characters you don't want, I'd read one character at a time and, based on the character read, decide whether to add it to the current token, end the token and add it to the current line, or end the line and add it to the vector of lines:

#include <vector>
#include <string>
#include <stdio.h>
#include <time.h>

std::vector<std::vector<std::string> > read_tokens(char const *filename) {
    std::vector<std::vector<std::string> > lines;
    FILE *infile = fopen(filename, "r");
    if (infile == NULL)     // no such file
        return lines;

    int ch;

    std::vector<std::string> line;
    std::string token;

    while (EOF != (ch = getc(infile))) {
        switch (ch) {
            case '\n':      // end of record: store the line, start a new one
                lines.push_back(line);
                line.clear();
                token.clear();  // any partial token is dropped -- assumes
                                // every field is terminated by '*'
                break;
            case '"':       // quotes are simply skipped
                break;
            case '*':       // field separator: store the token, start a new one
                line.push_back(token);
                token.clear();
                break;
            default:        // anything else belongs to the current token
                token.push_back(ch);
        }
    }
    fclose(infile);
    return lines;
}

int main() {
    clock_t start = clock();
    std::vector<std::vector<std::string> > lines = read_tokens("sample_tokens.txt");
    clock_t finish = clock();
    printf("%f seconds\n", double(finish-start)/CLOCKS_PER_SEC);
    return 0;
}

Doing a quick test with this on a file containing a little over 200K copies of the sample you gave in the comment, it reads and (apparently) tokenizes the data in ~3.5 seconds with gcc, or ~4.5 seconds with VC++. I'd be a little surprised to see anything get a whole lot faster (at least without faster hardware).

As an aside, this handles memory about the way you originally did, which (at least in my opinion) is pretty strong evidence that managing memory in the vector probably isn't a major bottleneck.

Instead of boost::tokenizer<> tok(getcontent);, which constructs a new boost::tokenizer on every call to getline, use the assign member function:

boost::escaped_list_separator<char> sep( '\\', '*', '"' ) ;
boost::tokenizer<boost::escaped_list_separator<char>> tok(getcontent, sep);

// Other code
while(getline(openfile,getcontent))   // this also replaces the !eof() loop
{
    tok.assign(getcontent.begin(), getcontent.end()); // Use assign here
    line.assign(tok.begin(), tok.end());              // Instead of the for-loop
    lines.push_back(line);
    counter++;
}

See if that helps. Also, try allocating the vector memory beforehand if possible (see the sketch below).
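
For instance, a minimal sketch of reserving capacity up front. The 200,000 figure comes from the question; the per-record field count is a pure assumption about your data:

std::vector<std::string> line;
std::vector< std::vector<std::string> > lines;

lines.reserve(200000);   // number of records, per the question
line.reserve(10);        // assumed fields per record -- adjust to your data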

Okay, from the comments it seems you want a solution that's as fast as possible.

Here's what I would do to achieve something close to that requirement.

While you could probably get a memory-pool allocator to allocate your strings, STL is not my strong point, so I'm going to do it by hand. Beware: this is not necessarily the C++ way to do it, so C++ heads might cringe a little. Sometimes you just have to do this when you want something a little specialised.

So, your data file is about 10 GB... Allocating that in a single block is a bad idea; most likely your OS will refuse. But it's fine to break it up into a whole bunch of pretty big blocks. Maybe there's a magic number here, but let's say around 64 MB. People who are paging experts could comment here? I remember reading once that it's good to use a little less than an exact page-size multiple (though I can't recall why), so let's just lop off a few kB:

const size_t blockSize = 64 * 1048576 - 4096;
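
If you'd rather not hard-code the 4096, a speculative variant queries the page size at run time (POSIX-only; treat sysconf here as an assumption about your platform):

#include <unistd.h>

// Shave one page off the 64 MB figure instead of a hard-coded 4096.
const size_t pageSize  = (size_t)sysconf(_SC_PAGESIZE);
const size_t blockSize = 64 * 1048576 - pageSize;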

Now, how about a structure to track your memory? May as well make it a linked list so you can chain the blocks together.

struct SBlock {
    SBlock *next;
    size_t length; // Number of valid bytes in data (the code below uses this).
    char *data;    // Some APIs use data[1] so you can use the first element, but
                   // that's a hack that might not work on all compilers.
};

Right, so you need to allocate a block - you'll allocate a large chunk of memory and use the first little bit to store some information. Note that you can change the data pointer if you need to align your memory (a sketch of that follows the function below):

SBlock * NewBlock( size_t blockSize, SBlock *prev = NULL )
{
    SBlock * b = (SBlock*)new char [sizeof(SBlock) + blockSize];
    if( prev != NULL ) prev->next = b;
    b->next = NULL;
    b->data = (char*)(b + 1);           // First char following the struct
    b->length = blockSize;
    return b;
}
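
And if you do want that alignment (say, to the 4096-byte sector size assumed earlier), a speculative variant over-allocates and rounds the data pointer up; align must be a power of two, and uintptr_t comes from <stdint.h>:

SBlock * NewAlignedBlock( size_t blockSize, size_t align, SBlock *prev = NULL )
{
    // Over-allocate by 'align' so there is room to round the data pointer up.
    SBlock * b = (SBlock*)new char [sizeof(SBlock) + blockSize + align];
    if( prev != NULL ) prev->next = b;
    b->next = NULL;
    uintptr_t p = (uintptr_t)(b + 1);
    b->data = (char*)((p + align - 1) & ~(uintptr_t)(align - 1));
    b->length = blockSize;
    return b;
}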

Now you're gonna read...

FILE *infile = fopen( "mydata.csv", "rb" );  // Told you C++ers would hate me
if( infile == NULL ) return 1;               // assuming this lives in main()

SBlock *blocks = NULL;
SBlock *block = NULL;
size_t spilloverBytes = 0;

while( !feof(infile) ) {
    // Allocate new block.  If there was spillover, a new block will already
    // be waiting so don't do anything.
    if( spilloverBytes == 0 ) block = NewBlock( blockSize, block );

    // Set list head.
    if( blocks == NULL ) blocks = block;

    // Read a block of data
    size_t nBytesReq = block->length - spilloverBytes;
    char* front = block->data + spilloverBytes;
    size_t nBytes = fread( (void*)front, 1, nBytesReq, infile );
    if( nBytes == 0 ) {
        block->length = spilloverBytes;
        break;
    }

    // Search backwards for a newline and treat all characters after that newline
    // as spillover -- they will be copied into the next block.
    char *back = front + nBytes - 1;
    while( back > front && *back != '\n' ) back--;
    back++;

    spilloverBytes = (front + nBytes) - back;   // bytes after the last newline
    block->length = back - block->data;

    // Transfer that data to a new block and resize current block.
    if( spilloverBytes > 0 ) {
        block = NewBlock( blockSize, block );
        memcpy( block->data, back, spilloverBytes );
    }
}

// If the file didn't end in a newline, the last block holds only spillover.
if( spilloverBytes > 0 && block != NULL ) block->length = spilloverBytes;

fclose(infile);

Okay, something like that. You get the gist. Note that at this point you've probably read the file considerably faster than with multiple calls to std::getline. You can get faster still if you can disable any caching. On Windows you can use the CreateFile API and tweak it for really fast reads (a speculative sketch follows); hence my earlier comment about potentially aligning your data blocks to the disk sector size. Not sure about Linux or other OSes.
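
For illustration, here's a minimal sketch of that Windows route using unbuffered reads. It's an assumption-laden sketch, not a drop-in replacement: FILE_FLAG_NO_BUFFERING requires the buffer address and the read size to be multiples of the sector size (4096 is assumed here), and error handling is omitted:

#include <windows.h>
#include <malloc.h>

HANDLE h = CreateFileA( "mydata.csv", GENERIC_READ, FILE_SHARE_READ, NULL,
                        OPEN_EXISTING,
                        FILE_FLAG_NO_BUFFERING | FILE_FLAG_SEQUENTIAL_SCAN,
                        NULL );

// Sector-aligned buffer -- required when FILE_FLAG_NO_BUFFERING is set.
char *buffer = (char*)_aligned_malloc( blockSize, 4096 );

DWORD nRead = 0;
ReadFile( h, buffer, (DWORD)blockSize, &nRead, NULL );

// ... process buffer, then clean up ...
_aligned_free( buffer );
CloseHandle( h );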

So, this is a kind of complicated way to slurp an entire file into memory, but it's simple enough to be accessible and moderately flexible. Hopefully I didn't make too many errors. Now you just want to go through your list of blocks and start indexing them.

I'm not going to go into huge detail here, but the general idea is this: you tokenise in-place by blitzing NUL bytes at the appropriate places, and keeping track of where each token began.

std::vector<char*> tokens;                  // see below -- you can do better
std::vector< std::vector<char*> > lines;    // than vectors here

SBlock *block = blocks;

while( block ) {
    char *c = block->data;
    char *back = c + block->length;
    char *token = NULL;

    // Find first token
    while( c != back ) {
        if( *c != '"' && *c != '*' && *c != '\n' ) break;
        c++;
    }
    token = c;

    // Tokenise entire block
    while( c != back ) {
        switch( *c ) {
            case '"':
                // For speed, we assume all closing quotes have opening quotes.  If
                // we have closing quote without opening quote, this won't be correct
                if( token != c) {
                    *c = 0;
                    token++;
                }
                break;

            case '*':
                // Record separator
                *c = 0;
                tokens.push_back(token);  // You can do better than this...
                token = c + 1;
                break;

            case '\n':
                // Record and line separator
                *c = 0;
                tokens.push_back(token);  // You can do better than this...
                lines.push_back(tokens);  // ... and WAY better than this...
                tokens.clear();           // Arrrgh!
                token = c + 1;
                break;
        }

        c++;
    }

    // Next block.
    block = block->next;
}

Finally, you'll see those vector-like calls above. Now, again, if you can memory-pool your vectors, that's great and easy. But once again, I just never do it because I find it a lot more intuitive to work directly with memory. You can do something similar to what I did with the file chunks, but create memory for arrays (or lists) instead. You add all your tokens (which are just 8-byte pointers) to this memory area and add new chunks of memory as required.

You might even make a little header that keeps track of how many items are in one of these token arrays (see the sketch below). The key is never to calculate something up front that you can calculate later at no extra cost (i.e. an array size -- you only need to compute it after you've added the last element).
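
As a rough illustration of that header idea, in the same hand-rolled style as the blocks above (the names here are entirely hypothetical):

struct STokenChunk {
    STokenChunk *next;
    size_t count;       // number of pointers in use -- written once, after
                        // the last token has been added to this chunk
    char **tokens;      // points just past this header, like SBlock::data
};

STokenChunk * NewTokenChunk( size_t maxTokens, STokenChunk *prev = NULL )
{
    STokenChunk *t = (STokenChunk*)new char [sizeof(STokenChunk)
                                             + maxTokens * sizeof(char*)];
    if( prev != NULL ) prev->next = t;
    t->next = NULL;
    t->count = 0;
    t->tokens = (char**)(t + 1);
    return t;
}

// Usage sketch: replace tokens.push_back(token) with chunk->tokens[n++] = token,
// allocate a fresh chunk when n reaches maxTokens, and store n into
// chunk->count when the chunk is sealed.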

You do the same again with lines. All you need is a pointer to the relevant part of a token chunk (and you have to do the spillover thing if a line eats into a new chunk and you want array indexing).

What you'll end up with is an array of lines pointing into arrays of tokens, which in turn point directly into the memory you slurped out of the file. And while there's a bit of memory wastage, it's probably not excessive. It's the price you pay for making your code fast.

I'm sure it could all be wrapped up beautifully in a few simple classes, but I've given it to you raw here. Even if you memory-pooled a bunch of STL containers, I expect the overhead of those allocators along with the containers themselves would still make it slower than what I've given you. Sorry for the really long answer. I guess I just enjoy this stuff. Have fun, and I hope this helps.
