Char* vs String Speed in C++

I have a C++ program that reads data from a binary file. Originally I stored the data in std::vector<char*> data. I have since changed my code to use strings instead of char*, so it is now std::vector<std::string> data. One of the changes I had to make was switching from strcmp to compare, for example.

However, my execution time has increased dramatically. For a sample file, the char* version took 0.38s and the string version took 1.72s on my Linux machine. I observed a similar problem on my Windows machine, with execution time increasing from 0.59s to 1.05s.

I believe this function is causing the slowdown. It is part of the converter class; note that private variables are designated with _ at the end of the variable name. I clearly have memory problems here and am stuck between C and C++ code. I want this to be C++ code, so I updated the code at the bottom.

I access ids_ and names_ many times in another function too, so access speed is very important. By using a map instead of two separate vectors, I have been able to achieve faster speeds with more stable C++ code. Thanks to everyone!

Example NewList.txt

2515    ABC 23.5    32  -99 1875.7  1  
1676    XYZ 12.5    31  -97 530.82  2  
279  FOO 45.5    31  -96  530.8  3  

OLD Code:

void converter::updateNewList(){
    FILE* NewList;
    char lineBuffer[100];
    char* id = 0;
    char* name = 0;

    int l = 0;
    int n;

    NewList = fopen("NewList.txt","r");
    if (NewList == NULL){
        std::cerr << "Error in reading NewList.txt\n";
        exit(EXIT_FAILURE);
    } 

    while(!feof(NewList)){
        fgets (lineBuffer , 100 , NewList); // Read line    
        l = 0;
        while (!isspace(lineBuffer[l])){
            l = l + 1;
        }

        id = new char[l];
        switch (l){
            case 1: 
                n = sprintf (id, "%c", lineBuffer[0]);
                break;
            case 2:
                n = sprintf (id, "%c%c", lineBuffer[0], lineBuffer[1]);
                break;
            case 3:
                n = sprintf (id, "%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2]);        
                break;
            case 4:
                n = sprintf (id, "%c%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2],lineBuffer[3]);
                break;
            default:
                n = -1;
                break;
        }
        if (n < 0){
            std::cerr << "Error in processing ids from NewList.txt\n";
            exit(EXIT_FAILURE);
        }

        l = l + 1;
        int s = l;
        while (!isspace(lineBuffer[l])){
            l = l + 1;
        }
        name = new char[l-s];
        switch (l-s){
            case 2:
                n = sprintf (name, "%c%c", lineBuffer[s+0], lineBuffer[s+1]);
                break;
            case 3:
                n = sprintf (name, "%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2]);
                break;
            case 4:
                n = sprintf (name, "%c%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2],lineBuffer[s+3]);
                break;
            default:
                n = -1;
                break;
        }
        if (n < 0){
            std::cerr << "Error in processing short name from NewList.txt\n";
            exit(EXIT_FAILURE);
        }


        ids_.push_back ( std::string(id) );
        names_.push_back(std::string(name));
    }

    bool isFound = false;
    for (unsigned int i = 0; i < siteNames_.size(); i ++) {
        isFound = false;
        for (unsigned int j = 0; j < names_.size(); j ++) {
            if (siteNames_[i].compare(names_[j]) == 0){
                isFound = true;
            }
        }
    }

    fclose(NewList);
    delete [] id;
    delete [] name;
}

C++ CODE

void converter::updateNewList(){
    std::ifstream NewList ("NewList.txt");

    while(NewList.good()){
        unsigned int id (0);
        std::string name;

        // get the ID and name
        NewList >> id >> name;

        // ignore the rest of the line
        NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');

        info_.insert(std::pair<std::string, unsigned int>(name,id));

    }

    NewList.close();
}

UPDATE: Follow-up question: Bottleneck from comparing strings. Thanks for the very useful help! I will not be making these mistakes in the future!

My guess is that it is tied to the performance of vector<string>.

About the vector

A std::vector stores its elements in an internal contiguous array. Once that array is full, the vector has to allocate another, larger array and transfer the strings to it one by one. Before C++11 this means a copy-construction followed by the destruction of a string with the same contents, which is counter-productive. (Since C++11 the strings are moved instead, which is much cheaper.)

To confirm this easily, use a std::vector<std::string *> and see if there is a difference in performance.

If this is the case, then you can do one of these four things:

  1. if you know (or have a good idea of) the final size of the vector, use its reserve() method to reserve enough space in the internal array and avoid useless reallocations.
  2. use a std::deque, which works almost like a vector
  3. use a std::list (which doesn't give you random access to its items)
  4. use the std::vector<char *>
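
As a rough illustration of the first option, here is a minimal sketch that counts how often the internal array is regrown with and without reserve(). The helper name build_names and its parameters are made up for this example, and the exact growth policy of the vector is implementation-defined, so only the reserved case has a predictable count.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: fills a vector with n short strings and reports, via
// the out-parameter, how many times the capacity changed along the way.
std::vector<std::string> build_names(std::size_t n, bool use_reserve,
                                     std::size_t& reallocations) {
    std::vector<std::string> names;
    if (use_reserve) {
        names.reserve(n);  // one allocation up front, large enough for n items
    }

    reallocations = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t before = names.capacity();
        names.push_back("ABC");
        if (names.capacity() != before) {
            ++reallocations;  // the internal array was replaced by a larger one
        }
    }
    assert(names.size() == n);
    return names;
}
```

With use_reserve set to true the counter stays at zero; without it, the vector regrows several times while it fills up.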

About the string

Note: I'm assuming that your strings/char * are created once and not modified afterwards (through a realloc, an append, etc.).

If the ideas above are not enough, then...

The allocation of the string object's internal buffer is similar to a malloc of a char *, so you should see little or no difference between the two.

Now, if your char * are in truth char[SOME_CONSTANT_SIZE], then you avoid the malloc (and thus will go faster than a std::string).
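
To make the fixed-size idea concrete, here is a small sketch (C++11) that stores each name inline in a std::array<char, N>, so no per-string heap allocation is needed. The names kMaxLen, ShortName, make_short_name and same_name are invented for this example, and the 4-character limit is an assumption taken from the question's sample data, not something the original code guarantees.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Assumption: names never exceed 4 characters.
constexpr std::size_t kMaxLen = 4;

// Fixed-size storage: the characters live inside the object itself.
using ShortName = std::array<char, kMaxLen + 1>;  // +1 for the terminating '\0'

// Copies a C string into a ShortName; the text must fit in kMaxLen characters.
ShortName make_short_name(const char* text) {
    const std::size_t len = std::strlen(text);
    assert(len <= kMaxLen);
    ShortName out{};  // zero-initialised, so the result is always terminated
    std::memcpy(out.data(), text, len);
    return out;
}

// Compares two short names by content.
bool same_name(const ShortName& a, const ShortName& b) {
    return std::strcmp(a.data(), b.data()) == 0;
}
```

A std::vector<ShortName> then keeps all names in one contiguous block, and growing the vector only copies a few bytes per element.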

Edit

After reading the updated code, I see the following problems.

  1. if ids_ and names_ are vectors, and if you have the slightest idea of the number of lines, then you should use reserve() on ids_ and names_
  2. consider making ids_ and names_ deques, or lists.
  3. faaNames_ should be a std::map, or even a std::unordered_map (or whatever hash_map you have on your compiler). Your search is currently two nested for loops, which is quite costly and inefficient.
  4. Consider comparing the lengths of the strings before comparing their contents. In C++, getting the length of a string (i.e. std::string::length()) is a constant-time, essentially free operation.
  5. Now, I don't know what you're doing with the isFound variable, but if you need to find only ONE true equality, then I guess you should work on the algorithm (I don't know if there is already one, see http://www.cplusplus.com/reference/algorithm/ ), but I believe this search could be made a lot more efficient just by thinking about it.
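
As an illustration of point 3, the nested loops can be replaced by one pass over a hash container. This is only a sketch (C++11): it uses std::unordered_set rather than a map because only membership is tested here, and the function name count_known_sites and its parameters are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Counts how many entries of `sites` also appear in `names`.
std::size_t count_known_sites(const std::vector<std::string>& sites,
                              const std::vector<std::string>& names) {
    // Build the hash set once; each insertion is O(1) on average.
    const std::unordered_set<std::string> known(names.begin(), names.end());

    std::size_t found = 0;
    for (const std::string& site : sites) {
        if (known.count(site) != 0) {  // average O(1) lookup, no inner loop
            ++found;
        }
    }
    assert(found <= sites.size());
    return found;
}
```

The work drops from roughly sites.size() * names.size() string comparisons to about sites.size() + names.size() hash operations on average.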

Other comments:

  1. Forget the use of int for sizes and lengths in the STL. At the very least, use size_t. In 64-bit, size_t will become 64-bit, while int will remain 32-bit, so your code is not 64-bit ready (on the other hand, I see few cases of incoming 8 GB strings... but still, better be correct...)

Edit 2

The two (so-called C and C++) codes are different. The "C code" expects ids and names shorter than 5 characters, or the program exits with an error. The "C++ code" has no such limitation. Still, this limitation is grounds for massive optimization, if you can confirm that names and ids are always shorter than 5 characters.

Resize the vector to a large enough size before you start populating it. Or use pointers to strings instead of strings.

The thing is that the strings are copied each time the vector is auto-resized. For small objects such as pointers this costs nearly nothing, but for strings the whole string is copied in full.

Also, id and name should be string instead of char*, and be initialized like this (assuming that you still use string instead of string*):

id = string(lineBuffer, lineBuffer + l);
...
// here l is the index one past the end of the name, s is where it starts
name = string(lineBuffer + s, lineBuffer + l);
...
ids_.push_back(id);
names_.push_back(name);
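
For a self-contained version of the same idea, here is a sketch (C++11) that pulls the first two whitespace-separated fields out of a line buffer using the iterator-range constructor of std::string, with no intermediate char[] copies. The function name first_two_fields is hypothetical, and unlike the original code it skips any run of whitespace between the fields.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <utility>

// Returns the first two whitespace-separated fields of a C string.
std::pair<std::string, std::string> first_two_fields(const char* line) {
    assert(line != nullptr);
    // isspace needs an unsigned char value to be well defined.
    auto is_space = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };

    const char* p = line;
    const char* id_begin = p;
    while (*p != '\0' && !is_space(*p)) ++p;  // scan the id
    const char* id_end = p;

    while (*p != '\0' && is_space(*p)) ++p;   // skip the separator run

    const char* name_begin = p;
    while (*p != '\0' && !is_space(*p)) ++p;  // scan the name
    const char* name_end = p;

    // Each string is built directly from a [begin, end) range of the buffer.
    return {std::string(id_begin, id_end), std::string(name_begin, name_end)};
}
```

Called on the first sample line, this would give "2515" and "ABC"; a line with only one field yields an empty second string.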

Before fixing something, make sure that it is the bottleneck. Otherwise you are wasting your time. Besides, this sort of optimization is micro-optimization. If you are doing micro-optimization in C++, then consider using bare C.

Except for std::string, this is a C program.

Try using fstream, and use a profiler to detect the bottleneck.

You can try to reserve a number of vector values in order to reduce the number of allocations (which are costly), as Dialecticus said (probably from ancient Rome?).

But there is something else that may deserve a look: how do you store the strings from the file, do you perform concatenations, etc.?

In C, strings (which do not exist per se: there is no container for them from a library like the STL) need more work to deal with, but at least we know clearly what happens when dealing with them. In the STL, each convenient operation (meaning one that requires less work from the programmer) may actually require a lot of operations behind the scenes, within the string class, depending on how you use it.

So, while allocations and deallocations are costly, the rest of the logic, especially the string processing, should probably be looked at as well.

I believe the main issue here is that your string version copies things twice: first into the dynamically allocated char[] buffers called name and id, and then into std::strings, while your vector<char *> version probably does not do that. To make the string version faster, you need to read directly into the strings and get rid of all the redundant copies.

Streams take care of a lot of the heavy lifting for you. Stop doing it all yourself, and let the library help you:

void converter::updateNewList(){
    std::ifstream NewList ("NewList.txt");

    int id (0);
    std::string name;

    // get the ID and name; the loop ends at end of file or on a malformed line,
    // so no half-read entry is pushed after the last line
    while(NewList >> id >> name){
        // ignore the rest of the line
        NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');

        ids_.push_back (id);
        names_.push_back(name);
    }

    NewList.close();
}

There's no need to do the whitespace-tokenizing manually.

Also, you may find this site a helpful reference: http://www.cplusplus.com/reference/iostream/ifstream/

You can use a profiler to find out where your code spends most of its time. If you are using gcc, for example, you can compile your program with -pg. When you run it, it saves the profiling results in a file. You can then run gprof on the binary to get human-readable results. Once you know where most of the time is spent, you can post that piece of code for further questions.
