Changing a vector into an array makes my program slower

I profiled a program of mine and found that the main hotspot was levenshtein_distance, which is called recursively. I decided to try to optimize it.

lvh_distance levenshtein_distance( const std::string & s1, const std::string & s2 )
{
    const size_t len1 = s1.size(), len2 = s2.size();
    std::vector<unsigned int> col( len2+1 ), prevCol( len2+1 );

    const size_t prevColSize = prevCol.size();
    for( unsigned int i = 0; i < prevColSize; i++ )
        prevCol[i] = i;

    for( unsigned int i = 0, j; i < len1; ++i )
    {
        col[0] = i+1;
        const char s1i = s1[i];
        for( j = 0; j < len2; ++j )
        {
            const auto minPrev = 1 + std::min( col[j], prevCol[1 + j] );
            col[j+1] = std::min( minPrev, prevCol[j] + ( static_cast<unsigned int>( s1i != s2[j] ) ) );
        }
        col.swap( prevCol );
    }
    return prevCol[len2];
}

TL;DR: I changed std::vector -> std::array

War Story: After running vtune on it, I found that the line that updates col[j+1] was the one slowing everything down (90% of the time was spent on it). I thought: OK, maybe this is an aliasing problem. Maybe the compiler cannot determine that the character arrays within the string objects are unaliased, because they are hidden behind the string interface, and so it spends 90% of its time checking that no other part of the program has modified them.

So I changed my vectors into static arrays, because then there is no dynamic memory involved, and the next step would have been to use restrict to help the compiler. In the meantime, I decided to check whether I had gained any performance by doing so.

lvh_distance levenshtein_distance( const std::string & s1, const std::string & s2 )
{
    const size_t len1 = s1.size(), len2 = s2.size();
    static constexpr unsigned MAX_STRING_SIZE = 512;
    assert(len1 < MAX_STRING_SIZE && len2 < MAX_STRING_SIZE);
    static std::array<unsigned int, MAX_STRING_SIZE> col, prevCol;

    for( unsigned int i = 0; i < len2+1; ++i )
        prevCol[i] = i;

    // the rest is unchanged
}

TL;DR: now it runs slowly.

What happened is that I lost performance. A lot. Instead of running in about 6 seconds, my sample program now runs in 44 seconds. Profiling again with vtune shows that one function is called over and over again: std::swap (for you gcc folks, this is in bits/move.h), which is in turn called by std::swap_ranges (bits/stl_algobase.h).

I suppose that std::min is implemented using quicksort, which would explain why there is swapping going on, but I don't understand why the swapping takes so much time in that case.

EDIT: Compiler options: I am using gcc with the options "-O2 -g -DNDEBUG" and a bunch of warning specifiers.

As an experiment, I ran a largely unmodified version of your original code with a pair of short strings and got timings of ~36s for the array version and ~8s for the vector version.

Your version seems to depend very much on the choice of MAX_STRING_SIZE. When I used 50 instead of 512 (which only just fitted my strings), the timing for the array version went down to about 16s.

I then translated your main loop by hand, as shown below, to get rid of the explicit swap. This further reduced the time of the array version to 11s and, more interestingly, made the array version's timing independent of the choice of MAX_STRING_SIZE. When I put it back to 512, the array version still took 11s.

This is good evidence that the explicit swap of the arrays is where the bulk of the performance problem with your version was.

There is still a significant difference between the array and the vector versions, with the array version taking around 40% longer. I haven't had a chance to investigate exactly why this might be.

for( unsigned int i = 0, j; i < len1; ++i )
{
    {
        col[0] = i+1;
        const char s1i = s1[i];
        for( j = 0; j < len2; ++j )
        {
            const auto minPrev = 1 + std::min( col[j], prevCol[1 + j] );
            col[j+1] = std::min( minPrev, prevCol[j] + ( static_cast<unsigned int>( s1i != s2[j] ) ) );
        }
    }

    if (!(++i < len1))
        return col[len2];

    {
        prevCol[0] = i+1;
        const char s1i = s1[i];
        for( j = 0; j < len2; ++j )
        {
            const auto minPrev = 1 + std::min( prevCol[j], col[1 + j] );
            prevCol[j+1] = std::min( minPrev, col[j] + ( static_cast<unsigned int>( s1i != s2[j] ) ) );
        }
    }
}
return prevCol[len2];
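An alternative to unrolling the loop by hand is to leave both arrays where they are and swap two pointers to them instead. The following is only a sketch of that idea, not something that was timed for this answer: the function name levenshtein_distance_ptr_swap is made up for illustration, the buffers are local rather than static, and only len2 is checked against the buffer size because only s2 indexes the columns.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical variant: same two-row algorithm as in the question, but the
// rows are exchanged by swapping pointers, which costs the same regardless
// of MAX_STRING_SIZE.
unsigned int levenshtein_distance_ptr_swap(const std::string& s1, const std::string& s2)
{
    constexpr std::size_t MAX_STRING_SIZE = 512;
    const std::size_t len1 = s1.size(), len2 = s2.size();
    assert(len2 < MAX_STRING_SIZE); // columns are indexed 0..len2

    // Assumption: local buffers instead of the static ones in the question.
    std::array<unsigned int, MAX_STRING_SIZE> bufA, bufB;
    unsigned int* col = bufA.data();
    unsigned int* prevCol = bufB.data();

    for (std::size_t j = 0; j <= len2; ++j)
        prevCol[j] = static_cast<unsigned int>(j);

    for (std::size_t i = 0; i < len1; ++i)
    {
        col[0] = static_cast<unsigned int>(i + 1);
        const char s1i = s1[i];
        for (std::size_t j = 0; j < len2; ++j)
        {
            const unsigned int minPrev = 1 + std::min(col[j], prevCol[j + 1]);
            col[j + 1] = std::min(minPrev, prevCol[j] + (s1i != s2[j] ? 1u : 0u));
        }
        std::swap(col, prevCol); // exchanges two pointers, not 512 elements
    }
    return prevCol[len2];
}
```

Calling levenshtein_distance_ptr_swap("kitten", "sitting") should give the same result as the vector version; whether it closes the remaining gap to the vector timing would have to be measured.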

First off: @DanielFischer has, in all probability, already pointed out what caused your performance degradation: swapping two std::array objects is a linear-time operation, while swapping two std::vector objects is a constant-time operation. While some compilers may be able to optimize this away, your gcc seems unable to do so.
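The difference between the two swaps can be observed directly. The small sketch below uses two helper functions whose names are made up for illustration: a vector swap exchanges the heap buffers the two vectors point to, whereas an array swap leaves the storage in place and exchanges the contents element by element.

```cpp
#include <array>
#include <vector>

// Hypothetical helper: true if swapping two vectors merely exchanged their
// heap buffers, i.e. no element was copied.
bool vector_swap_exchanges_buffers()
{
    std::vector<unsigned> a(512, 1u), b(512, 2u);
    const unsigned* const a_data = a.data();
    const unsigned* const b_data = b.data();
    a.swap(b);
    return a.data() == b_data && b.data() == a_data;
}

// Hypothetical helper: true if swapping two arrays kept the storage where it
// was and exchanged the element values instead.
bool array_swap_moves_elements()
{
    std::array<unsigned, 512> a, b;
    a.fill(1u);
    b.fill(2u);
    const unsigned* const a_data = a.data();
    a.swap(b);
    return a.data() == a_data && a[0] == 2u && b[0] == 1u;
}
```

With a 512-element array, every call to swap therefore touches all 512 elements of both arrays, even when the strings being compared are short.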

Also important: using a static array like you did here makes your code inherently not thread-safe. It is usually not a good idea.
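If the goal of the static buffers was to avoid allocating on every call, one possible way to keep that without sharing state between threads is thread_local storage (C++11). The following is a rough sketch under that assumption: the name levenshtein_distance_tls is hypothetical, and it has not been benchmarked against the versions discussed here.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical variant: each thread gets its own pair of buffers, which are
// reused across calls and only grow when a longer s2 is seen.
unsigned int levenshtein_distance_tls(const std::string& s1, const std::string& s2)
{
    const std::size_t len1 = s1.size(), len2 = s2.size();

    thread_local std::vector<unsigned int> col, prevCol;
    col.resize(len2 + 1);
    prevCol.resize(len2 + 1);

    for (std::size_t j = 0; j <= len2; ++j)
        prevCol[j] = static_cast<unsigned int>(j);

    for (std::size_t i = 0; i < len1; ++i)
    {
        col[0] = static_cast<unsigned int>(i + 1);
        const char s1i = s1[i];
        for (std::size_t j = 0; j < len2; ++j)
        {
            const unsigned int minPrev = 1 + std::min(col[j], prevCol[j + 1]);
            col[j + 1] = std::min(minPrev, prevCol[j] + (s1i != s2[j] ? 1u : 0u));
        }
        col.swap(prevCol); // constant time: the vectors exchange their buffers
    }
    return prevCol[len2];
}
```

Because the buffers are vectors, the swap stays constant-time, and because they are thread_local, concurrent calls from different threads do not overwrite each other's rows.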

Removing one of the arrays (or vectors) together with the associated swap, and using a dynamically allocated C array instead, is actually pretty easy and results in superior performance (at least on my setup).
A few more transformations (like consistently using size_t) result in the following function:

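#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>

unsigned int levenshtein_distance3( const std::string & s1, const std::string & s2 )
{
    const size_t len1 = s1.size(), len2 = s2.size();
    // Single row of the distance table; no second array and no swap needed.
    ::std::unique_ptr<size_t[]> col(new size_t[len2 + 1]);

    for(size_t i = 0; i < len2+1; ++i )
        col[i] = i;

    for(size_t i = 0; i < len1; ++i )
    {
        // lastc holds the value col[j] had in the previous row (the diagonal).
        size_t lastc = col[0];
        col[0] = i+1;
        const char s1i = s1[i];
        for(size_t j = 0; j < len2; ++j )
        {
            // col[j] is already updated for this row; col[j + 1] is still the previous row.
            const auto minPrev = 1 + (::std::min)(col[j], col[j + 1]);
            const auto newc = (::std::min)(minPrev, lastc + (s1i != s2[j] ? 1 : 0));
            lastc = col[j+1];
            col[j + 1] = newc;
        }
    }
    return static_cast<unsigned int>(col[len2]);
}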
