
Weighted Variance and Weighted Standard Deviation in C++

Hi, I'm trying to calculate the weighted variance and weighted standard deviation of a series of ints or floats. I found these links:

http://math.tutorvista.com/statistics/standard-deviation.html#weighted-standard-deviation

http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weightsd.pdf (warning: PDF)

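Here are my template functions so far. Variance and standard deviation work fine, but for the life of me I can't get the weighted versions to match the test case at the bottom of the PDF: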

template <class T>
inline float    Mean( T samples[], int count )
{
    float   mean = 0.0f;

    if( count >= 1 )
    {
        for( int i = 0; i < count; i++ )
            mean += samples[i];

        mean /= (float) count;
    }

    return mean;
}

template <class T>
inline float    Variance( T samples[], int count )
{
    float   variance = 0.0f;

    if( count > 1 )
    {
        float   mean = 0.0f;

        for( int i = 0; i < count; i++ )
            mean += samples[i];

        mean /= (float) count;

        for( int i = 0; i < count; i++ )
        {
            float   sum = (float) samples[i] - mean;

            variance += sum*sum;
        }

        variance /= (float) count - 1.0f;
    }

    return variance;
}

template <class T>
inline float    StdDev( T samples[], int count )
{
    return sqrtf( Variance( samples, count ) );
}

template <class T>
inline float    VarianceWeighted( T samples[], T weights[], int count )
{
    float   varianceWeighted = 0.0f;

    if( count > 1 )
    {
        float   sumWeights = 0.0f, meanWeighted = 0.0f;
        int     numNonzero = 0;

        for( int i = 0; i < count; i++ )
        {
            meanWeighted += samples[i]*weights[i];
            sumWeights += weights[i];

            if( ((float) weights[i]) != 0.0f ) numNonzero++;
        }

        if( sumWeights != 0.0f && numNonzero > 1 )
        {
            meanWeighted /= sumWeights;

            for( int i = 0; i < count; i++ )
            {
                float   sum = samples[i] - meanWeighted;

                varianceWeighted += weights[i]*sum*sum;
            }

            varianceWeighted *= ((float) numNonzero)/((float) count*(numNonzero - 1.0f)*sumWeights);    // this should be right but isn't?!
        }
    }

    return varianceWeighted;
}

template <class T>
inline float    StdDevWeighted( T samples[], T weights[], int count )
{
    return sqrtf( VarianceWeighted( samples, weights, count ) );
}

Test case:

int     samples[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23 };

printf( "%.2f\n", StdDev( samples, 9 ) );

int     weights[] = { 1, 1, 0, 0, 4, 1, 2, 1, 0 };

printf( "%.2f\n", StdDevWeighted( samples, weights, 9 ) );

Result:

7.46
1.94

Expected:

7.46
5.82

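I think the problem is that weighted variance has several different interpretations, and I don't know which one is standard. I found this variation: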

http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Weighted_incremental_algorithm

template <class T>
inline float    VarianceWeighted( T samples[], T weights[], int count )
{
    float   varianceWeighted = 0.0f;

    if( count > 1 )
    {
        float   sumWeights = 0.0f, meanWeighted = 0.0f, m2 = 0.0f;

        for( int i = 0; i < count; i++ )
        {
            float   temp = weights[i] + sumWeights,
                    delta = samples[i] - meanWeighted,
                    r = delta*weights[i]/temp;

            meanWeighted += r;
            m2 += sumWeights*delta*r;   // Alternatively, m2 += weights[i] * delta * (samples[i] - meanWeighted)
            sumWeights = temp;
        }

        varianceWeighted = (m2/sumWeights)*((float) count/(count - 1));
    }

    return varianceWeighted;
}

Result:

7.46
5.64

I also tried looking at boost and esutil, but they weren't much help:

http://www.boost.org/doc/libs/1_48_0/boost/accumulators/statistics/weighted_variance.hpp http://esutil.googlecode.com/svn-history/r269/trunk/esutil/stat/util.py

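I don't need an entire statistics library, and more importantly, I want to understand the implementation.

Can someone please post functions that calculate these correctly?

Bonus points if your functions can do it in a single pass.

Also, does anyone know whether weighted variance gives the same result as ordinary variance with repeated values? For example, would the variance of samples[] = { 1, 2, 3, 3 } be the same as the weighted variance of samples[] = { 1, 2, 3 }, weights[] = { 1, 1, 2 }?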

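Update: here is a Google Docs spreadsheet I set up to explore the problem. Unfortunately, my answers are nowhere near the NIST PDF. I think the problem is in the unbias step, but I can't see how to fix it.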

https://docs.google.com/spreadsheet/ccc?key=0ApzPh5nRin0ldGNNYjhCUTlWTks2TGJrZW4wQUcyZnc&usp=sharing

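The result is a weighted variance of 3.77, which is the square of the weighted standard deviation of 1.94 that I got in my C++ code.

I'm trying to install Octave on my Mac OS X setup so that I can run its var() function with weights, but it is taking forever to install with brew. I am deep into yak shaving now.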

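#include <stdint.h>
#include <stdio.h>

float mean(uint16_t* x, uint16_t n) {
    uint32_t sum_xi = 0; /* wider than the samples, so the running sum is less likely to overflow */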
    int i;
    for (i = 0; i < n; i++) {
        sum_xi += x[i];
    }
    return (float) sum_xi / n;
}

/**
 * http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weigmean.pdf
 */
float weighted_mean(uint16_t* x, uint16_t* w, uint16_t n) {
    int sum_wixi = 0;
    int sum_wi = 0;
    int i;
    for (i = 0; i < n; i++) {
        sum_wixi += w[i] * x[i];
        sum_wi += w[i];
    }
    return (float) sum_wixi / (float) sum_wi;
}

float variance(uint16_t* x, uint16_t n) {
    float mean_x = mean(x, n);
    float dist, dist2;
    float sum_dist2 = 0;

    int i;
    for (i = 0; i < n; i++) {
        dist = x[i] - mean_x;
        dist2 = dist * dist;
        sum_dist2 += dist2;
    }

    return sum_dist2 / (n - 1);
}

/**
 * http://www.itl.nist.gov/div898/software/dataplot/refman2/ch2/weighvar.pdf
 */
float weighted_variance(uint16_t* x, uint16_t* w, uint16_t n) {
    float xw = weighted_mean(x, w, n);
    float dist, dist2;
    float sum_wi_times_dist2 = 0;
    int sum_wi = 0;
    int n_prime = 0;

    int i;
    for (i = 0; i < n; i++) {
        dist = x[i] - xw;
        dist2 = dist * dist;
        sum_wi_times_dist2 += w[i] * dist2;
        sum_wi += w[i];

        if (w[i] > 0)
            n_prime++;
    }

    if (n_prime > 1) {
        return sum_wi_times_dist2 / ((float) ((n_prime - 1) * sum_wi) / n_prime);
    } else {
        return 0.0f;
    }
}

/**
 * http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Weighted_incremental_algorithm
 */
float weighted_incremental_variance(uint16_t* x, uint16_t* w, uint16_t n) {
    uint16_t sumweight = 0;
    float mean = 0;
    float M2 = 0;
    int n_prime = 0;

    uint16_t temp;
    float delta;
    float R;

    int i;
    for (i = 0; i < n; i++) {
        if (w[i] == 0)
            continue;

        temp = w[i] + sumweight;
        delta = x[i] - mean;
        R = delta * w[i] / temp;
        mean += R;
        M2 += sumweight * delta * R;
        sumweight = temp;

        n_prime++;
    }

    if (n_prime > 1) {
        float variance_n = M2 / sumweight;
        return variance_n * n_prime / (n_prime - 1);
    } else {
        return 0.0f;
    }
}

int main(void) {
    uint16_t n = 9;
    uint16_t x[] = { 2, 3, 5, 7, 11, 13, 17, 19, 23 };
    uint16_t w[] = { 1, 1, 0, 0,  4,  1,  2,  1,  0 };

    printf("%f\n", weighted_variance(x, w, n)); /* outputs: 33.900002 */
    printf("%f\n", weighted_incremental_variance(x, w, n)); /* outputs: 33.900005 */
    return 0;
}

You accidentally added an extra "count" term to the denominator of the variance update.

With the correction below I get your expected answer:

5.82

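FYI, one way to catch this kind of mistake during code review is "dimensional analysis". The "units" of the equation are wrong: you were effectively dividing by an order-N-squared term when it should have been an order-N term.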
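To make the dimensional check concrete, here is the weighted variance as I read it from the NIST reference linked in the question, with N' denoting the number of non-zero weights (the symbol names are my own choice):

```latex
s_w^2 = \frac{\sum_{i=1}^{N} w_i\,(x_i - \bar{x}_w)^2}
             {\dfrac{N' - 1}{N'}\,\sum_{i=1}^{N} w_i},
\qquad
\bar{x}_w = \frac{\sum_{i=1}^{N} w_i\,x_i}{\sum_{i=1}^{N} w_i}
```

The factor (N' - 1)/N' is of order 1, so the sum of weights is the only order-N term in the denominator. For the test data the weighted mean is 11.5, the numerator is 282.5, N' = 6 and the weights sum to 10, so the variance is 282.5 / ((5/6) * 10) = 33.9 and the standard deviation is about 5.82. Dividing by an extra count = 9 gives about 3.77 and 1.94, which are the values reported in the question.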

Before:

template <class T>
inline float    VarianceWeighted( T samples[], T weights[], int count )
{
    ...
            varianceWeighted *= ((float) numNonzero)/((float) count*(numNonzero - 1.0f)*sumWeights);    // this should be right but isn't?!
    ...
}

After (with "count" removed, the line becomes):

template <class T>
inline float    VarianceWeighted( T samples[], T weights[], int count )
{
    ...
            varianceWeighted *= ((float) numNonzero)/((float) (numNonzero - 1.0f)*sumWeights);  // removed count term
    ...
}
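For reference, the corrected two-pass computation could be written out in full roughly as follows. This is only a sketch: it assumes double precision and std::vector inputs rather than the question's templated arrays, and the names weighted_variance_nist and weighted_stddev_nist are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Weighted sample variance in the NIST Dataplot form:
//   sum(w_i * (x_i - mean_w)^2) / ((N' - 1) / N' * sum(w_i))
// where N' is the number of non-zero weights.
// Returns 0 when fewer than two samples carry weight.
inline double weighted_variance_nist(const std::vector<double>& samples,
                                     const std::vector<double>& weights)
{
    // Only use the indices present in both vectors.
    const std::size_t count =
        samples.size() < weights.size() ? samples.size() : weights.size();

    // First pass: weighted mean, sum of weights, number of non-zero weights.
    double sumWeights = 0.0, meanWeighted = 0.0;
    int numNonzero = 0;
    for (std::size_t i = 0; i < count; ++i) {
        meanWeighted += samples[i] * weights[i];
        sumWeights += weights[i];
        if (weights[i] != 0.0) ++numNonzero;
    }
    if (sumWeights == 0.0 || numNonzero < 2) return 0.0;
    meanWeighted /= sumWeights;

    // Second pass: weighted sum of squared deviations from the weighted mean.
    double sumSquares = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        const double d = samples[i] - meanWeighted;
        sumSquares += weights[i] * d * d;
    }

    // Note: no extra "count" factor in the denominator.
    return sumSquares * numNonzero / ((numNonzero - 1.0) * sumWeights);
}

inline double weighted_stddev_nist(const std::vector<double>& samples,
                                   const std::vector<double>& weights)
{
    return std::sqrt(weighted_variance_nist(samples, weights));
}
```

With the question's samples and weights this yields a variance of about 33.9 and a standard deviation of about 5.82.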

Here is a shorter version using Boost.Accumulators, with a demo:

 #include <cmath>
 #include <iostream>
 #include <vector>
 #include <boost/accumulators/accumulators.hpp>
 #include <boost/accumulators/statistics/stats.hpp>
 #include <boost/accumulators/statistics/weighted_variance.hpp>
 #include <boost/accumulators/statistics/variance.hpp>

 namespace ba = boost::accumulators;

 int main() {
     std::vector<double> numbers{2, 3, 5, 7, 11, 13, 17, 19, 23};
     std::vector<double> weights{1, 1, 0, 0,  4,  1,  2,  1, 0 };

     ba::accumulator_set<double, ba::stats<ba::tag::variance          >          > acc;
     ba::accumulator_set<double, ba::stats<ba::tag::weighted_variance > , double > acc_weighted;

     double n = numbers.size();
     double N = n;

     for(size_t i = 0 ; i<numbers.size() ; i++ ) {
         acc         ( numbers[i] );
         acc_weighted( numbers[i] ,   ba::weight = weights[i] );
         if(weights[i] == 0) {
             n=n-1;
         }
     }

     std::cout << "Sample Standard Deviation, s: "          << std::sqrt(ba::variance(acc)                  *N/(N-1))        << std::endl;
     std::cout << "Weighted Sample Standard Deviation, s: " << std::sqrt(ba::weighted_variance(acc_weighted)*n/(n-1))        << std::endl;
 }

Note that n must reflect the number of samples with non-zero weights, hence the extra n=n-1; line.

Sample Standard Deviation, s: 7.45729
Weighted Sample Standard Deviation, s: 5.82237
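On the question's side point about repeated values: as far as I can tell, the answer depends on which normalisation is used. If the weights are treated as integer repeat counts and the sum of squares is divided by sum(w) - 1, the result matches the ordinary sample variance of the expanded data. The NIST-style correction N'/(N' - 1) counts distinct non-zero weights instead, so it generally gives a different number. A minimal sketch to check this, with hypothetical function names:

```cpp
#include <cstddef>
#include <vector>

// Ordinary unbiased sample variance: sum((x - mean)^2) / (n - 1).
inline double sample_variance(const std::vector<double>& x)
{
    if (x.size() < 2) return 0.0;
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= static_cast<double>(x.size());
    double ss = 0.0;
    for (double v : x) ss += (v - mean) * (v - mean);
    return ss / (static_cast<double>(x.size()) - 1.0);
}

// Quantities shared by both weighted normalisations.
struct WeightedSums {
    double sumWeights;  // sum of w_i
    double sumSquares;  // sum of w_i * (x_i - weighted mean)^2
    int numNonzero;     // number of non-zero weights (N')
};

// Assumes x and w have the same length.
inline WeightedSums weighted_sums(const std::vector<double>& x,
                                  const std::vector<double>& w)
{
    WeightedSums s{0.0, 0.0, 0};
    double mean = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mean += w[i] * x[i];
        s.sumWeights += w[i];
        if (w[i] != 0.0) ++s.numNonzero;
    }
    if (s.sumWeights == 0.0) return s;
    mean /= s.sumWeights;
    for (std::size_t i = 0; i < x.size(); ++i)
        s.sumSquares += w[i] * (x[i] - mean) * (x[i] - mean);
    return s;
}

// Weights as repeat counts: divide by (sum(w) - 1).
inline double variance_frequency_weights(const std::vector<double>& x,
                                         const std::vector<double>& w)
{
    const WeightedSums s = weighted_sums(x, w);
    return s.sumWeights > 1.0 ? s.sumSquares / (s.sumWeights - 1.0) : 0.0;
}

// NIST-style: divide by ((N' - 1) / N') * sum(w).
inline double variance_nist_style(const std::vector<double>& x,
                                  const std::vector<double>& w)
{
    const WeightedSums s = weighted_sums(x, w);
    if (s.numNonzero < 2 || s.sumWeights == 0.0) return 0.0;
    return s.sumSquares * s.numNonzero / ((s.numNonzero - 1.0) * s.sumWeights);
}
```

For samples { 1, 2, 3, 3 } the ordinary variance is 2.75 / 3, about 0.917. For samples { 1, 2, 3 } with weights { 1, 1, 2 }, the frequency-weight version gives the same 0.917, while the NIST-style version gives 2.75 * 3 / (2 * 4) = 1.03125.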
