Statistics on a large amount of data using C++, Scilab, Octave, or R

Question

I recently needed to calculate the mean and standard deviation of a large number (about 800,000,000) of doubles. Since a double takes 8 bytes, reading all of them into RAM would take about 6 GB. I think I could use a divide-and-conquer approach in C++ or another high-level language, but that seems tedious. Is there a way to do this all at once in a high-level language such as R, Scilab, or Octave? Thanks.

Answer 1

It sounds like you could use R-Grid or Hadoop to good advantage.

You realize, of course, that it's easy to calculate both the mean and standard deviation without reading all the values into memory. Just keep running totals, as the Java class below does. All you need is the sum, the sum of squares, and the number of points. I track the min and max as well, since they come for free.

This also makes clear how map-reduce would work. You'd create several instances of Statistics and let each one keep the sum, sum of squares, and number of points for its portion of the 800M points. Then the reduce step adds those partial totals together and applies the same formulas to get the final result.

import org.apache.commons.lang3.StringUtils;

import java.util.Collection;

/**
 * Statistics accumulates simple statistics for a given quantity "on the fly" - no array needed.
 * Call reset() to start over; overflow of the sum of squares is not detected automatically.
 * @author mduffy
 * @since 9/19/12 8:16 AM
 */
public class Statistics {
    private String quantityName;
    private int numValues;
    private double x;
    private double xsq;
    private double xmin;
    private double xmax;

    /**
     * Constructor
     */
    public Statistics() {
        this(null);
    }

    /**
     * Constructor
     * @param quantityName to describe the quantity (e.g. "heap size")
     */
    public Statistics(String quantityName) {
        this.quantityName = (StringUtils.isBlank(quantityName) ? "x" : quantityName);
        this.reset();
    }

    /**
     * Reset the object in the event of overflow by the sum of squares
     */
    public synchronized void reset() {
        this.numValues = 0;
        this.x = 0.0;
        this.xsq = 0.0;
        this.xmin = Double.MAX_VALUE;
        this.xmax = -Double.MAX_VALUE;
    }

    /**
     * Add a List of values
     * @param values to add to the statistics
     */
    public synchronized void addAll(Collection<Double> values) {
        for (Double value : values) {
            add(value);
        }
    }

    /**
     * Add an array of values
     * @param values to add to the statistics
     */
    public synchronized void addAll(double [] values) {
        for (double value : values) {
            add(value);
        }
    }

    /**
     * Add a value to current statistics
     * @param value to add for this quantity
     */
    public synchronized void add(double value) {
        double vsq = value*value;
        ++this.numValues;
        this.x += value;
        this.xsq += vsq; // NOTE: overflow is not handled; a double that overflows becomes Infinity (see Double.isInfinite)
        if (value < this.xmin) {
            this.xmin = value;
        }
        if (value > this.xmax) {
            this.xmax = value;
        }
    }

    /**
     * Get the current value of the mean or average
     * @return mean or average if one or more values have been added or zero for no values added
     */
    public synchronized double getMean() {
        double mean = 0.0;
        if (this.numValues > 0) {
            mean = this.x/this.numValues;
        }
        return mean;
    }

    /**
     * Get the current min value
     * @return current min value or Double.MAX_VALUE if no values added
     */
    public synchronized double getMin() {
        return this.xmin;
    }

    /**
     * Get the current max value
     * @return current max value or -Double.MAX_VALUE if no values added
     */
    public synchronized double getMax() {
        return this.xmax;
    }

    /**
     * Get the current standard deviation
     * @return standard deviation for (N-1) dof or zero if one or fewer values added
     */
    public synchronized double getStdDev() {
        double stdDev = 0.0;
        if (this.numValues > 1) {
            stdDev = Math.sqrt((this.xsq-this.x*this.x/this.numValues)/(this.numValues-1));
        }
        return stdDev;
    }

    /**
     * Get the current number of values added
     * @return current number of values added, or zero if none have been added since the last reset
     */
    public synchronized int getNumValues() {
        return this.numValues;
    }

    @Override
    public String toString() {
        final StringBuilder sb = new StringBuilder();
        sb.append("Statistics");
        sb.append("{quantityName='").append(quantityName).append('\'');
        sb.append(", numValues=").append(numValues);
        sb.append(", xmin=").append(xmin);
        sb.append(", mean=").append(this.getMean());
        sb.append(", std dev=").append(this.getStdDev());
        sb.append(", xmax=").append(xmax);
        sb.append('}');
        return sb.toString();
    }
}

And here's a JUnit test to show that it works:

import org.junit.Assert;
import org.junit.Test;

import java.util.Arrays;
import java.util.List;

/**
 * StatisticsTest
 * @author mduffy
 * @since 9/19/12 11:21 AM
 */
public class StatisticsTest {
    public static final double TOLERANCE = 1.0e-4;

    @Test
    public void testAddAll() {
        // The test uses a full array, but it's obvious that you could read them from a file one at a time and process until you're done.
        List<Double> values = Arrays.asList( 2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0 );
        Statistics stats = new Statistics();
        stats.addAll(values);
        Assert.assertEquals(8, stats.getNumValues());
        Assert.assertEquals(2.0, stats.getMin(), TOLERANCE);
        Assert.assertEquals(9.0, stats.getMax(), TOLERANCE);
        Assert.assertEquals(5.0, stats.getMean(), TOLERANCE);
        Assert.assertEquals(2.138089935299395, stats.getStdDev(), TOLERANCE);
    }
}
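
The map-reduce step described in this answer can be sketched in a few lines of Python (the language Answer 2 uses). This is a minimal illustration, not the author's code: the names `partial_stats` and `combine` are made up here, and each small chunk stands in for one worker's share of the 800M values.

```python
import math

def partial_stats(values):
    """Map step: reduce one chunk to (count, sum, sum of squares)."""
    n = 0
    s = 0.0
    sq = 0.0
    for v in values:
        n += 1
        s += v
        sq += v * v
    return n, s, sq

def combine(partials):
    """Reduce step: add the partial triples, then apply the usual formulas."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    sq = sum(p[2] for p in partials)
    mean = s / n if n > 0 else 0.0
    if n > 1:
        # Sample variance with (n - 1) degrees of freedom; clamp at zero
        # because rounding can push the difference slightly negative.
        var = max((sq - s * s / n) / (n - 1), 0.0)
    else:
        var = 0.0
    return n, mean, math.sqrt(var)

# Hypothetical split of the JUnit test data into three chunks
chunks = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
n, mean, std = combine([partial_stats(c) for c in chunks])
print(n, mean, std)
```

Only three numbers travel from each worker to the reducer, so the cost of combining does not depend on how many values each worker processed.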

Answer 2

I'm not claiming this is optimal, but in Python (with the numpy and numexpr modules) the following is easy on a machine with 8 GB of RAM:

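import numpy as np
import numexpr

# 800M uniform random doubles; size must be an integer
x = np.random.uniform(0, 1, size=int(8e8))

# Mean, then population standard deviation as sqrt(E[x^2] - E[x]^2)
print(x.mean(), (numexpr.evaluate('sum(x*x)')/len(x) -
                 (numexpr.evaluate('sum(x)')/len(x))**2)**.5)
# Output reported in the original post: 0.499991593345 0.288682001731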

This doesn't consume more memory than the original array.
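
If the values sit in a raw binary file and do not fit in RAM at all, the same statistics can be computed chunk by chunk. The sketch below is an illustration under stated assumptions, not part of this answer: it uses only the standard library, assumes the file holds native-endian 8-byte doubles, and replaces the sum-of-squares formula with a pairwise merge of (count, mean, sum of squared deviations), a different technique that is less prone to cancellation. The function name `stream_mean_std` and the temporary demo file are hypothetical.

```python
import array
import math
import os
import tempfile

def stream_mean_std(path, chunk_items=1_000_000):
    """Return (count, mean, sample std), holding at most chunk_items doubles in memory."""
    n, mean, m2 = 0, 0.0, 0.0
    with open(path, "rb") as f:
        while True:
            raw = f.read(8 * chunk_items)  # assumes 8 bytes per double
            if not raw:
                break
            chunk = array.array("d")
            chunk.frombytes(raw)
            # Statistics of this chunk alone
            cn = len(chunk)
            cmean = math.fsum(chunk) / cn
            cm2 = math.fsum((v - cmean) ** 2 for v in chunk)
            # Merge the chunk into the running totals
            delta = cmean - mean
            total = n + cn
            mean += delta * cn / total
            m2 += cm2 + delta * delta * n * cn / total
            n = total
    std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
    return n, mean, std

# Small demo: write eight doubles to a temporary file, then read them back 3 at a time
data = array.array("d", [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
with tempfile.NamedTemporaryFile(suffix=".bin", delete=False) as tmp:
    tmp.write(data.tobytes())
    demo_path = tmp.name
count, mean, std = stream_mean_std(demo_path, chunk_items=3)
os.remove(demo_path)
print(count, mean, std)
```

Unlike the snippet above, this returns the sample standard deviation (dividing by n - 1), matching the Java class in Answer 1. Peak memory is set by `chunk_items` rather than by the file size.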

Answer 3

This looks like a nice challenge. Couldn't you build something similar with a tweaked mergesort? Just an idea. It also looks like dynamic programming, so you could use multiple PCs to make things faster.

3 answers

Answer 1  1  2012-10-03 16:31:22

Answer 2  1  2012-10-03 17:23:03

Answer 3  0  2012-10-03 16:30:41
