
Calculate confidence interval over the mean value for all rows of a dataframe in Spark / Scala

I need to calculate the confidence interval over the mean of the value3 column, together with its upper and lower bounds, and attach the results to every row of my dataframe. Here is my dataframe:

+--------+---------+------+
|  value1| value2  |value3|
+--------+---------+------+
|   a    |  2      |   3  |
+--------+---------+------+
|   b    |  5      |   4  |
+--------+---------+------+
|   b    |  5      |   4  |
+--------+---------+------+
|   c    |  3      |   4  |
+--------+---------+------+ 

So my output should be something like below, where x is the result of the calculation:

    +--------+---------+------+-------+--------+----------+
    |  value1| value2  |value3|max_int|min_int |    int   |
    +--------+---------+------+-------+--------+----------+
    |   a    |  2      |   3  |   x   |   x    |     x    |
    +--------+---------+------+-------+--------+----------+
    |   b    |  5      |   4  |   x   |   x    |     x    |
    +--------+---------+------+-------+--------+----------+
    |   b    |  5      |   4  |   x   |   x    |     x    |
    +--------+---------+------+-------+--------+----------+
    |   c    |  3      |   4  |   x   |   x    |     x    |
    +--------+---------+------+-------+--------+----------+

Since I couldn't find a built-in function for this, I found the following code to calculate it:

    import org.apache.commons.math3.distribution.TDistribution
    import org.apache.commons.math3.exception.MathIllegalArgumentException
    import org.apache.commons.math3.stat.descriptive.SummaryStatistics

    object ConfidenceIntervalApp {

      def main(args: Array[String]): Unit = {

        // my dataframe name is df (assumed to be defined elsewhere)
        // Collect the value3 column into SummaryStatistics
        val stats = new SummaryStatistics()
        df.select("value3").collect().foreach(row => stats.addValue(row.getInt(0)))

        // Calculate the 95% confidence interval (half-width)
        val ci: Double = calcMeanCI(stats, 0.95)
        println(f"Mean: ${stats.getMean}%f")
        val lower: Double = stats.getMean - ci
        val upper: Double = stats.getMean + ci
      }

      def calcMeanCI(stats: SummaryStatistics, level: Double): Double =
        try {
          // Create a t-distribution with N-1 degrees of freedom
          val tDist: TDistribution = new TDistribution(stats.getN - 1)
          // Critical value for a two-sided interval at the given level
          val critVal: Double =
            tDist.inverseCumulativeProbability(1.0 - (1 - level) / 2)
          // Half-width of the confidence interval
          critVal * stats.getStandardDeviation / math.sqrt(stats.getN)
        } catch {
          case _: MathIllegalArgumentException => Double.NaN
        }
    }
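For reference, applying this formula to my sample data (value3 = 3, 4, 4, 4) gives mean 3.75, sample standard deviation 0.5 and n = 4; with the two-sided 95% critical value t ≈ 3.182 (3 degrees of freedom), the half-width is 3.182 × 0.5 / √4 ≈ 0.80, so the x values should come out to roughly int = 0.80, min_int = 2.95 and max_int = 4.55.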

Could you help, or at least guide me on how to apply this to the columns? Thanks in advance.


You can do something like this:

    import org.apache.spark.sql.functions.lit

    val cntInterval = df.select("value3").rdd.countApprox(timeout = 1000L, confidence = 0.95)
    val (lowCnt, highCnt) = (cntInterval.getFinalValue().low, cntInterval.getFinalValue().high)

    df.withColumn("max_int", lit(highCnt))
      .withColumn("min_int", lit(lowCnt))
      .withColumn("int", lit(cntInterval.getFinalValue().toString()))
      .show(false)

I took help from: In spark, how to estimate the number of elements in a dataframe quickly
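Note that countApprox bounds the approximate row count, not the mean. If you want the t-based interval over the mean of value3 itself, a minimal sketch (assuming df from the question is in scope and commons-math3 is on the classpath) could combine a single Spark aggregation with the same logic as calcMeanCI in the question:

    import org.apache.commons.math3.distribution.TDistribution
    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions._

    // Compute the global mean / stddev / count in one Spark job
    val Row(mean: Double, sd: Double, n: Long) =
      df.select(avg("value3"), stddev("value3"), count("value3")).head

    // Two-sided 95% critical value from a t-distribution with n-1 degrees of freedom
    val critVal = new TDistribution(n - 1.0)
      .inverseCumulativeProbability(1.0 - (1 - 0.95) / 2)
    val ci = critVal * sd / math.sqrt(n.toDouble)

    // The interval is a single global value, so lit() attaches it to every row
    df.withColumn("int", lit(ci))
      .withColumn("min_int", lit(mean - ci))
      .withColumn("max_int", lit(mean + ci))
      .show(false)

This keeps the heavy lifting in Spark and only brings three scalars back to the driver, which also avoids collecting the whole column into SummaryStatistics as in the question.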
