I need to calculate a confidence interval for the mean of the value3 column, together with its upper and lower bounds, and add the results to every row of my dataframe. Here is my dataframe:
+--------+---------+------+
| value1| value2 |value3|
+--------+---------+------+
| a | 2 | 3 |
+--------+---------+------+
| b | 5 | 4 |
+--------+---------+------+
| b | 5 | 4 |
+--------+---------+------+
| c | 3 | 4 |
+--------+---------+------+
So my output should be something like below (x is the result of calculation):
+--------+---------+------+-------+--------+----------+
| value1 | value2  |value3|max_int|min_int |   int    |
+--------+---------+------+-------+--------+----------+
| a | 2 | 3 | x | x | x |
+--------+---------+------+-------+--------+----------+
| b | 5 | 4 | x | x | x |
+--------+---------+------+-------+--------+----------+
| b | 5 | 4 | x | x | x |
+--------+---------+------+-------+--------+----------+
| c | 3 | 4 | x | x | x |
+--------+---------+------+-------+--------+----------+
Since I couldn't find a built-in function for this, I found the following code to calculate it:
import org.apache.commons.math3.distribution.TDistribution
import org.apache.commons.math3.exception.MathIllegalArgumentException
import org.apache.commons.math3.stat.descriptive.SummaryStatistics

object ConfidenceIntervalApp {
  def main(args: Array[String]): Unit = {
    // My dataframe is named df; here stats is filled with the value3
    // values from the table above instead of reading them from df
    val stats = new SummaryStatistics()
    Seq(3.0, 4.0, 4.0, 4.0).foreach(v => stats.addValue(v))

    // Calculate 95% confidence interval
    val ci: Double = calcMeanCI(stats, 0.95)
    println(f"Mean: ${stats.getMean}%f")
    val lower: Double = stats.getMean - ci
    val upper: Double = stats.getMean + ci
    println(f"Lower: $lower%f, upper: $upper%f")
  }

  // Returns the half-width of the confidence interval, or NaN if it cannot be computed
  def calcMeanCI(stats: SummaryStatistics, level: Double): Double =
    try {
      // Create T Distribution with N-1 degrees of freedom
      val tDist: TDistribution = new TDistribution(stats.getN - 1)
      // Calculate critical value
      val critVal: Double =
        tDist.inverseCumulativeProbability(1.0 - (1 - level) / 2)
      // Calculate confidence interval
      critVal * stats.getStandardDeviation / Math.sqrt(stats.getN)
    } catch {
      case e: MathIllegalArgumentException => Double.NaN
    }
}
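For reference, the value that calcMeanCI returns is the half-width of a two-sided Student's t interval for the mean. Written out, with n standing for stats.getN, s for stats.getStandardDeviation (the sample standard deviation), x-bar for stats.getMean and alpha = 1 - level (these symbols are introduced here for illustration and do not appear in the code):

```latex
\[
h = t_{1-\alpha/2,\; n-1}\,\frac{s}{\sqrt{n}},
\qquad
\text{lower} = \bar{x} - h,
\qquad
\text{upper} = \bar{x} + h
\]
```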
Could you help, or at least guide me on how to apply it to the columns? Thanks in advance.
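You can do something like the following. Note that countApprox returns an approximate row count with a confidence bound, not a confidence interval for the mean of value3: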
import org.apache.spark.sql.functions.lit

// countApprox returns a PartialResult[BoundedDouble] for the row count
val cntInterval = df.select("value3").rdd.countApprox(timeout = 1000L, confidence = 0.95)
val bounds = cntInterval.getFinalValue()
df.withColumn("max_int", lit(bounds.high))
  .withColumn("min_int", lit(bounds.low))
  .withColumn("int", lit(bounds.toString))
  .show(false)
This is adapted from the question "In spark, how to estimate the number of elements in a dataframe quickly".
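If the goal is an interval for the mean of value3 rather than for the row count, one possible approach is to let Spark compute the count, mean and sample standard deviation, and then reuse the t-distribution formula from the question. The following is only a sketch under some assumptions: the object name MeanConfidenceInterval and the helpers halfWidth and withMeanCI are hypothetical names made up for this example, the "int" column is assumed to hold the half-width of the interval, and commons-math3 is assumed to be on the classpath alongside Spark.

```scala
import org.apache.commons.math3.distribution.TDistribution
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, count, lit, stddev_samp}

object MeanConfidenceInterval {

  // Half-width of the two-sided t interval for the mean.
  // Returns NaN when fewer than two observations are available.
  def halfWidth(n: Long, sd: Double, level: Double): Double =
    if (n < 2) Double.NaN
    else {
      val tDist = new TDistribution(n - 1.0) // n-1 degrees of freedom
      val critVal = tDist.inverseCumulativeProbability(1.0 - (1.0 - level) / 2)
      critVal * sd / math.sqrt(n.toDouble)
    }

  // Hypothetical helper: adds max_int, min_int and int as constant columns,
  // computed once over the whole DataFrame for the given numeric column.
  def withMeanCI(df: DataFrame, column: String, level: Double): DataFrame = {
    val row = df.agg(count(column), avg(column), stddev_samp(column)).head()
    val n = row.getLong(0)
    // avg and stddev_samp can be null (empty input, single row), so guard both
    val mean = if (row.isNullAt(1)) Double.NaN else row.getDouble(1)
    val sd = if (row.isNullAt(2)) Double.NaN else row.getDouble(2)
    val hw = halfWidth(n, sd, level)
    df.withColumn("max_int", lit(mean + hw))
      .withColumn("min_int", lit(mean - hw))
      .withColumn("int", lit(hw)) // assumption: "int" means the half-width
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("MeanConfidenceInterval")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question
    val df = Seq(("a", 2, 3), ("b", 5, 4), ("b", 5, 4), ("c", 3, 4))
      .toDF("value1", "value2", "value3")

    withMeanCI(df, "value3", 0.95).show(false)
    spark.stop()
  }
}
```

Because the statistics are aggregated once and attached with lit, every row carries the same three values, which matches the shape of the desired output table. If a separate interval per group is needed (for example per value1), a groupBy followed by a join would be one way to extend this.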