How to get the Spark Scala correlation output as a dataframe?
I am trying to calculate the correlation between all columns of a Spark dataframe using the code below.
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.feature.VectorAssembler
val spark = SparkSession
.builder
.appName("SparkCorrelation")
.master("local[*]")
.getOrCreate()
import spark.implicits._
val df = Seq(
  (0.1, 0.3, 0.5),
  (0.2, 0.4, 0.6)
).toDF("c1", "c2", "c3")
val assembler = new VectorAssembler()
.setInputCols(Array("c1", "c2", "c3"))
.setOutputCol("vectors")
val transformed = assembler.transform(df)
val corr = Correlation.corr(transformed, "vectors", "pearson")
corr.show(100, false)
My output comes out as a dataframe with a single column.
pearson(vectors) |
---|
1.0 1.0000000000000002 0.9999999999999998 \n1.0000000000000002 1.0 1.0000000000000002 \n0.9999999999999998 1.0000000000000002 1.0 |
But I want my output in the following format. Can somebody please help?
Column | c1 | c2 | c3 |
---|---|---|---|
c1 | 1 | 0.97 | 0.92 |
c2 | 0.97 | 1 | 0.94 |
c3 | 0.92 | 0.94 | 1 |
The best you can do directly is this, but without the column names:
val corr = Correlation.corr(transformed, "vectors", "pearson").head
println(s"Pearson correlation matrix:\n $corr")
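To get the labeled table the question asks for, you can unpack the `Matrix` from the single-row result and rebuild a DataFrame yourself. A minimal sketch of that step, assuming the `transformed` DataFrame and `spark.implicits._` from the question's code are in scope (the `cols` array and `corrDf` name are mine, not from the original):

```scala
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

val cols = Array("c1", "c2", "c3")

// Correlation.corr returns a one-row DataFrame whose only cell is the Matrix;
// pattern-match on Row to pull the Matrix out.
val Row(matrix: Matrix) = Correlation.corr(transformed, "vectors", "pearson").head

// Pair each matrix row with its column name, then flatten into tuples
// (this hard-codes three value columns to match the example).
val corrDf = matrix.rowIter.toSeq.zip(cols).map { case (vec, name) =>
  (name, vec(0), vec(1), vec(2))
}.toDF("Column" +: cols: _*)

corrDf.show()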