Scala: How to get the mean, variance, and covariance of a matrix?

Question

I am new to Scala and need some guidance on the following problem:

I have a DataFrame like the one below (some elements may be NULL):

val df = Seq(
  (1, 1, 1, 3),
  (1, 2, 0, 0),
  (1, 3, 1, 1),
  (1, 4, 0, 2),
  (1, 5, 0, 1),
  (2, 1, 1, 3),
  (2, 2, 1, 1),
  (2, 3, 0, 0)).toDF("m1", "m2", "m3", "m4")

df.show
+---+---+---+---+
| m1| m2| m3| m4|
+---+---+---+---+
|  1|  1|  1|  3|
|  1|  2|  0|  0|
|  1|  3|  1|  1|
|  1|  4|  0|  2|
|  1|  5|  0|  1|
|  2|  1|  1|  3|
|  2|  2|  1|  1|
|  2|  3|  0|  0|
+---+---+---+---+

I need to get the following statistics out of this DataFrame:

A vector that contains the mean of each column (some elements might be NULL, and I want to calculate the mean using only the non-NULL elements). I would also like to refer to each element of the vector by name; for example, vec_mean["m1_mean"] would return the first element.

vec_mean: Vector(m1_mean, m2_mean, m3_mean, m4_mean)

A 4 x 4 variance-covariance matrix, where the diagonal entries are var(m1), var(m2), ..., and the off-diagonal entries are cov(m1,m2), cov(m1,m3), ... Here, too, I would like to use only the non-NULL elements in the calculation.
A vector that contains the number of non-NULL elements in each column:

vec_n: Vector(m1_n, m2_n, m3_n, m4_n)

A vector that contains the standard deviation of each column:

vec_stdev: Vector(m1_stde, m2_stde, m3_stde, m4_stde)

In R I would convert everything to a matrix, and the rest would be easy. But in Scala I'm unfamiliar with matrices, and there are apparently multiple types of matrices (DenseMatrix, IndexedMatrix, etc.), which is confusing.

Answer 1

You can work with Spark's RowMatrix. It treats each row as an observation and offers operations such as computing the covariance matrix, the column means, the column variances, etc. The only thing you have to know is how to build it from a DataFrame.

A DataFrame in Spark carries a schema that describes the types of data stored in it, and those are not limited to arrays of floating-point numbers. So the first step is to transform the DataFrame into an RDD of vectors (dense vectors in this case).

Given this DataFrame:

  val df = Seq(
    (1, 1, 1, 3),
    (1, 2, 0, 0),
    (1, 3, 1, 1),
    (1, 4, 0, 2),
    (1, 5, 0, 1),
    (2, 1, 1, 3),
    (2, 2, 1, 1),
    (2, 3, 0, 0)).toDF("m1", "m2", "m3", "m4")

Convert it to an RDD[DenseVector] representation. There must be dozens of ways of doing this. One could be:

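import org.apache.spark.sql.Row

val rdd = df.rdd.map { (row: Row) =>
  // Read every column as a Double. A NULL cell is replaced by 0.0 because a
  // dense vector cannot hold NULLs; note that this pulls the statistics
  // towards zero, so filter or impute NULLs beforehand if that matters.
  (0 until row.length).map { i =>
    if (row.isNullAt(i)) 0.0 else row.getInt(i).toDouble
  }.toArray
}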

As you can see in your IDE, the inferred type is RDD[Array[Double]]. Now convert this to an RDD of dense vectors, which is as simple as:

import org.apache.spark.mllib.linalg.Vectors

val rowsRdd = rdd.map(Vectors.dense(_))

And now you can build your matrix:

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = new RowMatrix(rowsRdd)

Once you have the matrix, you can easily compute the different metrics per column:

// Compute the column summary once and reuse it for every statistic.
val summary = mat.computeColumnSummaryStatistics()
println("Mean: " + summary.mean)
println("Variance: " + summary.variance)

It gives:

Mean: [1.375,2.625,0.5,1.375]
Variance: [0.26785714285714285,1.9821428571428572,0.2857142857142857,1.4107142857142858]
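As a cross-check on the figures above, and to show what "use only the non-NULL elements" means arithmetically, here is a minimal plain-Scala sketch (no Spark involved). It is an illustration rather than part of the original answer: a column is modeled as Seq[Option[Double]] with None standing in for NULL, the helper names nonNull, mean and sampleVariance are made up for this example, and the variance uses the n - 1 denominator, which is what matches the printed values.

```scala
// A column is modeled as Seq[Option[Double]]; None stands in for a NULL cell.
// The helper names below are illustrative only, not Spark API.
def nonNull(col: Seq[Option[Double]]): Seq[Double] = col.flatten

// Mean over the non-NULL values only.
def mean(col: Seq[Option[Double]]): Double = {
  val xs = nonNull(col)
  xs.sum / xs.size
}

// Sample variance (n - 1 denominator) over the non-NULL values only.
def sampleVariance(col: Seq[Option[Double]]): Double = {
  val xs = nonNull(col)
  val mu = xs.sum / xs.size
  xs.map(x => (x - mu) * (x - mu)).sum / (xs.size - 1)
}

// Column m1 from the example data.
val m1: Seq[Option[Double]] = Seq(1, 1, 1, 1, 1, 2, 2, 2).map(x => Some(x.toDouble))

// The same column with its first cell set to NULL.
val m1WithNull: Seq[Option[Double]] = m1.updated(0, None)

println(mean(m1))                  // 1.375
println(sampleVariance(m1))
println(nonNull(m1WithNull).size)  // 7
println(mean(m1WithNull))
```

With the NULL skipped, the mean of m1WithNull is taken over 7 values, whereas substituting 0 for the NULL would divide the same sum by 8 and give a smaller result.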

You can read more about the capabilities of Spark and these distributed types in the docs: https://spark.apache.org/docs/latest/mllib-data-types.html#data-types-rdd-based-api

You can also compute the covariance matrix, perform an SVD, etc.
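The covariance matrix mentioned above, along with name-indexed means and standard deviations, could be obtained roughly as follows. This is a sketch under a few assumptions: it builds the RowMatrix directly from dense vectors with the example values (so there are no NULLs), it starts its own local SparkSession, and the names vecMean, vecStdev and the "_mean"/"_stdev" key suffixes are chosen here only to mirror the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("col-stats").getOrCreate()

// Same values as the example DataFrame, one dense vector per row.
val rows = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 1.0, 1.0, 3.0),
  Vectors.dense(1.0, 2.0, 0.0, 0.0),
  Vectors.dense(1.0, 3.0, 1.0, 1.0),
  Vectors.dense(1.0, 4.0, 0.0, 2.0),
  Vectors.dense(1.0, 5.0, 0.0, 1.0),
  Vectors.dense(2.0, 1.0, 1.0, 3.0),
  Vectors.dense(2.0, 2.0, 1.0, 1.0),
  Vectors.dense(2.0, 3.0, 0.0, 0.0)
))

val mat = new RowMatrix(rows)
val summary = mat.computeColumnSummaryStatistics()
val names = Seq("m1", "m2", "m3", "m4")

// Name-indexed lookups, e.g. vecMean("m1_mean").
val vecMean: Map[String, Double] =
  names.map(_ + "_mean").zip(summary.mean.toArray).toMap
val vecStdev: Map[String, Double] =
  names.map(_ + "_stdev").zip(summary.variance.toArray.map(v => math.sqrt(v))).toMap

// 4 x 4 covariance matrix, returned as a local matrix;
// its diagonal holds the column variances.
val cov = mat.computeCovariance()

println(vecMean("m1_mean"))
println(vecStdev("m1_stdev"))
println(cov)

spark.stop()
```

Because a vector cannot hold NULLs, summary.count is simply the number of rows, not a per-column non-NULL count. If NULLs must be skipped rather than replaced, one alternative worth checking is to aggregate on the DataFrame itself with the functions in org.apache.spark.sql.functions (count, avg, stddev_samp, covar_samp), which, to my understanding, ignore NULL inputs.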
