
Scala: how to get the mean, variance, and covariance of a matrix?

I am new to Scala and need some guidance on the following problem:

I have a DataFrame like the one below (some elements may be NULL):

val df = Seq(
  (1, 1, 1, 3),
  (1, 2, 0, 0),
  (1, 3, 1, 1),
  (1, 4, 0, 2),
  (1, 5, 0, 1),
  (2, 1, 1, 3),
  (2, 2, 1, 1),
  (2, 3, 0, 0)).toDF("m1", "m2", "m3", "m4")

df.show
+---+---+---+---+
| m1| m2| m3| m4|
+---+---+---+---+
|  1|  1|  1|  3|
|  1|  2|  0|  0|
|  1|  3|  1|  1|
|  1|  4|  0|  2|
|  1|  5|  0|  1|
|  2|  1|  1|  3|
|  2|  2|  1|  1|
|  2|  3|  0|  0|
+---+---+---+---+

I need to get the following statistics out of this DataFrame:

  1. A vector that contains the mean of each column (some elements might be NULL, and I want to calculate the mean using only the non-NULL elements). I would also like to refer to each element of the vector by name; for example, vec_mean["m1_mean"] would return the first element.
     vec_mean: Vector(m1_mean, m2_mean, m3_mean, m4_mean)
  2. A variance-covariance matrix that is (4 x 4), where the diagonals are var(m1), var(m2), ..., and the off-diagonals are cov(m1,m2), cov(m1,m3), ... Here, I would also like to use only the non-NULL elements in the variance-covariance calculation.
  3. A vector that contains the number of non-NULL elements in each column.
     vec_n: Vector(m1_n, m2_n, m3_n, m4_n)
  4. A vector that contains the standard deviation of each column.
     vec_stdev: Vector(m1_stde, m2_stde, m3_stde, m4_stde)

In R I would convert everything to a matrix, and the rest would be easy. But in Scala I'm unfamiliar with matrices, and there are apparently multiple types of matrices, which is confusing (DenseMatrix, IndexedMatrix, etc.).

You can work with Spark's RowMatrix. It provides operations such as computing the covariance matrix (treating each row as an observation), the column means, the column variances, and so on. The only thing you need to know is how to build one from a DataFrame.

A Spark DataFrame carries a schema that describes the types of data it can hold, and those are not limited to arrays of floating-point numbers. So the first step is to transform the DataFrame into an RDD of vectors (dense vectors in this case).

Given this DataFrame:

  val df = Seq(
    (1, 1, 1, 3),
    (1, 2, 0, 0),
    (1, 3, 1, 1),
    (1, 4, 0, 2),
    (1, 5, 0, 1),
    (2, 1, 1, 3),
    (2, 2, 1, 1),
    (2, 3, 0, 0)).toDF("m1", "m2", "m3", "m4")

Convert it to an RDD of dense vectors. There are many ways of doing this. One is to first map each Row to an array of doubles:

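// Turn each Row into an Array[Double]; NULL cells are replaced with 0.0.
// Caution: zero-filling changes the statistics whenever NULLs are present.
val rdd = df.rdd.map { row =>
  (0 until row.length).map { i =>
    // isNullAt is needed here: comparing an Int with null never detects a NULL cell.
    if (row.isNullAt(i)) 0.0 else row.getInt(i).toDouble
  }.toArray
}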

As you can see in your IDE, the inferred type is RDD[Array[Double]]. Now convert this to an RDD[Vector] holding dense vectors. It is as simple as:

import org.apache.spark.mllib.linalg.Vectors
val rowsRdd = rdd.map(Vectors.dense(_))

And now you can build your matrix:

import org.apache.spark.mllib.linalg.distributed.RowMatrix
val mat: RowMatrix = new RowMatrix(rowsRdd)

Once you have the matrix, you can easily compute the different metrics per column:

val summary = mat.computeColumnSummaryStatistics() // one pass serves both statistics
println("Mean: " + summary.mean)
println("Variance: " + summary.variance)

It gives:

Mean: [1.375,2.625,0.5,1.375]

Variance: [0.26785714285714285,1.9821428571428572,0.2857142857142857,1.4107142857142858]
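The question also asked for name-based lookup, per-column standard deviations, and counts. Here is a minimal sketch of one way to get them from the same summary object. It assumes a local SparkSession created just for illustration, data without NULLs, and key names such as `m1_mean` and `m1_stdev` that are only a naming convention borrowed from the question, not anything Spark provides.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("named-stats").getOrCreate()
import spark.implicits._

val df = Seq(
  (1, 1, 1, 3), (1, 2, 0, 0), (1, 3, 1, 1), (1, 4, 0, 2),
  (1, 5, 0, 1), (2, 1, 1, 3), (2, 2, 1, 1), (2, 3, 0, 0)
).toDF("m1", "m2", "m3", "m4")

// This sample has no NULLs, so getInt is safe here.
val rows = df.rdd.map { r =>
  Vectors.dense(Array.tabulate(r.length)(i => r.getInt(i).toDouble))
}
val summary = new RowMatrix(rows).computeColumnSummaryStatistics()

// Pair each column name with its statistic so values can be looked up by name.
val names = df.columns
val vecMean: Map[String, Double] =
  names.map(_ + "_mean").zip(summary.mean.toArray).toMap
val vecStdev: Map[String, Double] =
  names.map(_ + "_stdev").zip(summary.variance.toArray.map(math.sqrt)).toMap

// Number of rows seen by the summarizer (the same for every column).
val n: Long = summary.count

println(vecMean("m1_mean"))
println(vecStdev("m2_stdev"))
println(n)
```

Note that `summary.count` is the number of rows, not a per-column non-NULL count, so it only answers the "number of non-NULL values" part when the data has no NULLs.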

You can read more about the capabilities of Spark and these distributed types in the docs: https://spark.apache.org/docs/latest/mllib-data-types.html#data-types-rdd-based-api
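If the DataFrame really does contain NULLs, replacing them with 0 before building the matrix changes the means and variances. A possible alternative, sketched below, is to stay in the DataFrame API: to my knowledge the aggregate functions `avg`, `count`, `stddev_samp`, and `covar_samp` in `org.apache.spark.sql.functions` skip NULL inputs, and `covar_samp` drops a row for a given pair when either value is NULL. The two-column sample data and the alias names (`m1_mean`, `cov_m1_m2`, and so on) are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col, count, covar_samp, stddev_samp}

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("null-aware-stats").getOrCreate()
import spark.implicits._

// Hypothetical sample with NULLs, modelled as Option[Int].
val df = Seq[(Option[Int], Option[Int])](
  (Some(1), Some(2)),
  (Some(2), None),
  (None, Some(6)),
  (Some(3), Some(4))
).toDF("m1", "m2")

val cols = df.columns

// Per-column mean, non-NULL count, and sample standard deviation in one aggregation.
val statExprs = cols.flatMap { c =>
  Seq(
    avg(col(c)).as(s"${c}_mean"),
    count(col(c)).as(s"${c}_n"),
    stddev_samp(col(c)).as(s"${c}_stdev"))
}
val statsRow = df.agg(statExprs.head, statExprs.tail: _*).head()

val m1Mean = statsRow.getAs[Double]("m1_mean")
val m1N = statsRow.getAs[Long]("m1_n")
val m2Stdev = statsRow.getAs[Double]("m2_stdev")

// Pairwise sample covariances; each pair uses only the rows where both values are present.
val covExprs = for (a <- cols; b <- cols) yield covar_samp(a, b).as(s"cov_${a}_$b")
val covRow = df.agg(covExprs.head, covExprs.tail: _*).head()
val covMatrix: Array[Array[Double]] =
  cols.map(a => cols.map(b => covRow.getAs[Double](s"cov_${a}_$b")))

println(s"m1_mean = $m1Mean, m1_n = $m1N, m2_stdev = $m2Stdev")
println(covMatrix.map(_.mkString(" ")).mkString("\n"))
```

One design caveat: because every pair is computed over a different subset of rows, a matrix assembled this way is not guaranteed to be positive semidefinite. If that matters, dropping incomplete rows first (for example with `df.na.drop()`) is the simpler choice.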

With the RowMatrix you can also compute the covariance matrix, perform an SVD, and so on.
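For the variance-covariance matrix specifically, a short sketch using `RowMatrix.computeCovariance()` could look like this. It assumes a local SparkSession and rebuilds the sample data directly as dense vectors so that the snippet stands alone; the `covByName` lookup map is a hypothetical convenience, not part of the Spark API.

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("covariance").getOrCreate()

// The same eight observations as in the question, one dense vector per row.
val rows = spark.sparkContext.parallelize(Seq(
  Vectors.dense(1.0, 1.0, 1.0, 3.0),
  Vectors.dense(1.0, 2.0, 0.0, 0.0),
  Vectors.dense(1.0, 3.0, 1.0, 1.0),
  Vectors.dense(1.0, 4.0, 0.0, 2.0),
  Vectors.dense(1.0, 5.0, 0.0, 1.0),
  Vectors.dense(2.0, 1.0, 1.0, 3.0),
  Vectors.dense(2.0, 2.0, 1.0, 1.0),
  Vectors.dense(2.0, 3.0, 0.0, 0.0)
))

// A local 4 x 4 matrix: variances on the diagonal, covariances off the diagonal.
val cov: Matrix = new RowMatrix(rows).computeCovariance()

// Hypothetical helper: look up an entry by a pair of column names.
val names = Seq("m1", "m2", "m3", "m4")
val covByName: Map[(String, String), Double] =
  (for (i <- names.indices; j <- names.indices)
    yield (names(i), names(j)) -> cov(i, j)).toMap

println(cov)
println(covByName(("m1", "m2")))
```

The diagonal entries should agree with the variances printed earlier, since both are sample statistics (divided by n - 1).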
