
Scala: how to get the mean and variance and covariance of a matrix?

I am new to Scala and I desperately need some guidance on the following problem:

I have a dataframe like the one below (some elements may be NULL)

val df = Seq(
  (1, 1, 1, 3),
  (1, 2, 0, 0),
  (1, 3, 1, 1),
  (1, 4, 0, 2),
  (1, 5, 0, 1),
  (2, 1, 1, 3),
  (2, 2, 1, 1),
  (2, 3, 0, 0)).toDF("m1", "m2", "m3", "m4")

df.show
+---+---+---+---+
| m1| m2| m3| m4|
+---+---+---+---+
|  1|  1|  1|  3|
|  1|  2|  0|  0|
|  1|  3|  1|  1|
|  1|  4|  0|  2|
|  1|  5|  0|  1|
|  2|  1|  1|  3|
|  2|  2|  1|  1|
|  2|  3|  0|  0|
+---+---+---+---+

I need to get the following statistics out of this dataframe:

  1. A vector that contains the mean of each column (some elements might be NULL, and I want to calculate the mean using only the non-NULL elements). I would also like to refer to each element of the vector by name; for example, vec_mean["m1_mean"] would return the first element.

     vec_mean: Vector(m1_mean, m2_mean, m3_mean, m4_mean)

  2. A variance-covariance matrix that is (4 x 4), where the diagonals are var(m1), var(m2), ..., and the off-diagonals are cov(m1,m2), cov(m1,m3), ... Here, I would also like to use only the non-NULL elements in the variance-covariance calculation.

  3. A vector that contains the number of non-NULL elements for each column.

     vec_n: Vector(m1_n, m2_n, m3_n, m4_n)

  4. A vector that contains the standard deviation of each column.

     vec_stdev: Vector(m1_stde, m2_stde, m3_stde, m4_stde)
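
To make the requested statistics concrete, here is a minimal plain-Scala sketch of the formulas involved, assuming sample (n - 1) normalization and NULLs modeled as `Option` (the names `mean`, `variance`, and `covariance` are illustrative helpers, not a library API); only non-null values enter each calculation:

```scala
// Mean over the non-null (Some) elements only
def mean(xs: Seq[Option[Double]]): Double = {
  val v = xs.flatten
  v.sum / v.size
}

// Sample variance (n - 1 in the denominator) over non-null elements
def variance(xs: Seq[Option[Double]]): Double = {
  val v = xs.flatten
  val m = v.sum / v.size
  v.map(x => (x - m) * (x - m)).sum / (v.size - 1)
}

// Sample covariance over rows where BOTH columns are non-null
def covariance(xs: Seq[Option[Double]], ys: Seq[Option[Double]]): Double = {
  val pairs = xs.zip(ys).collect { case (Some(x), Some(y)) => (x, y) }
  val mx = pairs.map(_._1).sum / pairs.size
  val my = pairs.map(_._2).sum / pairs.size
  pairs.map { case (x, y) => (x - mx) * (y - my) }.sum / (pairs.size - 1)
}

// m1 column from the example DataFrame above
val m1 = Seq(1, 1, 1, 1, 1, 2, 2, 2).map(x => Some(x.toDouble))
println(mean(m1))      // 1.375
println(variance(m1))  // 0.26785714285714285
```

The diagonal of the variance-covariance matrix is then `covariance(c, c) == variance(c)` for each column `c`, and the standard deviation is the square root of the variance.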

In R I would convert everything to a matrix and then the rest is easy. But in Scala I'm unfamiliar with matrices, and there are apparently multiple matrix types (DenseMatrix, IndexedRowMatrix, etc.), which is confusing.

You can work with Spark's RowMatrix. It offers exactly these kinds of operations: computing the covariance matrix using each row as an observation, column means, variances, etc. The only thing you have to know is how to build one from a DataFrame.

It turns out that a DataFrame in Spark carries a schema describing the type of information it can store, and that is not just arrays of floating-point numbers. So the first step is to transform this DataFrame into an RDD of vectors (dense vectors in this case).

Having this DF:

  val df = Seq(
    (1, 1, 1, 3),
    (1, 2, 0, 0),
    (1, 3, 1, 1),
    (1, 4, 0, 2),
    (1, 5, 0, 1),
    (2, 1, 1, 3),
    (2, 2, 1, 1),
    (2, 3, 0, 0)).toDF("m1", "m2", "m3", "m4")

Convert it to an RDD of double arrays. There must be dozens of ways of doing this. One could be:

import org.apache.spark.sql.Row

// Map each Row to an Array[Double], replacing NULLs with 0.0.
// Note: check isNullAt rather than comparing to null -- getAs[Int]
// returns a primitive Int, which can never equal null (and would
// silently unbox a NULL to 0 anyway).
val rdd = df.rdd.map { a: Row =>
  (0 until a.length).foldRight(Array[Double]()) { (b, acc) =>
    val k = if (a.isNullAt(b)) 0.0 else a.getAs[Int](b).toDouble
    k +: acc
  }
}

As you can see in your IDE, the inferred type is RDD[Array[Double]]. Now convert this to an RDD of dense vectors. As simple as doing:

import org.apache.spark.mllib.linalg.Vectors

val rowsRdd = rdd.map(Vectors.dense(_))

And now you can build your Matrix:

import org.apache.spark.mllib.linalg.distributed.RowMatrix

val mat: RowMatrix = new RowMatrix(rowsRdd)

Once you have the matrix, you can easily compute the different metrics per column:

val summary = mat.computeColumnSummaryStatistics()
println("Mean: " + summary.mean)
println("Variance: " + summary.variance)

It gives:

Mean: [1.375,2.625,0.5,1.375]

Variance: 
[0.26785714285714285,1.9821428571428572,0.2857142857142857,1.4107142857142858]
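
The question also asks for standard deviations and non-null counts. The column summary additionally exposes `count` and `numNonzeros` (note that `numNonzeros` counts non-zero entries, not non-null ones), and the standard deviation is just the square root of each variance. A small sketch using the variances printed above (the `"_stde"` naming follows the question; `vecStdev` is an illustrative name):

```scala
// Variances per column, as printed by the summary above
val variances = Array(0.26785714285714285, 1.9821428571428572,
                      0.2857142857142857, 1.4107142857142858)
val names = Array("m1", "m2", "m3", "m4")

// Build a name -> stdev map so each element can be looked up by name
val vecStdev = names.zip(variances).map { case (n, v) =>
  (n + "_stde", math.sqrt(v))
}.toMap

println(vecStdev("m1_stde"))  // ~ 0.5175
```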

You can read more about the capabilities of Spark and these distributed types in the docs: https://spark.apache.org/docs/latest/mllib-data-types.html#data-types-rdd-based-api

You can also compute the covariance matrix with computeCovariance, do the SVD, etc.
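
As a sanity check on what the covariance computation does, here is a plain-Scala sketch of the (sample) covariance matrix for the example data; its diagonal should reproduce the variances printed above:

```scala
// The example DataFrame's rows as plain double arrays
val data = Array(
  Array(1.0, 1.0, 1.0, 3.0),
  Array(1.0, 2.0, 0.0, 0.0),
  Array(1.0, 3.0, 1.0, 1.0),
  Array(1.0, 4.0, 0.0, 2.0),
  Array(1.0, 5.0, 0.0, 1.0),
  Array(2.0, 1.0, 1.0, 3.0),
  Array(2.0, 2.0, 1.0, 1.0),
  Array(2.0, 3.0, 0.0, 0.0))

val n = data.length
val p = data.head.length
val means = (0 until p).map(j => data.map(_(j)).sum / n)

// cov(i, j) = sum over rows of (x_i - mean_i)(x_j - mean_j) / (n - 1)
val cov = Array.tabulate(p, p) { (i, j) =>
  data.map(r => (r(i) - means(i)) * (r(j) - means(j))).sum / (n - 1)
}

println(cov(0)(0))  // 0.26785714285714285, i.e. var(m1) above
```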
