I have a Spark DataFrame named df as input:
+---------------+---+---+---+---+
|Main_CustomerID| A1| A2| A3| A4|
+---------------+---+---+---+---+
| 101| 1| 0| 2| 1|
| 102| 0| 3| 1| 1|
| 103| 2| 1| 0| 0|
+---------------+---+---+---+---+
I need to collect the values of A1, A2, A3, A4 into an MLlib matrix such as:
dm: org.apache.spark.mllib.linalg.Matrix =
1.0 0.0 2.0 1.0
0.0 3.0 1.0 1.0
2.0 1.0 0.0 0.0
How can I achieve this in Scala?
You can do it as follows. First, get all columns that should be included in the matrix:
import org.apache.spark.sql.functions._
val matrixColumns = df.columns.filter(_.startsWith("A")).map(col(_))
Then convert the dataframe to an RDD[Vector]. Since the vectors need to contain doubles, the conversion from Int to Double is done here as well.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import spark.implicits._ // encoder for .as[Array[Int]], assuming `spark` is your SparkSession
val rdd = df.select(array(matrixColumns:_*).as("arr")).as[Array[Int]].rdd
.zipWithIndex()
.map{ case(arr, index) => IndexedRow(index, Vectors.dense(arr.map(_.toDouble)))}
Then convert the RDD to an IndexedRowMatrix, which can be converted, if required, to a local Matrix:
val dm = new IndexedRowMatrix(rdd).toBlockMatrix().toLocalMatrix()
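Spark aside, the shape of the zipWithIndex step can be sketched with plain Scala collections; the sample rows below are hypothetical values copied from the question, not read from the real df:

```scala
// Hypothetical rows matching the question's A1..A4 values.
val rawRows = Seq(Array(1, 0, 2, 1), Array(0, 3, 1, 1), Array(2, 1, 0, 0))

// Pair each row with its position and convert ints to doubles --
// the same work zipWithIndex + IndexedRow perform on the RDD.
val indexedRows = rawRows.zipWithIndex.map { case (arr, index) =>
  (index.toLong, arr.map(_.toDouble))
}
```

Each element of indexedRows corresponds to one IndexedRow(index, Vectors.dense(...)) in the RDD version.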
For smaller matrices that fit on the driver there is an easier alternative:
import org.apache.spark.mllib.linalg.Matrices

val matrixColumns = df.columns.filter(_.startsWith("A")).map(col(_))
val arr = df.select(array(matrixColumns:_*).as("arr")).as[Array[Int]]
.collect()
.flatten
.map(_.toDouble)
val rows = df.count().toInt
val cols = matrixColumns.length
// Matrices.dense fills the array column-major, but arr is row-major,
// so swap cols and rows here and transpose afterwards
val dm = Matrices.dense(cols, rows, arr).transpose
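The cols/rows swap can be checked with plain index arithmetic, no Spark needed; the flattened values below are hypothetical, copied from the question's 3x4 example:

```scala
// Row-major flattened data for the 3x4 example matrix.
val nRows = 3
val nCols = 4
val flat = Array(1.0, 0.0, 2.0, 1.0,   // row 0
                 0.0, 3.0, 1.0, 1.0,   // row 1
                 2.0, 1.0, 0.0, 0.0)   // row 2

// A column-major matrix stores entry (r, c) of a numRows x numCols
// matrix at arr(c * numRows + r). Declaring the matrix as
// (nCols x nRows) and then transposing puts entry (i, j) of the
// result at flat(i * nCols + j) -- exactly the row-major layout
// `flat` already has.
def asDeclared(r: Int, c: Int): Double = flat(c * nCols + r) // (nCols x nRows), column-major
def afterTranspose(i: Int, j: Int): Double = asDeclared(j, i)

val matchesRowMajor = (0 until nRows).forall { i =>
  (0 until nCols).forall(j => afterTranspose(i, j) == flat(i * nCols + j))
}
```

This is why passing (cols, rows) and transposing recovers the intended (rows x cols) matrix.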