columnSimilarities() back to Spark Data Frame
I have a second question around CosineSimilarity / ColumnSimilarities in Spark 2.1. I'm fairly new to Scala and the whole Spark environment, and this is not really clear to me:

How can I get back the column similarities for each combination of columns from a RowMatrix in Spark? Here is what I tried:
Data:
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SQLContext, Row, DataFrame}
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DoubleType}
import org.apache.spark.sql.functions._

// RDD of rows
val rowsRdd: RDD[Row] = sc.parallelize(
  Seq(
    Row(2.0, 7.0, 1.0),
    Row(3.5, 2.5, 0.0),
    Row(7.0, 5.9, 0.0)
  )
)
// Schema
val schema = new StructType()
  .add(StructField("item_1", DoubleType, true))
  .add(StructField("item_2", DoubleType, true))
  .add(StructField("item_3", DoubleType, true))

// Data frame
val df = spark.createDataFrame(rowsRdd, schema)
Code:
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, CoordinateMatrix, RowMatrix}

// Assemble all columns into a single vector column
val rows = new VectorAssembler().setInputCols(df.columns).setOutputCol("vs")
  .transform(df)
  .select("vs")
  .rdd

// Convert ml.Vector to mllib.Vector, as RowMatrix expects the latter
val items_mllib_vector = rows.map(_.getAs[org.apache.spark.ml.linalg.Vector](0))
  .map(org.apache.spark.mllib.linalg.Vectors.fromML)

val mat = new RowMatrix(items_mllib_vector)
val simsPerfect = mat.columnSimilarities()

println("Pairwise similarities are: " + simsPerfect.entries.collect.mkString(", "))
Output:
Pairwise similarities are: MatrixEntry(0,2,0.24759378423606918), MatrixEntry(1,2,0.7376189553526812), MatrixEntry(0,1,0.8355316482961213)
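For reference, columnSimilarities() computes the cosine similarity between each pair of columns. A minimal, Spark-free Scala sketch (column values taken from the example data above) reproduces the same numbers from first principles:

```scala
// Columns of the example DataFrame, read top to bottom
val item1 = Vector(2.0, 3.5, 7.0)
val item2 = Vector(7.0, 2.5, 5.9)
val item3 = Vector(1.0, 0.0, 0.0)

// Dot product of two equally sized vectors
def dot(a: Vector[Double], b: Vector[Double]): Double =
  a.zip(b).map { case (x, y) => x * y }.sum

// Euclidean norm
def norm(a: Vector[Double]): Double = math.sqrt(dot(a, a))

// Cosine similarity: dot product divided by the product of the norms
def cosine(a: Vector[Double], b: Vector[Double]): Double =
  dot(a, b) / (norm(a) * norm(b))

println(f"(0,1): ${cosine(item1, item2)}%.6f") // ~0.835532
println(f"(0,2): ${cosine(item1, item3)}%.6f") // ~0.247594
println(f"(1,2): ${cosine(item2, item3)}%.6f") // ~0.737619
```

These match the three MatrixEntry values in the output, which confirms the matrix indices (0, 1, 2) refer to the columns in their original order.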
So what I get is simsPerfect, an org.apache.spark.mllib.linalg.distributed.CoordinateMatrix of my columns and their similarities. How would I transform this back to a DataFrame and get the right column names with it?
My preferred output:
item_from | item_to | similarity
1         | 2       | 0.83
1         | 3       | 0.24
2         | 3       | 0.73
Thanks in advance.
This approach also works without converting the row to String:
val transformedRDD = simsPerfect.entries.map { case MatrixEntry(row: Long, col: Long, sim: Double) => (row, col, sim) }
val dff = sqlContext.createDataFrame(transformedRDD).toDF("item_from", "item_to", "sim")
where I assume that val sqlContext = new org.apache.spark.sql.SQLContext(sc) is already defined and sc is the SparkContext.
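To also recover the actual column names rather than the numeric indices, one option is to index into df.columns with the entry coordinates. A sketch, assuming df and simsPerfect from the question are in scope and spark is the SparkSession (the names colNames, named, and namedDf are illustrative, not from the original post):

```scala
import org.apache.spark.mllib.linalg.distributed.MatrixEntry

// Column i of the RowMatrix corresponds to df.columns(i),
// because VectorAssembler was given df.columns in order
val colNames = df.columns // Array("item_1", "item_2", "item_3")

val named = simsPerfect.entries.map { case MatrixEntry(i, j, sim) =>
  (colNames(i.toInt), colNames(j.toInt), sim)
}

val namedDf = spark.createDataFrame(named).toDF("item_from", "item_to", "similarity")
namedDf.show()
```

The lookup is safe here because columnSimilarities() only ever produces indices within the range of the assembled input columns.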
I found a solution for my problem:
// Transform the result to an RDD of comma-separated strings
val transformedRDD = simsPerfect.entries.map { case MatrixEntry(row: Long, col: Long, sim: Double) => Array(row, col, sim).mkString(",") }

// Transform RDD[String] to RDD[Row]
val rdd2 = transformedRDD.map(a => Row(a))

// To DataFrame
val dfschema = StructType(Array(StructField("value", StringType)))
val rddToDF = spark.createDataFrame(rdd2, dfschema)

// Create a new DataFrame by splitting the string back into columns
val newdf = rddToDF.select(
  expr("(split(value, ','))[0]").cast("string").as("item_from"),
  expr("(split(value, ','))[1]").cast("string").as("item_to"),
  expr("(split(value, ','))[2]").cast("string").as("sim"))
I'm sure there is another, easier way to do this, but I'm happy that it works.