Create a dataframe with a SparseVector in PySpark
Let's say I have a Spark DataFrame that looks like this:
Row(Y=a, X1=3.2, X2=4.5)
What I'd want is:
Row(Y=a, features=SparseVector(2, {X1: 3.2, X2: 4.5}))
Perhaps this is helpful -
Written in Scala, but it can be implemented in PySpark with minimal changes:
val df = spark.sql("select 'a' as Y, 3.2 as X1, 4.5 as X2")
df.show(false)
df.printSchema()
/**
* +---+---+---+
* |Y |X1 |X2 |
* +---+---+---+
* |a |3.2|4.5|
* +---+---+---+
*
* root
* |-- Y: string (nullable = false)
* |-- X1: decimal(2,1) (nullable = false)
* |-- X2: decimal(2,1) (nullable = false)
*/
import org.apache.spark.ml.feature.VectorAssembler
val features = new VectorAssembler()
.setInputCols(Array("X1", "X2"))
.setOutputCol("features")
.transform(df)
features.show(false)
features.printSchema()
/**
* +---+---+---+---------+
* |Y |X1 |X2 |features |
* +---+---+---+---------+
* |a |3.2|4.5|[3.2,4.5]|
* +---+---+---+---------+
*
* root
* |-- Y: string (nullable = false)
* |-- X1: decimal(2,1) (nullable = false)
* |-- X2: decimal(2,1) (nullable = false)
* |-- features: vector (nullable = true)
*/