How to convert Spark Dense Matrix to a Spark Dataframe
I am trying to implement some code in Scala Spark where I have a multiclass logistic regression model, and the model produces a coefficient matrix.
Here is the code:
val training = spark.read.format("libsvm").load("data/mllib/sample_multiclass_classification_data.txt")
training.show(false)
+-----+-----------------------------------------------------------+
|label|features |
+-----+-----------------------------------------------------------+
|1.0 |(4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333]) |
|1.0 |(4,[0,1,2,3],[-0.555556,0.25,-0.864407,-0.916667]) |
|1.0 |(4,[0,1,2,3],[-0.722222,-0.166667,-0.864407,-0.833333]) |
|1.0 |(4,[0,1,2,3],[-0.722222,0.166667,-0.694915,-0.916667]) |
|0.0 |(4,[0,1,2,3],[0.166667,-0.416667,0.457627,0.5]) |
|1.0 |(4,[0,2,3],[-0.833333,-0.864407,-0.916667]) |
|2.0 |(4,[0,1,2,3],[-1.32455E-7,-0.166667,0.220339,0.0833333]) |
|2.0 |(4,[0,1,2,3],[-1.32455E-7,-0.333333,0.0169491,-4.03573E-8])|
|1.0 |(4,[0,1,2,3],[-0.5,0.75,-0.830508,-1.0]) |
|0.0 |(4,[0,2,3],[0.611111,0.694915,0.416667]) |
|0.0 |(4,[0,1,2,3],[0.222222,-0.166667,0.423729,0.583333]) |
|1.0 |(4,[0,1,2,3],[-0.722222,-0.166667,-0.864407,-1.0]) |
|1.0 |(4,[0,1,2,3],[-0.5,0.166667,-0.864407,-0.916667]) |
|2.0 |(4,[0,1,2,3],[-0.222222,-0.333333,0.0508474,-4.03573E-8]) |
|2.0 |(4,[0,1,2,3],[-0.0555556,-0.833333,0.0169491,-0.25]) |
|2.0 |(4,[0,1,2,3],[-0.166667,-0.416667,-0.0169491,-0.0833333]) |
|1.0 |(4,[0,2,3],[-0.944444,-0.898305,-0.916667]) |
|2.0 |(4,[0,1,2,3],[-0.277778,-0.583333,-0.0169491,-0.166667]) |
|0.0 |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667]) |
|2.0 |(4,[0,1,2,3],[-0.222222,-0.166667,0.0847457,-0.0833333]) |
+-----+-----------------------------------------------------------+
I am trying to fit the model for 3 labels.
scala> training.select("label").distinct.show
+-----+
|label|
+-----+
| 0.0|
| 1.0|
| 2.0|
+-----+
Fitting the logistic regression model:
import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(training)
Now, when I look at the coefficient matrix, I get a matrix with 3 rows (one per label) and 4 columns (one per input feature):
scala> lrModel.coefficientMatrix.toDense
res13: org.apache.spark.ml.linalg.DenseMatrix =
0.0 0.0 0.0 0.3176483191238039
0.0 0.0 -0.7803943459681859 -0.3769611423403096
0.0 0.0 0.0 0.0
These are the intercepts for each label:
scala> lrModel.interceptVector
res15: org.apache.spark.ml.linalg.Vector = [0.05165231659832854,-0.12391224990853622,0.07225993331020768]
I want to use the coefficient matrix and the intercept vector to create a feature importance Spark DataFrame, so that the final DataFrame looks like this:
label  feature name  coefficient  intercept
0      0             0            0.051
0      1             0            0.051
0      2             0            0.051
0      3             0.3176       0.051
1      0             0            -0.123
1      1             0            -0.123
1      2             -0.78        -0.123
1      3             -0.37        -0.123
2      0             0            0.072
2      1             0            0.072
2      2             0            0.072
2      3             0            0.072
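Each feature has one coefficient per label, so the total number of records in the output is labels * features, i.e. 3 * 4 = 12.
I want this process to be dynamic and wrapped in a function, so that I can reuse it for any number of features and labels.
I am reading the data from here: https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt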
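I am assuming that lr here is your fitted logistic regression model (lrModel in the question), using PySpark as the example. I tried the code below with multinomial logistic regression: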
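# coefficientMatrix has one row per label and one column per feature
weights = lr.coefficientMatrix
rows = weights.toArray().tolist()
# feature_cols: <your list of feature column names used for training>
df = spark.createDataFrame(rows, feature_cols)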
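The code above worked for me. Do not worry about the ordering: list the column names in the same order as the features you used for training.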
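The snippet above yields one row per label and one column per feature. For the long format the question asks for (one record per label and feature pair, with the intercept repeated), a minimal PySpark sketch might look like the following. The helper names `flatten_coefficients` and `coefficients_to_df` are hypothetical, and the sketch assumes that row i of the coefficient matrix and entry i of the intercept vector both belong to label i, as in the 3 x 4 matrix shown in the question.

```python
def flatten_coefficients(coef_rows, intercepts):
    """Turn a labels x features coefficient table into long-format rows.

    coef_rows:  list of lists, one inner list of coefficients per label
    intercepts: list with one intercept per label
    Returns a list of (label, feature, coefficient, intercept) tuples.
    """
    return [
        (label, feature, float(coef), float(intercepts[label]))
        for label, row in enumerate(coef_rows)   # one matrix row per label
        for feature, coef in enumerate(row)      # one column per feature
    ]


def coefficients_to_df(spark, model):
    """Build the long-format DataFrame from a fitted LogisticRegressionModel.

    Assumption: `spark` is an active SparkSession and `model` exposes
    coefficientMatrix and interceptVector, as the fitted model in the question does.
    """
    coef_rows = model.coefficientMatrix.toArray().tolist()
    intercepts = model.interceptVector.toArray().tolist()
    return spark.createDataFrame(
        flatten_coefficients(coef_rows, intercepts),
        ["label", "feature", "coefficient", "intercept"],
    )
```

Because the loops run over whatever shape the matrix has, the same function should work for any number of labels and features. For the model in the question that would be 3 * 4 = 12 rows. The same nested loop carries over to Scala, where the matrix dimensions are available as numRows and numCols.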