
How to convert Spark Dense Matrix to a Spark Dataframe

I am trying to implement some code in Scala Spark where I have a multiclass logistic regression model, and the model generates a coefficient matrix.

Here is the code -

val training = spark.read.format("libsvm").load("data/mllib/sample_multiclass_classification_data.txt")


training.show(false)
+-----+-----------------------------------------------------------+
|label|features                                                   |
+-----+-----------------------------------------------------------+
|1.0  |(4,[0,1,2,3],[-0.222222,0.5,-0.762712,-0.833333])          |
|1.0  |(4,[0,1,2,3],[-0.555556,0.25,-0.864407,-0.916667])         |
|1.0  |(4,[0,1,2,3],[-0.722222,-0.166667,-0.864407,-0.833333])    |
|1.0  |(4,[0,1,2,3],[-0.722222,0.166667,-0.694915,-0.916667])     |
|0.0  |(4,[0,1,2,3],[0.166667,-0.416667,0.457627,0.5])            |
|1.0  |(4,[0,2,3],[-0.833333,-0.864407,-0.916667])                |
|2.0  |(4,[0,1,2,3],[-1.32455E-7,-0.166667,0.220339,0.0833333])   |
|2.0  |(4,[0,1,2,3],[-1.32455E-7,-0.333333,0.0169491,-4.03573E-8])|
|1.0  |(4,[0,1,2,3],[-0.5,0.75,-0.830508,-1.0])                   |
|0.0  |(4,[0,2,3],[0.611111,0.694915,0.416667])                   |
|0.0  |(4,[0,1,2,3],[0.222222,-0.166667,0.423729,0.583333])       |
|1.0  |(4,[0,1,2,3],[-0.722222,-0.166667,-0.864407,-1.0])         |
|1.0  |(4,[0,1,2,3],[-0.5,0.166667,-0.864407,-0.916667])          |
|2.0  |(4,[0,1,2,3],[-0.222222,-0.333333,0.0508474,-4.03573E-8])  |
|2.0  |(4,[0,1,2,3],[-0.0555556,-0.833333,0.0169491,-0.25])       |
|2.0  |(4,[0,1,2,3],[-0.166667,-0.416667,-0.0169491,-0.0833333])  |
|1.0  |(4,[0,2,3],[-0.944444,-0.898305,-0.916667])                |
|2.0  |(4,[0,1,2,3],[-0.277778,-0.583333,-0.0169491,-0.166667])   |
|0.0  |(4,[0,1,2,3],[0.111111,-0.333333,0.38983,0.166667])        |
|2.0  |(4,[0,1,2,3],[-0.222222,-0.166667,0.0847457,-0.0833333])   |
+-----+-----------------------------------------------------------+

I am trying to fit the model for 3 labels.

scala> training.select("label").distinct.show
+-----+
|label|
+-----+
|  0.0|
|  1.0|
|  2.0|
+-----+

Fitting the logistic regression model:

import org.apache.spark.ml.classification.LogisticRegression
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(training)

Now when I try to view the coefficient matrix, it gives me a matrix with 3 rows (for the 3 labels) and 4 columns (for the 4 input features).

scala> lrModel.coefficientMatrix.toDense
res13: org.apache.spark.ml.linalg.DenseMatrix =
0.0  0.0  0.0                  0.3176483191238039
0.0  0.0  -0.7803943459681859  -0.3769611423403096
0.0  0.0  0.0                  0.0

And here are the intercepts for each label -

scala> lrModel.interceptVector
res15: org.apache.spark.ml.linalg.Vector = [0.05165231659832854,-0.12391224990853622,0.07225993331020768]

I want to use the coefficient matrix and the intercept vector to create a feature importance Spark dataframe, so that the final result dataframe looks like this -

label feature name  coefficient intercept
0         0             0         0.051
0         1             0         0.051
0         2             0         0.051
0         3             0.3176    0.051
1         0             0         -0.123
1         1             0         -0.123
1         2             -0.78     -0.123
1         3             -0.37     -0.123
2         0             0         0.072
2         1             0         0.072
2         2             0         0.072
2         3             0         0.072

Each feature has a coefficient for each label, so the total number of records in the output will be labels * features, i.e. 3 * 4 = 12.

I want this process to be dynamic and wrapped in a function, so that I can reuse it for any number of features and labels; a sketch of the idea follows below.
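For reference, a minimal sketch of such a function (assuming the lrModel fitted above; coefficientsDF is just an illustrative name, not a Spark API):

import org.apache.spark.ml.classification.LogisticRegressionModel
import spark.implicits._  // already in scope in spark-shell

def coefficientsDF(model: LogisticRegressionModel) = {
  val coeffs = model.coefficientMatrix   // numRows = labels, numCols = features
  val intercepts = model.interceptVector // one intercept per label
  val rows = for {
    label   <- 0 until coeffs.numRows
    feature <- 0 until coeffs.numCols
  } yield (label, feature, coeffs(label, feature), intercepts(label))
  // labels * features records, i.e. 3 * 4 = 12 here
  rows.toDF("label", "feature name", "coefficient", "intercept")
}

coefficientsDF(lrModel).show(false)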

I am reading the data from here - https://github.com/apache/spark/blob/master/data/mllib/sample_multiclass_classification_data.txt

I am assuming that lr here is the logistic regression model from your example, in PySpark. I tried the code below with multinomial logistic regression -

# `lr` here is the fitted LogisticRegressionModel
weights = lr.coefficientMatrix
rows = weights.toArray().tolist()  # one row of coefficients per label
df = spark.createDataFrame(rows, ["<your list of feature columns used for training>"])

The above code worked for me; don't worry about the ordering, just assign the feature column names in the same order you used for training.
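For the Scala setup in the question, a rough equivalent of this approach could look like the sketch below (the f0..f3 column names are placeholders for the real feature names):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// One DataFrame row per label, one column per training feature
val coeffs = lrModel.coefficientMatrix
val schema = StructType((0 until coeffs.numCols).map(i => StructField(s"f$i", DoubleType)))
val rowRDD = spark.sparkContext.parallelize(
  coeffs.rowIter.map(v => Row.fromSeq(v.toArray.toSeq)).toList)
val weightsDF = spark.createDataFrame(rowRDD, schema)
weightsDF.show()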
