Spark: How to run logistic regression using only some features from LabeledPoint?

Question

I have an RDD of LabeledPoint on which I want to run logistic regression:

Data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = 
MapPartitionsRDD[3335] at map at <console>:44

using this code:

val splits = Data.randomSplit(Array(0.75, 0.25), seed = 2L)
val training = splits(0).cache()
val test = splits(1)

val numIterations = 100
val model = LogisticRegressionWithSGD.train(training, numIterations)

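My problem is that I do not want to use all the features in the LabeledPoint, only some of them. I have a list of features that I do not want to use, for example: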

LoF=List(223244,334453...

How can I get only the features I want to use from the LabeledPoint, or select them in the logistic regression?

Answer 1

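Feature selection lets you pick the most relevant features for use in model construction. It reduces the size of the vector space and, in turn, the complexity of any subsequent operation on the vectors. The number of features to select can be tuned using a held-out validation set.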

One way to accomplish what you are looking for is to use ElementwiseProduct.

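ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This is the Hadamard product between the input vector v and the transforming vector w, which yields a result vector.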
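Written out for vectors of length n, the product is taken component by component. The small numeric case below uses made-up values purely for illustration; a weight of 0 zeroes a component and a weight of 1 leaves it unchanged:

```latex
\[
(v \circ w)_i = v_i \, w_i, \qquad i = 1, \dots, n
\]
\[
\begin{pmatrix} 2 \\ 7 \\ 4 \end{pmatrix}
\circ
\begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}
=
\begin{pmatrix} 0 \\ 7 \\ 4 \end{pmatrix}
\]
```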

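So if we set the weight of each feature we want to keep to 1.0 and the weight of every other feature to 0.0, the ElementwiseProduct of the original vector and this 0-1 weight vector keeps the features we need and zeroes out the rest (the vector itself keeps its original dimension):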

import org.apache.spark.mllib.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Create a dummy LabeledPoint RDD
val data = sc.parallelize(Array(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0, 5.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(4.0, 5.0, 6.0, 1.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(4.0, 2.0, 3.0, 0.0, 2.0))))

data.toDF.show

// +-----+--------------------+
// |label|            features|
// +-----+--------------------+
// |  1.0|[1.0,0.0,3.0,5.0,...|
// |  1.0|[4.0,5.0,6.0,1.0,...|
// |  0.0|[4.0,2.0,3.0,0.0,...|
// +-----+--------------------+

// You need to know how many features you have; this example uses 5
val numFeatures = 5

// Zero-based indices of the features we want to keep
// Note: because indices start at 0, this keeps the 4th and 5th features.
val indices = Array(3, 4)

// Now we can create the weights: 1.0 for every feature we keep
val weights = Array.fill[Double](indices.size)(1.0)

// Create the sparse vector of the features we need to keep.
val transformingVector = Vectors.sparse(numFeatures, indices, weights)

// Init our vector transformer
val transformer = new ElementwiseProduct(transformingVector)

// Apply it to the data.
val transformedData = data.map(x => LabeledPoint(x.label,transformer.transform(x.features).toSparse))

transformedData.toDF.show

// +-----+-------------------+
// |label|           features|
// +-----+-------------------+
// |  1.0|(5,[3,4],[5.0,1.0])|
// |  1.0|(5,[3,4],[1.0,2.0])|
// |  0.0|      (5,[4],[2.0])|
// +-----+-------------------+
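
The question starts from a list of features to exclude, while the code above takes the indices to keep. A minimal sketch of bridging the two, assuming the exclusion list holds zero-based feature indices (the question does not say so explicitly); `KeepIndices` and `keepIndices` are hypothetical names, not part of Spark:

```scala
// Hypothetical helper, plain Scala with no Spark dependency.
object KeepIndices {
  // Returns the zero-based indices in [0, numFeatures) that are NOT listed
  // in `excluded`, in ascending order (sparse vectors expect sorted indices).
  def keepIndices(numFeatures: Int, excluded: Seq[Int]): Array[Int] = {
    val drop = excluded.toSet
    (0 until numFeatures).filterNot(drop.contains).toArray
  }
}
```

With 5 features and indices 0, 1 and 2 excluded, `KeepIndices.keepIndices(5, List(0, 1, 2))` gives the same `Array(3, 4)` that is hard-coded as `indices` above.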

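Note:

As you may have noticed, I used the sparse vector representation for space optimization.
The resulting features are sparse vectors.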
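
The masking approach keeps every vector at its original length. If you would rather shrink the vectors, a different technique is to copy only the kept positions into a new, shorter vector. This is a rough sketch rather than anything from the answer above; `FeatureProjection` and `dropFeatures` are hypothetical names, and it assumes the MLlib `Vector` in use supports indexed access via `apply`:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical helper (not part of MLlib).
object FeatureProjection {
  // Builds a new LabeledPoint whose dense feature vector contains only the
  // values at the zero-based positions in `keep`, in the given order.
  def dropFeatures(point: LabeledPoint, keep: Array[Int]): LabeledPoint =
    LabeledPoint(point.label, Vectors.dense(keep.map(i => point.features(i))))
}
```

Applied to the example RDD, `data.map(FeatureProjection.dropFeatures(_, Array(3, 4)))` should give two-dimensional feature vectors. Either that RDD or `transformedData` can then be split and passed to `LogisticRegressionWithSGD.train` as in the question.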
