
How to extract a Vector from a Row using PySpark

I am trying to run logistic regression on sample data using PySpark. I run into a problem when applying LabeledPoint after hashing.

Input DataFrame:

+--+--------+
|C1|      C2|
+--+--------+
| 0|776ce399|
| 0|3486227d|
| 0|e5ba7672|
| 1|3486227d|
| 0|e5ba7672|
+--+--------+

在列C2上應用哈希之后,

from pyspark.ml.feature import HashingTF, IDF, Tokenizer

tokenizer = Tokenizer(inputCol="C2", outputCol="words")
wordsData = tokenizer.transform(df)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)



+--+--------+--------------------+---------------+--------------------+
|C1|      C2|               words|    rawFeatures|            features|
+--+--------+--------------------+---------------+--------------------+
| 0|776ce399|ArrayBuffer(776ce...|(20,[15],[1.0])|(20,[15],[2.30003...|
| 0|3486227d|ArrayBuffer(34862...| (20,[0],[1.0])|(20,[0],[2.455603...|
| 0|e5ba7672|ArrayBuffer(e5ba7...| (20,[9],[1.0])|(20,[9],[0.660549...|
| 1|3486227d|ArrayBuffer(34862...| (20,[0],[1.0])|(20,[0],[2.455603...|
| 0|e5ba7672|ArrayBuffer(e5ba7...| (20,[9],[1.0])|(20,[9],[0.660549...|
+--+--------+--------------------+---------------+--------------------+

Now, to apply logistic regression, I build a LabeledPoint for each row:

temp = rescaledData.map(lambda line: LabeledPoint(line[0],line[4]))

I get the following error:

ValueError: setting an array element with a sequence.

Please help.

Thanks to zero323 for the suggestion. I implemented it using the Pipeline approach:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer, IDF
from pyspark.sql import Row
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType



dfWithLabel = df.withColumn("label", col("C1").cast(DoubleType()))
tokenizer = Tokenizer(inputCol="C2", outputCol="D2")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="E2")
idf = IDF(inputCol=hashingTF.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(dfWithLabel) 
