
How to extract a Vector from a Row using PySpark

I am trying to run logistic regression on sample data using PySpark. I run into a problem when applying LabeledPoint after hashing.

Input DataFrame:

+--+--------+
|C1|      C2|
+--+--------+
| 0|776ce399|
| 0|3486227d|
| 0|e5ba7672|
| 1|3486227d|
| 0|e5ba7672|
+--+--------+

在列C2上應用哈希之后,

from pyspark.ml.feature import HashingTF, IDF, Tokenizer

tokenizer = Tokenizer(inputCol="C2", outputCol="words")
wordsData = tokenizer.transform(df)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)



+--+--------+--------------------+---------------+--------------------+
|C1|      C2|               words|    rawFeatures|            features|
+--+--------+--------------------+---------------+--------------------+
| 0|776ce399|ArrayBuffer(776ce...|(20,[15],[1.0])|(20,[15],[2.30003...|
| 0|3486227d|ArrayBuffer(34862...| (20,[0],[1.0])|(20,[0],[2.455603...|
| 0|e5ba7672|ArrayBuffer(e5ba7...| (20,[9],[1.0])|(20,[9],[0.660549...|
| 1|3486227d|ArrayBuffer(34862...| (20,[0],[1.0])|(20,[0],[2.455603...|
| 0|e5ba7672|ArrayBuffer(e5ba7...| (20,[9],[1.0])|(20,[9],[0.660549...|
+--+--------+--------------------+---------------+--------------------+

Now, to apply logistic regression, I build a LabeledPoint for each row:

temp = rescaledData.map(lambda line: LabeledPoint(line[0],line[4]))

I get the following error:

ValueError: setting an array element with a sequence.

Please help.

Thanks to zero323 for the suggestion. I implemented it using the Pipeline approach:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer, IDF
from pyspark.sql import Row
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType



dfWithLabel = df.withColumn("label", col("C1").cast(DoubleType()))
tokenizer = Tokenizer(inputCol="C2", outputCol="D2")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="E2")
idf = IDF(inputCol=hashingTF.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, lr])

# Fit the pipeline to training documents.
model = pipeline.fit(dfWithLabel) 
