
How to extract a Vector from a Row using PySpark

I am trying to run logistic regression on sample data with PySpark. I am facing a problem applying LabeledPoint after hashing.

Input data frame:

+--+--------+
|C1|      C2|
+--+--------+
| 0|776ce399|
| 0|3486227d|
| 0|e5ba7672|
| 1|3486227d|
| 0|e5ba7672|
+--+--------+
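
For reference, the input frame above can be rebuilt with something like the following sketch (the sqlContext handle is an assumption matching the Spark 1.x era of this question; on Spark 2.x+ use spark.createDataFrame instead):

# Hypothetical reconstruction of the sample input DataFrame.
df = sqlContext.createDataFrame(
    [(0, "776ce399"), (0, "3486227d"), (0, "e5ba7672"),
     (1, "3486227d"), (0, "e5ba7672")],
    ["C1", "C2"])
df.show()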

After applying hashing on column C2:

from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# Split C2 into tokens, hash the tokens into a fixed-size feature
# space, then rescale the raw term frequencies with IDF.
tokenizer = Tokenizer(inputCol="C2", outputCol="words")
wordsData = tokenizer.transform(df)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)



+--+--------+--------------------+---------------+--------------------+
|C1|      C2|               words|    rawFeatures|            features|
+--+--------+--------------------+---------------+--------------------+
| 0|776ce399|ArrayBuffer(776ce...|(20,[15],[1.0])|(20,[15],[2.30003...|
| 0|3486227d|ArrayBuffer(34862...| (20,[0],[1.0])|(20,[0],[2.455603...|
| 0|e5ba7672|ArrayBuffer(e5ba7...| (20,[9],[1.0])|(20,[9],[0.660549...|
| 1|3486227d|ArrayBuffer(34862...| (20,[0],[1.0])|(20,[0],[2.455603...|
| 0|e5ba7672|ArrayBuffer(e5ba7...| (20,[9],[1.0])|(20,[9],[0.660549...|
+--+--------+--------------------+---------------+--------------------+

Now, to apply logistic regression, when I build a LabeledPoint:

temp = rescaledData.map(lambda line: LabeledPoint(line[0], line[4]))

I get the following error:

ValueError: setting an array element with a sequence.
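
For what it's worth, a common cause of this error is feeding LabeledPoint (from the RDD-based pyspark.mllib API) a vector produced by the DataFrame-based pyspark.ml API. A hedged workaround, assuming Spark 2.0+ where Vectors.fromML is available, is to convert each vector explicitly:

from pyspark.mllib.linalg import Vectors as MLlibVectors
from pyspark.mllib.regression import LabeledPoint

# Sketch: select only the label and features columns, then convert each
# ml vector to an mllib vector before building LabeledPoints.
temp = rescaledData.select("C1", "features").rdd.map(
    lambda row: LabeledPoint(float(row[0]), MLlibVectors.fromML(row[1])))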

Please help.

Thanks for the suggestion, zero323.

I implemented it using the Pipeline concept:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

# Cast the label to double, as required by LogisticRegression.
dfWithLabel = df.withColumn("label", col("C1").cast(DoubleType()))

# Chain tokenizing, hashing, IDF rescaling, and logistic regression
# into a single Pipeline.
tokenizer = Tokenizer(inputCol="C2", outputCol="D2")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="E2")
idf = IDF(inputCol=hashingTF.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, lr])

# Fit the pipeline to the training documents.
model = pipeline.fit(dfWithLabel)
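
Once fitted, the resulting PipelineModel can be applied back to a DataFrame with transform. A short usage sketch (prediction and probability are the default output columns added by LogisticRegression):

# Score the training data with the fitted pipeline and inspect results.
predictions = model.transform(dfWithLabel)
predictions.select("C2", "label", "probability", "prediction").show()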
