
How to extract a Vector from a Row using PySpark

I am trying to run logistic regression on sample data with PySpark. I am facing a problem applying LabeledPoint after hashing.

Input data frame:

+--+--------+
|C1|      C2|
+--+--------+
| 0|776ce399|
| 0|3486227d|
| 0|e5ba7672|
| 1|3486227d|
| 0|e5ba7672|
+--+--------+
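
For reference, the input frame above can be rebuilt with something like the following sketch (the sqlContext handle is an assumption matching the Spark 1.x era of this question; on Spark 2.x+ use spark.createDataFrame instead):

# Hypothetical reconstruction of the sample input DataFrame.
df = sqlContext.createDataFrame(
    [(0, "776ce399"), (0, "3486227d"), (0, "e5ba7672"),
     (1, "3486227d"), (0, "e5ba7672")],
    ["C1", "C2"])
df.show()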

After applying hashing on column C2:

from pyspark.ml.feature import HashingTF, IDF, Tokenizer

# Split C2 into tokens, hash the tokens into a fixed-size feature
# space, then rescale the raw term frequencies with IDF.
tokenizer = Tokenizer(inputCol="C2", outputCol="words")
wordsData = tokenizer.transform(df)
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
featurizedData = hashingTF.transform(wordsData)
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(featurizedData)
rescaledData = idfModel.transform(featurizedData)



+--+--------+--------------------+---------------+--------------------+
|C1|      C2|               words|    rawFeatures|            features|
+--+--------+--------------------+---------------+--------------------+
| 0|776ce399|ArrayBuffer(776ce...|(20,[15],[1.0])|(20,[15],[2.30003...|
| 0|3486227d|ArrayBuffer(34862...| (20,[0],[1.0])|(20,[0],[2.455603...|
| 0|e5ba7672|ArrayBuffer(e5ba7...| (20,[9],[1.0])|(20,[9],[0.660549...|
| 1|3486227d|ArrayBuffer(34862...| (20,[0],[1.0])|(20,[0],[2.455603...|
| 0|e5ba7672|ArrayBuffer(e5ba7...| (20,[9],[1.0])|(20,[9],[0.660549...|
+--+--------+--------------------+---------------+--------------------+

Now, to apply logistic regression, when I build a LabeledPoint:

temp = rescaledData.map(lambda line: LabeledPoint(line[0], line[4]))

I get the following error:

ValueError: setting an array element with a sequence.
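
For what it's worth, a common cause of this error is feeding LabeledPoint (from the RDD-based pyspark.mllib API) a vector produced by the DataFrame-based pyspark.ml API. A hedged workaround, assuming Spark 2.0+ where Vectors.fromML is available, is to convert each vector explicitly:

from pyspark.mllib.linalg import Vectors as MLlibVectors
from pyspark.mllib.regression import LabeledPoint

# Sketch: select only the label and features columns, then convert each
# ml vector to an mllib vector before building LabeledPoints.
temp = rescaledData.select("C1", "features").rdd.map(
    lambda row: LabeledPoint(float(row[0]), MLlibVectors.fromML(row[1])))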

Please help.

Thanks for the suggestion, zero323.

I implemented it using the Pipeline concept:

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType

# Cast the label to double, as required by LogisticRegression.
dfWithLabel = df.withColumn("label", col("C1").cast(DoubleType()))

# Chain tokenizing, hashing, IDF rescaling, and logistic regression
# into a single Pipeline.
tokenizer = Tokenizer(inputCol="C2", outputCol="D2")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="E2")
idf = IDF(inputCol=hashingTF.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, idf, lr])

# Fit the pipeline to the training documents.
model = pipeline.fit(dfWithLabel)
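
Once fitted, the resulting PipelineModel can be applied back to a DataFrame with transform. A short usage sketch (prediction and probability are the default output columns added by LogisticRegression):

# Score the training data with the fitted pipeline and inspect results.
predictions = model.transform(dfWithLabel)
predictions.select("C2", "label", "probability", "prediction").show()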
