Create LabeledPoints from a Spark DataFrame & how to pass a list of names to VectorAssembler
I have a follow-up question to https://stackoverflow.com/a/32557330/5235052. I am trying to build LabeledPoints from a DataFrame that holds the features and the label in columns. The features are all boolean 1/0 values.
Here is a sample row from the DataFrame:
| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1|
# Using the code from the above answer,
# create a list of feature names from the column names of the DataFrame
df_columns = []
for c in df.columns:
    if c == 'is_item_return':
        continue
    df_columns.append(c)
# using VectorAssembler for the transformation, taking only the first 5 column names
assembler = VectorAssembler()
assembler.setInputCols(df_columns[0:5])
assembler.setOutputCol('features')
transformed = assembler.transform(df)
# the mapping is also from the above link
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col
new_df = transformed.select(col('is_item_return'), col("features")).map(lambda row: LabeledPoint(row.is_item_return, row.features))
When I inspect the contents of the RDD, I get the right label, but the feature vector is wrong:
(0.0,(5,[],[]))
Could someone help me understand how to pass the column names of an existing DataFrame as feature names to the VectorAssembler?
There is nothing wrong here. What you get is a string representation of a SparseVector which exactly reflects your input: you take only the first five columns (assembler.setInputCols(df_columns[0:5])), so the output vector is of length 5, and since those five columns of your sample row contain no non-zero entries, the indices and values arrays are empty.
To illustrate this, let's use Scala, which provides the useful toSparse / toDense methods:
import org.apache.spark.mllib.linalg.Vectors
val v = Vectors.dense(Array(0.0, 0.0, 0.0, 0.0, 0.0))
v.toSparse.toString
// String = (5,[],[])
v.toSparse.toDense.toString
// String = [0.0,0.0,0.0,0.0,0.0]
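pyspark.mllib.linalg does not expose toSparse / toDense, but as a rough parallel, a minimal sketch that constructs the two representations directly:
from pyspark.mllib.linalg import DenseVector, SparseVector

# an all-zero vector of length 5, written as a sparse vector
v = SparseVector(5, [], [])
print(v)                         # (5,[],[])
# the same values in dense form
print(DenseVector(v.toArray()))  # [0.0,0.0,0.0,0.0,0.0]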
So, with PySpark and the VectorAssembler itself:
from pyspark.ml.feature import VectorAssembler

df = sc.parallelize([
    tuple([0.0] * 5),
    tuple([1.0] * 5),
    (1.0, 0.0, 1.0, 0.0, 1.0),
    (0.0, 1.0, 0.0, 1.0, 0.0)
]).toDF()

features = (VectorAssembler(inputCols=df.columns, outputCol="features")
    .transform(df)
    .select("features"))
features.show(4, False)
## +---------------------+
## |features |
## +---------------------+
## |(5,[],[]) |
## |[1.0,1.0,1.0,1.0,1.0]|
## |[1.0,0.0,1.0,0.0,1.0]|
## |(5,[1,3],[1.0,1.0]) |
## +---------------------+
It also shows that the assembler chooses a different representation depending on the number of non-zero entries:
features.flatMap(lambda x: x).map(type).collect()
## [pyspark.mllib.linalg.SparseVector,
## pyspark.mllib.linalg.DenseVector,
## pyspark.mllib.linalg.DenseVector,
## pyspark.mllib.linalg.SparseVector]
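To answer the original question directly, here is a minimal sketch of the whole pipeline. It assumes, as in the question, that the label column is named is_item_return and every other column is a feature, and it assumes Spark 1.x, where DataFrame.map is available directly (on Spark 2.x you would call .rdd.map instead and use the newer pyspark.ml vector classes):
from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.regression import LabeledPoint

# use every column except the label as a feature
feature_cols = [c for c in df.columns if c != 'is_item_return']

assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
labeled_points = (assembler.transform(df)
    .select('is_item_return', 'features')
    .map(lambda row: LabeledPoint(row.is_item_return, row.features)))
Passing the full feature_cols list instead of df_columns[0:5] gives vectors covering all feature columns, so rows like your sample would no longer collapse to an empty sparse vector over the first five (all-zero) columns.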