Create LabeledPoints from a Spark DataFrame & how to pass a list of names to VectorAssembler
I have a follow-up question to https://stackoverflow.com/a/32557330/5235052. I am trying to build LabeledPoints from a DataFrame that holds the features and the label in columns. The features are all boolean 1/0 values.
Here is a sample row from the DataFrame:
| 0| 0| 0| 0| 0| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1| 0| 1| 0| 0| 0| 0| 0| 0| 0| 0| 0| 1|
# Using the code from the above answer,
# create a list of feature names from the column names of the DataFrame
df_columns = []
for c in df.columns:
    if c == 'is_item_return':
        continue
    df_columns.append(c)
# using VectorAssembler for the transformation, taking only the first 5 column names
assembler = VectorAssembler()
assembler.setInputCols(df_columns[0:5])
assembler.setOutputCol('features')
transformed = assembler.transform(df)
# the mapping is also from the above link
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col
new_df = transformed.select(col('is_item_return'), col("features")).map(lambda row: LabeledPoint(row.is_item_return, row.features))
When I inspect the contents of the RDD, I get the right label, but the feature vector is wrong:
(0.0,(5,[],[]))
Could someone help me understand how to pass the column names of an existing DataFrame as feature names to the VectorAssembler?
There is nothing wrong here. What you get is a string representation of a SparseVector which exactly reflects your input: you take only the first five columns (assembler.setInputCols(df_columns[0:5])), so the output vector is of length 5, and since those five columns of your sample row contain no non-zero entries, the indices and values arrays are empty.
To illustrate this, let's use Scala, which provides the useful toSparse / toDense methods:
import org.apache.spark.mllib.linalg.Vectors
val v = Vectors.dense(Array(0.0, 0.0, 0.0, 0.0, 0.0))
v.toSparse.toString
// String = (5,[],[])
v.toSparse.toDense.toString
// String = [0.0,0.0,0.0,0.0,0.0]
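pyspark.mllib.linalg does not expose toSparse / toDense, but as a rough parallel, a minimal sketch that constructs the two representations directly:
from pyspark.mllib.linalg import DenseVector, SparseVector

# an all-zero vector of length 5, written as a sparse vector
v = SparseVector(5, [], [])
print(v)                         # (5,[],[])
# the same values in dense form
print(DenseVector(v.toArray()))  # [0.0,0.0,0.0,0.0,0.0]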
So, with PySpark and the VectorAssembler itself:
from pyspark.ml.feature import VectorAssembler

df = sc.parallelize([
    tuple([0.0] * 5),
    tuple([1.0] * 5),
    (1.0, 0.0, 1.0, 0.0, 1.0),
    (0.0, 1.0, 0.0, 1.0, 0.0)
]).toDF()

features = (VectorAssembler(inputCols=df.columns, outputCol="features")
    .transform(df)
    .select("features"))
features.show(4, False)
## +---------------------+
## |features |
## +---------------------+
## |(5,[],[]) |
## |[1.0,1.0,1.0,1.0,1.0]|
## |[1.0,0.0,1.0,0.0,1.0]|
## |(5,[1,3],[1.0,1.0]) |
## +---------------------+
It also shows that the assembler chooses a different representation depending on the number of non-zero entries:
features.flatMap(lambda x: x).map(type).collect()
## [pyspark.mllib.linalg.SparseVector,
## pyspark.mllib.linalg.DenseVector,
## pyspark.mllib.linalg.DenseVector,
## pyspark.mllib.linalg.SparseVector]
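To answer the original question directly, here is a minimal sketch of the whole pipeline. It assumes, as in the question, that the label column is named is_item_return and every other column is a feature, and it assumes Spark 1.x, where DataFrame.map is available directly (on Spark 2.x you would call .rdd.map instead and use the newer pyspark.ml vector classes):
from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.regression import LabeledPoint

# use every column except the label as a feature
feature_cols = [c for c in df.columns if c != 'is_item_return']

assembler = VectorAssembler(inputCols=feature_cols, outputCol='features')
labeled_points = (assembler.transform(df)
    .select('is_item_return', 'features')
    .map(lambda row: LabeledPoint(row.is_item_return, row.features)))
Passing the full feature_cols list instead of df_columns[0:5] gives vectors covering all feature columns, so rows like your sample would no longer collapse to an empty sparse vector over the first five (all-zero) columns.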