
Create LabeledPoints from a Spark DataFrame & how to pass a list of names to VectorAssembler

I have a follow-up question to https://stackoverflow.com/a/32557330/5235052. I am trying to build LabeledPoints from a DataFrame that has the features and the label in columns. The features are all boolean, encoded as 1/0.

Here is a sample row from the DataFrame:

|             0|       0|        0|            0|       0|            0|     1|        0|     0|           0|       0|       0|       0|           0|        0|         0|      0|            0|       0|           0|          0|         0|         0|              0|        0|        0|        0|         0|          0|    1|    0|    1|    0|    0|       0|           0|    0|     0|     0|     0|         0|         1|
#Using the code from the above answer,
#create a list of feature names from the column names of the dataframe
df_columns = []
for  c in df.columns:
    if c == 'is_item_return': continue
    df_columns.append(c)

#using VectorAssembler for the transformation; using only the first five column names
assembler = VectorAssembler()
assembler.setInputCols(df_columns[0:5])
assembler.setOutputCol('features')

transformed = assembler.transform(df)

#mapping also from the above link
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

new_df = transformed.select(col('is_item_return'), col("features")).map(lambda row: LabeledPoint(row.is_item_return, row.features))

When I inspect the contents of the resulting RDD, I get the right label, but the feature vector is wrong:

(0.0,(5,[],[]))

Could someone help me understand how to pass the column names of an existing DataFrame as feature names to the VectorAssembler?

There is nothing wrong here. What you get is a string representation of a SparseVector, which exactly reflects your input:

  • you take the first five columns ( assembler.setInputCols(df_columns[0:5]) ), so the output vector has length 5
  • since the first five columns of the example input contain no non-zero entries, the indices and values arrays are empty

To illustrate this, let's use Scala, which provides the useful toSparse / toDense methods:

import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.dense(Array(0.0, 0.0, 0.0, 0.0, 0.0))
v.toSparse.toString
// String = (5,[],[])

v.toSparse.toDense.toString
// String = [0.0,0.0,0.0,0.0,0.0]
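The same round-trip can be sketched in plain Python, without Spark, assuming the standard (size, indices, values) layout that the `(5,[],[])` string encodes (the helper below is only an illustration, not a Spark API):

```python
def sparse_to_dense(size, indices, values):
    # Expand a (size, indices, values) sparse representation into a full list.
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

print(sparse_to_dense(5, [], []))              # [0.0, 0.0, 0.0, 0.0, 0.0]
print(sparse_to_dense(5, [1, 3], [1.0, 1.0]))  # [0.0, 1.0, 0.0, 1.0, 0.0]
```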

So with PySpark:

from pyspark.ml.feature import VectorAssembler

df = sc.parallelize([
    tuple([0.0] * 5),
    tuple([1.0] * 5), 
    (1.0, 0.0, 1.0, 0.0, 1.0),
    (0.0, 1.0, 0.0, 1.0, 0.0)
]).toDF()

features = (VectorAssembler(inputCols=df.columns, outputCol="features")
    .transform(df)
    .select("features"))

features.show(4, False)

## +---------------------+
## |features             |
## +---------------------+
## |(5,[],[])            |
## |[1.0,1.0,1.0,1.0,1.0]|
## |[1.0,0.0,1.0,0.0,1.0]|
## |(5,[1,3],[1.0,1.0])  |
## +---------------------+

It also shows that the assembler chooses a different representation depending on the number of non-zero entries:

features.flatMap(lambda x: x).map(type).collect()

## [pyspark.mllib.linalg.SparseVector,
##  pyspark.mllib.linalg.DenseVector,
##  pyspark.mllib.linalg.DenseVector,
##  pyspark.mllib.linalg.SparseVector]
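As a rough sketch of that choice (the exact threshold Spark uses internally is an implementation detail; the 50% cutoff below is an assumption for illustration only):

```python
def choose_representation(values, threshold=0.5):
    # Hypothetical heuristic: store sparsely when the fraction of
    # non-zero entries is at or below the threshold, densely otherwise.
    # Spark applies a similar size-based rule when assembling vectors.
    nnz = sum(1 for v in values if v != 0.0)
    if nnz / len(values) <= threshold:
        indices = [i for i, v in enumerate(values) if v != 0.0]
        return ("sparse", len(values), indices, [values[i] for i in indices])
    return ("dense", list(values))

print(choose_representation([0.0, 0.0, 0.0, 0.0, 0.0])[0])  # sparse
print(choose_representation([1.0, 1.0, 1.0, 1.0, 1.0])[0])  # dense
print(choose_representation([0.0, 1.0, 0.0, 1.0, 0.0]))     # sparse form with indices [1, 3]
```

This reproduces the pattern in the output above: all-zero and mostly-zero rows print as sparse vectors, mostly-non-zero rows as dense ones.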
