
PySpark insert a constant SparseVector in a Dataframe column

I wish to insert into my dataframe tfIdfFr a column named "ref" holding a constant whose type is pyspark.ml.linalg.SparseVector.

When I try this:

ref = tfidfTest.select("features").collect()[0].features # the reference
tfIdfFr.withColumn("ref", ref).select("ref", "features").show()

I get this error: AssertionError: col should be Column

And when I try this:

from pyspark.sql.functions import lit
tfIdfFr.withColumn("ref", lit(ref)).select("ref", "features").show()

I get this error: AttributeError: 'SparseVector' object has no attribute '_get_object_id'

Do you know a solution to insert a constant SparseVector in a Dataframe column?

In this case I'd just skip collect altogether. withColumn expects a Column expression, and lit cannot turn a UDT value such as a SparseVector into a literal, which is why both attempts fail. Keep the reference as a one-row DataFrame instead and replicate it with a cross join:

from pyspark.sql.functions import col

ref = tfidfTest.select(col("features").alias("ref")).limit(1)  # one-row DataFrame
tfIdfFr.crossJoin(ref)  # replicates the single ref row onto every row
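
For reference, here is a minimal self-contained sketch of the cross-join trick; the toy df below is an illustrative stand-in for the question's DataFrames, not code from the question:

from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Toy stand-in for the question's data: one vector column named "features"
df = spark.createDataFrame(
    [(Vectors.sparse(4, [0, 2], [1.0, 3.0]),),
     (Vectors.sparse(4, [1], [2.0]),)],
    ["features"])

# Take the first vector as a one-row DataFrame and paste it onto every row
ref = df.select(col("features").alias("ref")).limit(1)
df.crossJoin(ref).select("ref", "features").show(truncate=False)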

In general you can either use a udf:

from pyspark.ml.linalg import DenseVector, SparseVector, Vector, Vectors, \
    VectorUDT
from pyspark.sql.functions import udf

def vector_lit(v):
    assert isinstance(v, Vector)
    # A zero-argument udf returning the captured vector yields the same
    # constant value for every row
    return udf(lambda: v, VectorUDT())()

Usage:

spark.range(1).select(
    vector_lit(Vectors.sparse(5, [1, 3], [-1, 1])).alias("ref")
).show()
+--------------------+
|                 ref|
+--------------------+
|(5,[1,3],[-1.0,1.0])|
+--------------------+
spark.range(1).select(vector_lit(Vectors.dense([1, 2, 3])).alias("ref")).show() 
+-------------+
|          ref|
+-------------+
|[1.0,2.0,3.0]|
+-------------+
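
Tying this back to the question (a sketch assuming the tfidfTest and tfIdfFr DataFrames from the question):

# Collect the reference vector once, then embed it as a constant column
ref = tfidfTest.select("features").collect()[0].features
tfIdfFr.withColumn("ref", vector_lit(ref)).select("ref", "features").show()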

It is also possible to use an intermediate JSON representation, serializing the vector to a JSON literal and letting from_json rebuild it as a VectorUDT column:

import json
from pyspark.sql.functions import from_json, lit
from pyspark.sql.types import StructType, StructField

def as_column(v):
    assert isinstance(v, Vector)
    if isinstance(v, DenseVector):
        # "type": 1 marks a dense vector in the VectorUDT JSON encoding
        j = lit(json.dumps({"v": {
            "type": 1,
            "values": v.values.tolist()
        }}))
    else:
        # "type": 0 marks a sparse vector
        j = lit(json.dumps({"v": {
            "type": 0,
            "size": v.size,
            "indices": v.indices.tolist(),
            "values": v.values.tolist()
        }}))
    # Parse the JSON literal back into a VectorUDT column and extract it
    return from_json(j, StructType([StructField("v", VectorUDT())]))["v"]

Usage:

spark.range(1).select(
    as_column(Vectors.sparse(5, [1, 3], [-1, 1])).alias("ref")
).show()
+--------------------+
|                 ref|
+--------------------+
|(5,[1,3],[-1.0,1.0])|
+--------------------+
spark.range(1).select(as_column(Vectors.dense([1, 2, 3])).alias("ref")).show()
+-------------+
|          ref|
+-------------+
|[1.0,2.0,3.0]|
+-------------+
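
Unlike the udf version, as_column is built entirely from lit and from_json, so the constant is produced by built-in expressions without invoking Python per row. Applied to the question it reads the same (again assuming the question's DataFrames):

ref = tfidfTest.select("features").collect()[0].features
tfIdfFr.withColumn("ref", as_column(ref)).select("ref", "features").show()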
