
VectorType for StructType in Pyspark Schema

I'm reading a parquet file that has the following schema:

df.printSchema()

root
 |-- time: integer (nullable = true)
 |-- amountRange: integer (nullable = true)
 |-- label: integer (nullable = true)
 |-- pcaVector: vector (nullable = true)

Now I want to test Pyspark structured streaming, and I want to use the same parquet files. The closest schema I was able to create uses ArrayType, but it doesn't work:

from pyspark.sql.types import (
    StructType, StructField, IntegerType, ArrayType, FloatType
)

schema = StructType(
    [
        StructField('time', IntegerType()),
        StructField('amountRange', IntegerType()),
        StructField('label', IntegerType()),
        StructField('pcaVector', ArrayType(FloatType()))
    ]
)
df_stream = spark.readStream\
    .format("parquet")\
    .schema(schema)\
    .load("/home/user/test_arch/data/fraud/")

Caused by: java.lang.ClassCastException: Expected instance of group converter but got "org.apache.spark.sql.execution.datasources.parquet.ParquetPrimitiveConverter"
        at org.apache.parquet.io.api.Converter.asGroupConverter(Converter.java:37)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter$RepeatedGroupConverter.<init>(ParquetRowConverter.scala:659)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRowConverter.org$apache$spark$sql$execution$datasources$parquet$ParquetRowConverter$$newConverter(ParquetRowConverter.scala:308)

How can I create a schema with VectorType, which seems to exist only for Scala, for the StructType in Pyspark?

The type is VectorUDT:

from pyspark.ml.linalg import VectorUDT

StructField('pcaVector', VectorUDT())
