ClassCastException when Spark reads from Teradata and writes Parquet

Question

I am running a Spark job that reads a DataFrame from a Teradata DBMS using a SQL query.

When the job writes the files to S3 as Parquet, as in

partition_keys = ["Cat$col1", "Cat$col2"]
df.write.mode("overwrite").partitionBy(partition_keys)

it throws the following java.lang.ClassCastException:

File "/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1249, in parquet
  File "/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
  File "/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o58.parquet.
: java.lang.ClassCastException: class java.util.ArrayList cannot be cast to class java.lang.String (java.util.ArrayList and java.lang.String are in module java.base of loader 'bootstrap')

The schema of the DataFrame is:

StructType(List(StructField(Cat$col1,IntegerType,true),StructField(Cat$col2,StringType,true),StructField(Cat$col3,DateType,true),StructField(Cat$col4,DecimalType(13,2),true),StructField(Cat$col5,IntegerType,true),StructField(Cat$col6,IntegerType,true),StructField(Cat$col7,StringType,true),StructField(Cat$col8,StringType,true),StructField(Cat$col9,StringType,true),StructField(Cat$col10,StringType,true)))
root
 |-- Cat$col1: integer (nullable = true)
 |-- Cat$col2: string (nullable = true)
 |-- Cat$col3: date (nullable = true)
 |-- Cat$col4: decimal(13,2) (nullable = true)
 |-- Cat$col5: integer (nullable = true)
 |-- Cat$col6: integer (nullable = true)
 |-- Cat$col7: string (nullable = true)
 |-- Cat$col8: string (nullable = true)
 |-- Cat$col9: string (nullable = true)
 |-- Cat$col10: string (nullable = true)

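Note: the schema is not specified explicitly, because Spark throws a different exception when I try to impose a schema and suggests reading the data without specifying one.

It is unclear where and why an ArrayList is created in the Spark DataFrame that then cannot be cast to a String.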

Answer 1

The problem was a missing * before partition_keys to unpack the list. It was solved as follows:

partition_keys = ["Cat$col1", "Cat$col2"]
df.write.mode("overwrite").partitionBy(*partition_keys)

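The confusion came from the error message java.lang.ClassCastException: class java.util.ArrayList cannot be cast to class java.lang.String. It seems the Python list was converted to an ArrayList in Java, which then could not be cast to a String for use as a partition column name.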
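To make the difference concrete, here is a minimal plain-Python sketch of what the * changes at the call site. The function partition_by below is a hypothetical stand-in for any variadic API; it is not PySpark's implementation and only shows which arguments the callee receives.

```python
def partition_by(*cols):
    """Hypothetical stand-in for a variadic API (not PySpark's code)."""
    # Each positional argument becomes one element of the tuple `cols`.
    return cols


partition_keys = ["Cat$col1", "Cat$col2"]

# Without *, the whole list arrives as ONE argument (a list inside a tuple).
packed = partition_by(partition_keys)

# With *, each column name arrives as its own string argument.
unpacked = partition_by(*partition_keys)

print(packed)    # (['Cat$col1', 'Cat$col2'],)
print(unpacked)  # ('Cat$col1', 'Cat$col2')
```

In the packed case the callee sees a single list where it expects a column name, which is consistent with a list-like object failing to cast to a String on the Java side, as the error message reports.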

Solution 1
0 2021-07-15 20:57:00
