
ClassCastException in Spark Read Teradata and Write Parquet

I'm running a Spark job that reads a DataFrame from a Teradata DBMS using a SQL query.
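For reference, reading from Teradata over JDBC typically looks something like the sketch below. This is a minimal sketch only: the URL, driver class, query, and credentials are placeholders, since none of them appear in the original post.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("teradata-to-parquet").getOrCreate()

# All connection details below are placeholders, not values from the original post.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:teradata://<host>/DATABASE=<database>")  # Teradata JDBC URL format
    .option("driver", "com.teradata.jdbc.TeraDriver")             # Teradata JDBC driver class
    .option("query", "SELECT * FROM <table>")                     # the SQL query being read
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)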

When the job writes the file to S3 as Parquet, as in

partition_keys = ["Cat$col1", "Cat$col2"]
# Passing the list itself (rather than unpacking it) triggers the error below;
# output_path is a placeholder for the S3 destination.
df.write.mode("overwrite").partitionBy(partition_keys).parquet(output_path)

the following java.lang.ClassCastException is thrown:

File "/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1249, in parquet
  File "/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
  File "/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o58.parquet.
: java.lang.ClassCastException: class java.util.ArrayList cannot be cast to class java.lang.String (java.util.ArrayList and java.lang.String are in module java.base of loader 'bootstrap')

The schema of the DataFrame is:

StructType(List(StructField(Cat$col1,IntegerType,true),StructField(Cat$col2,StringType,true),StructField(Cat$col3,DateType,true),StructField(Cat$col4,DecimalType(13,2),true),StructField(Cat$col5,IntegerType,true),StructField(Cat$col6,IntegerType,true),StructField(Cat$col7,StringType,true),StructField(Cat$col8,StringType,true),StructField(Cat$col9,StringType,true),StructField(Cat$col10,StringType,true)))
root
 |-- Cat$col1: integer (nullable = true)
 |-- Cat$col2: string (nullable = true)
 |-- Cat$col3: date (nullable = true)
 |-- Cat$col4: decimal(13,2) (nullable = true)
 |-- Cat$col5: integer (nullable = true)
 |-- Cat$col6: integer (nullable = true)
 |-- Cat$col7: string (nullable = true)
 |-- Cat$col8: string (nullable = true)
 |-- Cat$col9: string (nullable = true)
 |-- Cat$col10: string (nullable = true)

Note: the schema is not explicitly specified, because when a schema was imposed Spark threw another exception suggesting that the schema not be specified when reading data.

It is unclear where and why an ArrayList was created in the Spark DataFrame that now cannot be cast to a String.

The issue was a missing * before partition_keys to unpack the list. The problem was solved as follows:

partition_keys = ["Cat$col1", "Cat$col2"]
# Unpacking with * passes each column name as a separate string argument;
# output_path is a placeholder for the S3 destination.
df.write.mode("overwrite").partitionBy(*partition_keys).parquet(output_path)

The confusion arose from the error message java.lang.ClassCastException: class java.util.ArrayList cannot be cast to class java.lang.String. It seems the Python list was converted to an ArrayList in Java and then failed to be cast to a String for use as a partition column name.
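The Python side of this is plain argument unpacking: a varargs function called with the list itself receives a single list argument, while calling it with * spreads the list into separate string arguments. A minimal standalone sketch of the difference (plain Python, not PySpark):

def show(*cols):
    # cols is a tuple of whatever positional arguments were passed
    for c in cols:
        print(type(c).__name__, c)

keys = ["Cat$col1", "Cat$col2"]

show(keys)   # one argument, the list itself: list ['Cat$col1', 'Cat$col2']
show(*keys)  # two arguments, each a string:  str Cat$col1 / str Cat$col2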
