
ClassCastException in Spark Read Teradata and Write Parquet

I'm running a Spark job that reads a DataFrame from a Teradata DBMS via a SQL query.

When the job writes the DataFrame to S3 as Parquet, as in

partition_keys = ["Cat$col1", "Cat$col2"]
df.write.mode("overwrite").partitionBy(partition_keys).parquet(s3_path)  # s3_path: output location (placeholder)

the following java.lang.ClassCastException is thrown:

File "/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1249, in parquet
  File "/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/lib/python3.7/site-packages/pyspark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 111, in deco
  File "/lib/python3.7/site-packages/pyspark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o58.parquet.
: java.lang.ClassCastException: class java.util.ArrayList cannot be cast to class java.lang.String (java.util.ArrayList and java.lang.String are in module java.base of loader 'bootstrap')

The schema of the DataFrame is:

StructType(List(StructField(Cat$col1,IntegerType,true),StructField(Cat$col2,StringType,true),StructField(Cat$col3,DateType,true),StructField(Cat$col4,DecimalType(13,2),true),StructField(Cat$col5,IntegerType,true),StructField(Cat$col6,IntegerType,true),StructField(Cat$col7,StringType,true),StructField(Cat$col8,StringType,true),StructField(Cat$col9,StringType,true),StructField(Cat$col10,StringType,true)))
root
 |-- Cat$col1: integer (nullable = true)
 |-- Cat$col2: string (nullable = true)
 |-- Cat$col3: date (nullable = true)
 |-- Cat$col4: decimal(13,2) (nullable = true)
 |-- Cat$col5: integer (nullable = true)
 |-- Cat$col6: integer (nullable = true)
 |-- Cat$col7: string (nullable = true)
 |-- Cat$col8: string (nullable = true)
 |-- Cat$col9: string (nullable = true)
 |-- Cat$col10: string (nullable = true)

Note: the schema is not explicitly specified, because Spark throws another exception when a schema is imposed on the read, with a suggestion not to specify the schema when reading the data.
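
For reference, the read roughly follows the standard Spark JDBC pattern. The following is a minimal sketch; the host, database, credentials, and query are placeholders, not values from the original job:

df = (
    spark.read.format("jdbc")
    .option("driver", "com.teradata.jdbc.TeraDriver")             # Teradata JDBC driver class
    .option("url", "jdbc:teradata://<host>/DATABASE=<database>")  # placeholder connection URL
    .option("query", "SELECT ... FROM <table>")                   # the SQL query (placeholder)
    .option("user", "<user>")
    .option("password", "<password>")
    .load()  # no .schema(...) call: the JDBC source derives the schema itself
)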

It is unclear where and why an ArrayList was created in the Spark DataFrame that now cannot be cast to a String.

The issue was a missing * before partition_keys to unpack the list. The problem was solved as follows:

partition_keys = ["Cat$col1", "Cat$col2"]
df.write.mode("overwrite").partitionBy(*partition_keys).parquet(s3_path)
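
Equivalently, the partition columns can be passed as separate string arguments, since partitionBy accepts varargs:

df.write.mode("overwrite").partitionBy("Cat$col1", "Cat$col2").parquet(s3_path)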

The confusion arose from the error message java.lang.ClassCastException: class java.util.ArrayList cannot be cast to class java.lang.String. It seems the Python list was converted to a java.util.ArrayList on the Java side and then failed to be cast to a String for use as a partition column name.
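
A minimal sketch of the unpacking difference (f is a stand-in for partitionBy, whose Python signature takes *cols):

def f(*cols):
    return cols

keys = ["Cat$col1", "Cat$col2"]

f(keys)   # cols == (["Cat$col1", "Cat$col2"],): one list argument,
          # which py4j converts to a single java.util.ArrayList on the JVM side
f(*keys)  # cols == ("Cat$col1", "Cat$col2"): two string arguments,
          # which arrive as individual java.lang.String values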
