
cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.datasources.v2.DataSourceRDD

I am trying to submit a PySpark job to a Spark cluster on Kubernetes using Airflow. In that Spark job I write the streaming data with writeStream and a foreachBatch function, and regardless of the sink type I only run into this problem when I try to write the data:

Versions inside the Spark cluster: Spark 3.3.0, PySpark 3.3, Scala 2.12.15, OpenJDK 64-Bit Server VM 11.0.15

Inside Airflow: Spark 3.1.2, PySpark 3.1.2, Scala 2.12.10, OpenJDK 64-Bit Server VM 1.8.0

Dependencies: org.scala-lang:scala-library:2.12.8, org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0, org.apache.spark:spark-sql_2.12:3.3.0, org.apache.spark:spark-core_2.12:3.3.0, org.postgresql:postgresql:42.3.3
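
For reference, here is a minimal sketch (not part of the original post) of how the Spark and Scala versions actually in play can be confirmed from a PySpark job; the executor check assumes the job is allowed to run a small throwaway action:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('version-check').getOrCreate()

# Spark and Scala versions seen by the driver (the Scala call goes through the
# internal py4j gateway, so treat it as a convenience, not a public API).
print('driver Spark version:', spark.version)
print('driver Scala version:', spark.sparkContext._jvm.scala.util.Properties.versionString())

# PySpark version seen by the executors, collected with a tiny job.
print('executor PySpark versions:',
      spark.sparkContext.parallelize(range(2), 2)
           .map(lambda _: __import__('pyspark').__version__)
           .distinct()
           .collect())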

The DAG I use to submit it is:

import airflow
from datetime import timedelta
from airflow import DAG
from time import sleep
from datetime import datetime
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

dag = DAG( dag_id = 'testpostgres.py', schedule_interval=None ,  start_date=datetime(2022, 1, 1), catchup=False)

spark_job = SparkSubmitOperator(application= '/usr/local/airflow/data/testpostgres.py',
                            conn_id= 'spark_kcluster',
                            task_id= 'spark_job_test',
                            dag= dag,
                            packages= "org.scala-lang:scala-library:2.12.8,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0,org.apache.spark:spark-sql_2.12:3.3.0,org.apache.spark:spark-core_2.12:3.3.0,org.postgresql:postgresql:42.3.3",
                            conf ={
                                   'deploy-mode' : 'cluster',
                                   'executor_cores' : 1,
                                   'EXECUTORS_MEM' : '2G',
                                   'name' : 'spark-py',
                                   'spark.kubernetes.namespace' : 'sandbox',
                                   'spark.kubernetes.file.upload.path' : '/usr/local/airflow/data',
                                   'spark.kubernetes.container.image' : '**********',
                                   'spark.kubernetes.container.image.pullPolicy' : 'IfNotPresent',
                                   'spark.kubernetes.authenticate.driver.serviceAccountName' : 'spark',
                                   'spark.kubernetes.driver.volumes.persistentVolumeClaim.rwopvc.options.claimName' : 'data-pvc',
                                   'spark.kubernetes.driver.volumes.persistentVolumeClaim.rwopvc.mount.path' : '/usr/local/airflow/data',
                                   'spark.driver.extraJavaOptions' : '-Divy.cache.dir=/tmp -Divy.home=/tmp'
                                  }

)

This is the job I am submitting:

from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import dayofweek
from pyspark.sql.functions import date_format
from pyspark.sql.functions import hour
from functools import reduce
from pyspark.sql.types import DoubleType, StringType, ArrayType
import pandas as pd
import json

spark = SparkSession.builder.appName('spark').getOrCreate()


kafka_topic_name = '****'
kafka_bootstrap_servers = '*********' + ':' + '*****'

streaming_dataframe = spark.readStream.format("kafka").option("kafka.bootstrap.servers", kafka_bootstrap_servers).option("subscribe", kafka_topic_name).option("startingOffsets", "earliest").load()
streaming_dataframe = streaming_dataframe.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

dataframe_schema = '******'
streaming_dataframe = streaming_dataframe.select(from_csv(col("value"), dataframe_schema).alias("pipeline")).select("pipeline.*")

tumblingWindows = streaming_dataframe.withWatermark("timeStamp", "48 hour").groupBy(window("timeStamp", "24 hour", "1 hour"), "phoneNumber").agg((F.first(F.col("duration")).alias("firstDuration")))

tumblingWindows = tumblingWindows.withColumn("start_window", F.col('window')['start'])
tumblingWindows = tumblingWindows.withColumn("end_window", F.col('window')['end'])
tumblingWindows = tumblingWindows.drop('window')

def postgres_write(tumblingWindows, epoch_id):
    # Write each micro-batch DataFrame to Postgres over JDBC.
    tumblingWindows.write.jdbc(url=db_target_url, table=table_postgres, mode='append', properties=db_target_properties)

db_target_url = 'jdbc:postgresql://' + '*******'+ ':' + '****' + '/' + 'test'

table_postgres = '******'

db_target_properties = {
     'user': 'postgres',
     'password': 'postgres',
     'driver': 'org.postgresql.Driver'
}
query = tumblingWindows.writeStream.foreachBatch(postgres_write).start().awaitTermination()

Error log:

Driver stacktrace:
      at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
      at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
      at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
      at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
      at scala.Option.foreach(Option.scala:407)
      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
      at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
      at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
      at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
      at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
      at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:377)
      ... 42 more
Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.datasources.v2.DataSourceRDDPartition.inputPartitions of type scala.collection.Seq in instance of org.apache.spark.sql.execution.datasources.v2.DataSourceRDDPartition
      at java.base/java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(Unknown Source)
      at java.base/java.io.ObjectStreamClass$FieldReflector.checkObjectFieldValueTypes(Unknown Source)
      at java.base/java.io.ObjectStreamClass.checkObjFieldValueTypes(Unknown Source)
      at java.base/java.io.ObjectInputStream.defaultCheckFieldValues(Unknown Source)
      at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
      at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
      at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
      at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
      at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
      at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
      at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
      at java.base/java.io.ObjectInputStream.readObject(Unknown Source)
      at java.base/java.io.ObjectInputStream.readObject(Unknown Source)
      at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
      at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129)
      at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:507)
      at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
      at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
      at java.base/java.lang.Thread.run(Unknown Source)
Traceback (most recent call last):
File "/usr/local/airflow/data/spark-upload-d03175bc-8c50-4baf-8383-a203182f16c0/debug.py", line 20, in <module>
  streaming_dataframe.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")\
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 107, in awaitTermination
File "/opt/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 196, in deco
pyspark.sql.utils.StreamingQueryException: Query [id = d0e140c1-830d-49c8-88b7-90b82d301408, runId = c0f38f58-6571-4fda-b3e0-98e4ffaf8c7a] terminated with exception: Writing job aborted
22/08/24 10:12:53 INFO SparkUI: Stopped Spark web UI at ************************
22/08/24 10:12:53 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
22/08/24 10:12:53 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
22/08/24 10:12:53 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
22/08/24 10:12:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/08/24 10:12:53 INFO MemoryStore: MemoryStore cleared
22/08/24 10:12:53 INFO BlockManager: BlockManager stopped
22/08/24 10:12:53 INFO BlockManagerMaster: BlockManagerMaster stopped
22/08/24 10:12:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/08/24 10:12:54 INFO SparkContext: Successfully stopped SparkContext
22/08/24 10:12:54 INFO ShutdownHookManager: Shutdown hook called
22/08/24 10:12:54 INFO ShutdownHookManager: Deleting directory /var/data/spark-32ef85e0-e85c-4ac6-a46d-d3379ca58468/spark-adecf44a-dc60-4a85-bbe3-bc125f5cc39f/pyspark-f3ffaa5e-a490-464a-98d2-fbce223628eb
22/08/24 10:12:54 INFO ShutdownHookManager: Deleting directory /var/data/spark-32ef85e0-e85c-4ac6-a46d-d3379ca58468/spark-adecf44a-dc60-4a85-bbe3-bc125f5cc39f
22/08/24 10:12:54 INFO ShutdownHookManager: Deleting directory /tmp/spark-5acdd5e6-7f6e-45ec-adae-e98862e1537c



I ran into this problem recently. I think it happens when the data coming from Kafka gets shuffled. I fixed it by loading all of the dependencies (jars) of org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 into the project. You can find them here. At the moment I don't know which subset of them is sufficient.
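
As a hedged sketch of that idea in the context of the DAG above (reusing the imports and the dag object from the question; the exact dependency list should be verified against the connector's 3.3.0 POM on Maven Central):

# Hypothetical variant of the SparkSubmitOperator call: list the Kafka connector
# together with its companion Spark module and the Postgres driver, and let
# spark-submit's Ivy resolution pull their transitive jars (kafka-clients,
# commons-pool2, ...). The same jars must also be visible to the executors,
# e.g. baked into the Spark 3.3.0 container image.
kafka_packages = ",".join([
    "org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0",
    "org.apache.spark:spark-token-provider-kafka-0-10_2.12:3.3.0",
    "org.postgresql:postgresql:42.3.3",
])

spark_job = SparkSubmitOperator(
    application='/usr/local/airflow/data/testpostgres.py',
    conn_id='spark_kcluster',
    task_id='spark_job_test',
    dag=dag,
    packages=kafka_packages,
    # ...same conf dictionary as in the DAG above...
)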
