
Teradata fastload fails when pyspark dataframe has more than one partition


I am trying to write a Spark dataframe to Teradata using FASTLOAD. The write succeeds if I force the dataframe into a single partition with df_final = df_final.repartition(1), but it fails whenever there is more than one partition. Since the data volume is large, applying repartition(1) puts all of the work on a single node. I even tried matching the number of partitions to the number of sessions, but that did not work either.



    df_final.write.option("truncate", truncate) \
        .mode(mode) \
        .option("batchsize", 100000) \
        .jdbc(url="jdbc:teradata://host/DBS_PORT=port,LOGMECH=TD2,TMODE=ANSI,CHARSET=UTF16,ENCRYPTDATA=ON,TYPE=FASTLOAD,SESSIONS=2,ERROR_TABLE_DATABASE=errortble",
              table="tempdb.temptable",
              properties=connectionProperties)
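For reference, the single-partition workaround mentioned above can be sketched as follows. This is a hypothetical sketch, not a fix: the helper fastload_url and its defaults are made up for illustration, and df_final/connectionProperties are assumed from the snippet above. Note that coalesce(1) avoids the full shuffle that repartition(1) incurs when only reducing the partition count.

```python
def fastload_url(host, port, sessions=1, error_db="errortble"):
    """Build a Teradata FASTLOAD JDBC URL using the parameters from the question.
    Helper name and defaults are illustrative only."""
    params = {
        "DBS_PORT": port, "LOGMECH": "TD2", "TMODE": "ANSI",
        "CHARSET": "UTF16", "ENCRYPTDATA": "ON", "TYPE": "FASTLOAD",
        "SESSIONS": sessions, "ERROR_TABLE_DATABASE": error_db,
    }
    # Teradata JDBC URLs carry parameters as comma-separated NAME=value pairs
    return "jdbc:teradata://%s/%s" % (
        host, ",".join(f"{k}={v}" for k, v in params.items()))

# The actual write, collapsed to one partition so only one FastLoad runs:
# df_final.coalesce(1).write.mode(mode) \
#     .option("batchsize", 100000) \
#     .jdbc(url=fastload_url("host", "port"),
#           table="tempdb.temptable",
#           properties=connectionProperties)
```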

Teradata version: 16.20.53.04
JDBC driver version: 17.00.00.03

Stack trace:



2022-01-13 15:58:04.701899: Loading data into tempdb.temptable with write mode as overwrite and truncate as true
An error occurred while calling o1002.jdbc.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 4 times, most recent failure: Lost task 0.3 in stage 15.0 (TID 31, X.X.X.X, executor 0): java.sql.BatchUpdateException: [Teradata JDBC Driver] [TeraJDBC 17.00.00.03] [Error 1154] [SQLState HY000] A failure occurred while inserting the batch of rows destined for database table "TempDB"."temptable". Details of the failure can be found in the exception chain that is accessible with getNextException.
    at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:149)
    at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:133)
    at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.executeBatch(FastLoadManagerPreparedStatement.java:2389)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:691)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:858)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:856)
    at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1001)
    at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1001)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2379)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
    at org.apache.spark.scheduler.Task.run(Task.scala:117)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:655)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:658)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLException: [Teradata JDBC Driver] [TeraJDBC 17.00.00.03] [Error 1147] [SQLState HY000] The next failure(s) in the exception chain occurred while beginning FastLoad of database table "TempDB"."temptable"
    at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:95)
    at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:70)
    at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.beginFastLoad(FastLoadManagerPreparedStatement.java:966)
    at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.executeBatch(FastLoadManagerPreparedStatement.java:2210)
    ... 15 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2519)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2466)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2460)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2460)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1152)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1152)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1152)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2721)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2668)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2656)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:938)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2339)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2360)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2379)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2404)
    at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$1(RDD.scala:1001)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:125)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:395)
    at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:999)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.saveTable(JdbcUtils.scala:856)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:58)
    at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
    at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:91)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:200)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$3(SparkPlan.scala:252)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:248)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:158)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:157)
    at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:999)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:249)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:845)
    at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:199)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:999)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:437)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:421)
    at org.apache.spark.sql.DataFrameWriter.jdbc(DataFrameWriter.scala:827)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
    at py4j.Gateway.invoke(Gateway.java:295)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:251)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.BatchUpdateException: [Teradata JDBC Driver] [TeraJDBC 17.00.00.03] [Error 1154] [SQLState HY000] A failure occurred while inserting the batch of rows destined for database table "TempDB"."temptable". Details of the failure can be found in the exception chain that is accessible with getNextException.
    at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:149)
    at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:133)
    at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.executeBatch(FastLoadManagerPreparedStatement.java:2389)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:691)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:858)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:856)
    at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:1001)
    at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:1001)
    at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2379)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:144)
    at org.apache.spark.scheduler.Task.run(Task.scala:117)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$9(Executor.scala:655)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1581)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:658)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.sql.SQLException: [Teradata JDBC Driver] [TeraJDBC 17.00.00.03] [Error 1147] [SQLState HY000] The next failure(s) in the exception chain occurred while beginning FastLoad of database table "TempDB"."temptable"
    at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:95)
    at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDriverJDBCException(ErrorFactory.java:70)
    at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.beginFastLoad(FastLoadManagerPreparedStatement.java:966)
    at com.teradata.jdbc.jdbc.fastload.FastLoadManagerPreparedStatement.executeBatch(FastLoadManagerPreparedStatement.java:2210)
    ... 15 more
1 Answer

You are probably attempting to start more than one FastLoad operation against the same destination table at the same time, which the database does not allow. With multiple Spark partitions, each task opens its own JDBC connection and begins its own FastLoad of the table.

The exception stack trace you posted does not include the entire SQLException chain that is available from the SQLException.getNextException method.

You are using Teradata JDBC Driver 17.00.00.03. In Teradata JDBC Driver 17.10, we added the FLATTEN=ON connection parameter to combine the SQLException chain from a JDBC FastLoad operation into the top-level thrown SQLException.

Upgrade to the latest Teradata JDBC Driver 17.10 release and specify the FLATTEN=ON connection parameter; that should reveal the root-cause error from the database.
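Applying that suggestion to the question's snippet is just a matter of appending the parameter to the JDBC URL, since Teradata connection parameters are comma-separated NAME=value pairs. A minimal sketch (url_with_flatten is an illustrative name; the base URL is copied from the question):

```python
# Connection URL from the question, with FLATTEN=ON appended so that,
# on driver 17.10+, the full SQLException chain appears in the
# top-level error message instead of being hidden behind getNextException.
base_url = ("jdbc:teradata://host/DBS_PORT=port,LOGMECH=TD2,TMODE=ANSI,"
            "CHARSET=UTF16,ENCRYPTDATA=ON,TYPE=FASTLOAD,SESSIONS=2,"
            "ERROR_TABLE_DATABASE=errortble")

url_with_flatten = base_url + ",FLATTEN=ON"

# Then write exactly as before, only with the new URL:
# df_final.write.mode(mode).option("batchsize", 100000) \
#     .jdbc(url=url_with_flatten, table="tempdb.temptable",
#           properties=connectionProperties)
```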

