
java.lang.ArrayIndexOutOfBoundsException: 1 while saving a data frame in Spark Scala

In EMR, we fetch records from Salesforce objects using Salesforce Bulk API calls. For one of the objects (TASK), we get the error below while saving the DataFrame to parquet.

    java.lang.ArrayIndexOutOfBoundsException: 1
        at org.apache.spark.sql.catalyst.expressions.GenericRow.get(rows.scala:174)
        at org.apache.spark.sql.Row$class.apply(Row.scala:163)
        at org.apache.spark.sql.catalyst.expressions.GenericRow.apply(rows.scala:166)
        at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:60)
        at org.apache.spark.sql.execution.RDDConversions$$anonfun$rowToRowRdd$1$$anonfun$apply$3.apply(ExistingRDD.scala:57)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
        at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
        at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:232)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

The code that builds the schema, loads the object, and writes the parquet output:

    import org.apache.spark.sql.SaveMode
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // One nullable string column per non-compound Salesforce field.
    val sfdcObjectSchema = StructType(
      nonCompoundMetas.map(_.Name).map(
        fieldName => StructField(fieldName, StringType, true)
      )
    )

    // Bulk-load the object through the spark-salesforce connector.
    val sfdcObjectDF = spark.read.format("com.springml.spark.salesforce")
      .option("username", userName).option("password", s"$sfdcPassword$sfdcToken")
      .option("soql", retrievingSOQL)
      .option("version", JavaUtils.getConfigProps(runtimeEnvironment).getProperty("sfdc.api.version"))
      .option("sfObject", sfdcObject).option("bulk", "true")
      .option("pkChunking", pkChunking).option("chunkSize", checkingSize)
      .option("timeout", bulkTimeoutMillis.toString)
      .option("maxCharsPerColumn", "-1").option("maxColumns", nonCompoundMetas.size.toString)
      .schema(sfdcObjectSchema).load()

    // Drop rows that are entirely null, then overwrite the parquet output.
    sfdcObjectDF.na.drop("all").write.mode(SaveMode.Overwrite)
      .parquet(s"${JavaUtils.getConfigProps(runtimeEnvironment).getProperty("etl.dataset.root")}/$accountName/$sfdcObject")

Please help us figure out how to debug this further.

This issue is caused by your Salesforce SOQL returning an empty result set, which triggers this runtime error.

I think the root cause is that when https://github.com/springml/spark-salesforce was designed against the Spark data source API, it did not handle the empty-result-set case, hence this bug. You may want to open an issue on its GitHub repository to get this fixed.
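To see why this surfaces as ArrayIndexOutOfBoundsException: 1, here is a minimal reproduction of the same failure mode on Spark 2.x (my own sketch, not the connector's actual code path): a Row that carries fewer values than its declared schema crashes when Spark converts it during the write, because the conversion calls row.get(i) for every schema field.

    import org.apache.spark.sql.{Row, SparkSession}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val spark = SparkSession.builder().master("local[*]").appName("aioobe-repro").getOrCreate()

    // Two nullable string columns, like the schema built from nonCompoundMetas.
    val schema = StructType(Seq(
      StructField("Id", StringType, true),
      StructField("Subject", StringType, true)
    ))

    // A row with only one value, standing in for what a malformed or empty
    // bulk result can hand back to the data source.
    val rdd = spark.sparkContext.parallelize(Seq(Row("only-one-value")))
    val df = spark.createDataFrame(rdd, schema)

    // Nothing fails until an action runs; the write then calls row.get(1) on a
    // one-field row and throws java.lang.ArrayIndexOutOfBoundsException: 1.
    df.write.parquet("/tmp/aioobe_repro")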

As a temporary workaround, you can run a "select count(id) ..." SOQL first, make sure the result is > 0, and only then generate the DataFrame and use it in Spark; a sketch of that guard follows.
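A minimal sketch of that guard, reusing the reader settings from the question. Whether the connector accepts an aggregate SOQL like this is an assumption on my part; if it does not, the count can be fetched directly from the Salesforce REST API instead.

    // Hypothetical pre-check: count the records first and only run the
    // parquet write when at least one exists.
    val countSOQL = s"SELECT COUNT(Id) FROM $sfdcObject"

    val countDF = spark.read.format("com.springml.spark.salesforce")
      .option("username", userName)
      .option("password", s"$sfdcPassword$sfdcToken")
      .option("soql", countSOQL)
      .option("sfObject", sfdcObject)
      .load()

    // Without an explicit schema the connector yields string columns, so the
    // count comes back as a string in the first (and only) column.
    val recordCount = countDF.collect().headOption
      .map(_.getString(0).toLong)
      .getOrElse(0L)

    if (recordCount > 0) {
      sfdcObjectDF.na.drop("all").write.mode(SaveMode.Overwrite)
        .parquet(s"${JavaUtils.getConfigProps(runtimeEnvironment).getProperty("etl.dataset.root")}/$accountName/$sfdcObject")
    }

Since load() is lazy, sfdcObjectDF can still be defined exactly as in the question; only the write action, which is what actually triggers the scan, needs to sit behind the guard.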
