
How to transfer a large number of files within ADLS Gen2/Blob using Databricks?

I need to move a fairly large number (100k+) of files from one folder to another on ADLS Gen2 almost every day using Azure Databricks. dbutils.fs.ls is not parallel, so it takes far more time than I can afford. Is there any way to transfer them efficiently? I have tried building a DataFrame from the file listing and running dbutils.fs.ls in a map or foreach over that DataFrame (see the sketch after the error below), but it does not work and gives this error -

org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 13.0 failed 4 times, most recent failure: Lost task 10.3 in stage 13.0 (TID 375, 10.139.64.4, executor 0): java.io.IOException: No FileSystem for scheme: wasbs

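For context, here is a rough sketch of the pattern described above. This is my reconstruction, not the original notebook code; container, account and folder names are placeholders. The stack trace further down shows the per-file call ending up in dbutils.fs.cp inside a task on an executor, which is where the wasbs scheme fails to resolve:

import org.apache.spark.sql.Row
import spark.implicits._

val srcRoot = "wasbs://<container>@<account>.blob.core.windows.net/source/"
val dstRoot = "wasbs://<container>@<account>.blob.core.windows.net/target/"

// Listing happens once on the driver; the paths are then distributed as a DataFrame.
val filesDF = dbutils.fs.ls(srcRoot).map(_.path).toDF("path")

filesDF.foreachPartition { rows: Iterator[Row] =>
  rows.foreach { row =>
    val src = row.getString(0)
    // Runs on the executors, which cannot resolve the wasbs:// scheme here.
    dbutils.fs.cp(src, src.replace(srcRoot, dstRoot))
  }
}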

Based on my findings I also added the following configuration and attached the required jar to my cluster, but I still get the same error.

sc.hadoopConfiguration.set("fs.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
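One thing worth noting (my observation, not part of the original post): sc.hadoopConfiguration.set only changes the Hadoop configuration on the driver, while the failing dbutils.fs.cp calls run inside tasks on the executors. Spark copies properties prefixed with spark.hadoop. into the Hadoop configuration of both the driver and the executors when the context starts, so a commonly suggested workaround is to move the same settings into the cluster's Spark config instead, for example (account name and key are placeholders, and the hadoop-azure / azure-storage jars still need to be installed on the cluster):

spark.hadoop.fs.wasbs.impl org.apache.hadoop.fs.azure.NativeAzureFileSystem
spark.hadoop.fs.azure.account.key.<storage-account>.blob.core.windows.net <storage-account-access-key>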

Full stack trace -

at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at com.databricks.backend.daemon.dbutils.FSUtils$.getFS(DBUtilsCore.scala:251)
at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$cp$1$$anonfun$apply$3.apply(DBUtilsCore.scala:114)
at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$cp$1$$anonfun$apply$3.apply(DBUtilsCore.scala:113)
at com.databricks.backend.daemon.dbutils.FSUtils$.com$databricks$backend$daemon$dbutils$FSUtils$$withFsSafetyCheck(DBUtilsCore.scala:81)
at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$cp$1.apply(DBUtilsCore.scala:113)
at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$cp$1.apply(DBUtilsCore.scala:113)
at com.databricks.backend.daemon.dbutils.FSUtils$.com$databricks$backend$daemon$dbutils$FSUtils$$withFsSafetyCheck(DBUtilsCore.scala:81)
at com.databricks.backend.daemon.dbutils.FSUtils$.cp(DBUtilsCore.scala:112)
at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.cp(DbfsUtilsImpl.scala:45)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1$$anonfun$apply$1.apply(command-3171921041650758:1)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1$$anonfun$apply$1.apply(command-3171921041650758:1)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-3171921041650758:1)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-3171921041650758:1)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:987)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:987)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2323)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2323)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
at org.apache.spark.scheduler.Task.run(Task.scala:113)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:537)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:543)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:2362)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2350)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:2349)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2349)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1102)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:1102)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1102)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2582)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2529)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2517)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:897)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2282)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2304)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2323)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2348)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:987)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:985)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:392)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:985)
at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply$mcV$sp(Dataset.scala:2810)
at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2810)
at org.apache.spark.sql.Dataset$$anonfun$foreachPartition$1.apply(Dataset.scala:2810)
at org.apache.spark.sql.Dataset$$anonfun$withNewRDDExecutionId$1.apply(Dataset.scala:3477)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withCustomExecutionEnv$1.apply(SQLExecution.scala:113)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:243)
at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:99)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:173)
at org.apache.spark.sql.Dataset.withNewRDDExecutionId(Dataset.scala:3473)
at org.apache.spark.sql.Dataset.foreachPartition(Dataset.scala:2809)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw$$iw$$iw$$iw$$iw.<init>(command-3171921041650758:1)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw$$iw$$iw$$iw.<init>(command-3171921041650758:46)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw$$iw$$iw.<init>(command-3171921041650758:48)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw$$iw.<init>(command-3171921041650758:50)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw.<init>(command-3171921041650758:52)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw.<init>(command-3171921041650758:54)
at line6cd233c6e76046b688ada150c95cd077118.$read.<init>(command-3171921041650758:56)
at line6cd233c6e76046b688ada150c95cd077118.$read$.<init>(command-3171921041650758:60)
at line6cd233c6e76046b688ada150c95cd077118.$read$.<clinit>(command-3171921041650758)
at line6cd233c6e76046b688ada150c95cd077118.$eval$.$print$lzycompute(<notebook>:7)
at line6cd233c6e76046b688ada150c95cd077118.$eval$.$print(<notebook>:6)
at line6cd233c6e76046b688ada150c95cd077118.$eval.$print(<notebook>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.tools.nsc.interpreter.IMain$ReadEvalPrint.call(IMain.scala:793)
at scala.tools.nsc.interpreter.IMain$Request.loadAndRun(IMain.scala:1054)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:645)
at scala.tools.nsc.interpreter.IMain$WrappedRequest$$anonfun$loadAndRunReq$1.apply(IMain.scala:644)
at scala.reflect.internal.util.ScalaClassLoader$class.asContext(ScalaClassLoader.scala:31)
at scala.reflect.internal.util.AbstractFileClassLoader.asContext(AbstractFileClassLoader.scala:19)
at scala.tools.nsc.interpreter.IMain$WrappedRequest.loadAndRunReq(IMain.scala:644)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:576)
at scala.tools.nsc.interpreter.IMain.interpret(IMain.scala:572)
at com.databricks.backend.daemon.driver.DriverILoop.execute(DriverILoop.scala:215)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply$mcV$sp(ScalaDriverLocal.scala:202)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:202)
at com.databricks.backend.daemon.driver.ScalaDriverLocal$$anonfun$repl$1.apply(ScalaDriverLocal.scala:202)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExitInternal$.trapExit(DriverLocal.scala:714)
at com.databricks.backend.daemon.driver.DriverLocal$TrapExit$.apply(DriverLocal.scala:667)
at com.databricks.backend.daemon.driver.ScalaDriverLocal.repl(ScalaDriverLocal.scala:202)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:396)
at com.databricks.backend.daemon.driver.DriverLocal$$anonfun$execute$9.apply(DriverLocal.scala:373)
at com.databricks.logging.UsageLogging$$anonfun$withAttributionContext$1.apply(UsageLogging.scala:238)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at com.databricks.logging.UsageLogging$class.withAttributionContext(UsageLogging.scala:233)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionContext(DriverLocal.scala:49)
at com.databricks.logging.UsageLogging$class.withAttributionTags(UsageLogging.scala:275)
at com.databricks.backend.daemon.driver.DriverLocal.withAttributionTags(DriverLocal.scala:49)
at com.databricks.backend.daemon.driver.DriverLocal.execute(DriverLocal.scala:373)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at com.databricks.backend.daemon.driver.DriverWrapper$$anonfun$tryExecutingCommand$2.apply(DriverWrapper.scala:644)
at scala.util.Try$.apply(Try.scala:192)
at com.databricks.backend.daemon.driver.DriverWrapper.tryExecutingCommand(DriverWrapper.scala:639)
at com.databricks.backend.daemon.driver.DriverWrapper.getCommandOutputAndError(DriverWrapper.scala:485)
at com.databricks.backend.daemon.driver.DriverWrapper.executeCommand(DriverWrapper.scala:597)
at com.databricks.backend.daemon.driver.DriverWrapper.runInnerLoop(DriverWrapper.scala:390)
at com.databricks.backend.daemon.driver.DriverWrapper.runInner(DriverWrapper.scala:337)
at com.databricks.backend.daemon.driver.DriverWrapper.run(DriverWrapper.scala:219)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: No FileSystem for scheme: wasbs
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2660)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at com.databricks.backend.daemon.dbutils.FSUtils$.getFS(DBUtilsCore.scala:251)
at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$cp$1$$anonfun$apply$3.apply(DBUtilsCore.scala:114)
at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$cp$1$$anonfun$apply$3.apply(DBUtilsCore.scala:113)
at com.databricks.backend.daemon.dbutils.FSUtils$.com$databricks$backend$daemon$dbutils$FSUtils$$withFsSafetyCheck(DBUtilsCore.scala:81)
at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$cp$1.apply(DBUtilsCore.scala:113)
at com.databricks.backend.daemon.dbutils.FSUtils$$anonfun$cp$1.apply(DBUtilsCore.scala:113)
at com.databricks.backend.daemon.dbutils.FSUtils$.com$databricks$backend$daemon$dbutils$FSUtils$$withFsSafetyCheck(DBUtilsCore.scala:81)
at com.databricks.backend.daemon.dbutils.FSUtils$.cp(DBUtilsCore.scala:112)
at com.databricks.dbutils_v1.impl.DbfsUtilsImpl.cp(DbfsUtilsImpl.scala:45)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1$$anonfun$apply$1.apply(command-3171921041650758:1)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1$$anonfun$apply$1.apply(command-3171921041650758:1)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-3171921041650758:1)
at line6cd233c6e76046b688ada150c95cd077118.$read$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(command-3171921041650758:1)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:987)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$28.apply(RDD.scala:987)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2323)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2323)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
at org.apache.spark.scheduler.Task.run(Task.scala:113)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:537)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:543)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

"java.io.IOException: No FileSystem for scheme: wasbs" is an Azure Blob storage configuration problem: you need to configure the Azure ADLS connection and the required libraries.
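A minimal sketch of what that could look like (my illustration, not the answerer's code; storage account, container, secret scope and folder names are placeholders). For ADLS Gen2 the abfss:// driver is typically used rather than wasbs://, and keeping the copy loop on the driver with a Scala parallel collection sidesteps the executor-side filesystem configuration entirely:

// Make the storage account reachable from the session (account-key auth shown;
// service-principal / OAuth configuration is also possible).
spark.conf.set(
  "fs.azure.account.key.<storage-account>.dfs.core.windows.net",
  dbutils.secrets.get(scope = "<secret-scope>", key = "<key-name>"))

val srcRoot = "abfss://<container>@<storage-account>.dfs.core.windows.net/source/"
val dstRoot = "abfss://<container>@<storage-account>.dfs.core.windows.net/target/"

// List once on the driver, then fan the per-file copies out over a parallel
// collection; dbutils stays on the driver, where the filesystem is already configured.
val files = dbutils.fs.ls(srcRoot).map(_.path).par
files.foreach { src => dbutils.fs.cp(src, src.replace(srcRoot, dstRoot)) }

Throughput is then bounded by the driver's cores and network, but nothing filesystem-related runs on the executors, so the "No FileSystem for scheme" failure does not occur.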
