"java.lang.UnsatisfiedLinkError" when writing a single CSV to S3
I'm facing the error above when trying to write a single CSV file from my on-prem Spark cluster to an S3 bucket.
Here's my code:
from pyspark.sql import SparkSession

# my_host_ip, processador and mem are defined elsewhere
spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName('my-app') \
    .config("spark.driver.host", my_host_ip) \
    .config("spark.driver.port", "12345") \
    .config("spark.network.timeout", 10000000) \
    .config("spark.executor.heartbeatInterval", 10000000) \
    .config("spark.storage.blockManagerSlaveTimeoutMs", 10000000) \
    .config("spark.sql.debug.maxToStringFields", 2000) \
    .config("spark.driver.maxResultSize", "100g") \
    .config("spark.cores.max", str(processador)) \
    .config("spark.executor.memory", f'{mem}g') \
    .config("spark.driver.memory", f'{mem}g') \
    .config("spark.sql.session.timeZone", "UTC") \
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=GMT -Dcom.amazonaws.services.s3.enableV4=true") \
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=GMT -Dcom.amazonaws.services.s3.enableV4=true") \
    .getOrCreate()
# note: calling .config() twice with the same key keeps only the last value,
# so the -Duser.timezone and enableV4 JVM options are combined above
spark.sparkContext.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'my-key')
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'the-secret')
# create a DataFrame with two rows
df = spark.createDataFrame([(1, 'val-col-1'), (2, 'val-col-2')], ['col1', 'col2'])

# write it to the bucket as a single CSV file
df.repartition(1).write \
    .save(path='s3a://folder1/folder2/csv_output/', mode='overwrite', format='csv', header=True)
Here are the things I tried:
Here are the jars and versions:
Here's the error stack trace:
ERROR Executor: Exception in task 35.0 in stage 0.0 (TID 35)
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:187)
at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
at org.apache.hadoop.fs.s3a.S3AOutputStream.<init>(S3AOutputStream.java:87)
at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:410)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:38)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:84)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
In the end, is it even possible to write a single CSV from Spark (PySpark) to S3, and is it substantially faster than a traditional approach (technically, boto3)?
Did I misunderstand anything?
Thanks in advance.
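For comparison, the "traditional approach" mentioned above would look roughly like this sketch: build the CSV in memory with the standard library, then upload it with boto3. The bucket and key names are placeholders, and the upload call is left commented out.

```python
import csv
import io

# The same two rows as in the question
rows = [(1, "val-col-1"), (2, "val-col-2")]

def rows_to_csv(rows, header=("col1", "col2")):
    """Serialize rows into a single CSV string with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

body = rows_to_csv(rows)
# Hypothetical upload (bucket/key are placeholders):
# import boto3
# boto3.client("s3").put_object(
#     Bucket="my-bucket", Key="folder2/csv_output/out.csv", Body=body)
```

This skips Spark's task scheduling and temp-file machinery entirely, which for a two-row file is the main cost.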
Set fs.s3a.fast.upload.buffer to bytebuffer.
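A sketch of how that setting could be applied from PySpark. The extra fs.s3a.fast.upload switch is an assumption about the Hadoop version (some 2.x lines require it before the buffer option takes effect), and `hadoop_conf` stands in for spark.sparkContext._jsc.hadoopConfiguration():

```python
# Buffering uploads in memory ("bytebuffer") means S3AOutputStream never
# asks LocalDirAllocator/DiskChecker for a local temp file -- the frames
# that blow up with UnsatisfiedLinkError in the stack trace above.
S3A_OVERRIDES = {
    "fs.s3a.fast.upload": "true",              # assumed needed on older Hadoop 2.x
    "fs.s3a.fast.upload.buffer": "bytebuffer",
}

def apply_s3a_overrides(hadoop_conf, overrides=S3A_OVERRIDES):
    """Set each key on a Hadoop Configuration-like object (anything with .set)."""
    for key, value in overrides.items():
        hadoop_conf.set(key, value)
    return dict(overrides)

# With a live session (not executed here):
# apply_s3a_overrides(spark.sparkContext._jsc.hadoopConfiguration())
```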
Action #3 will dodge the problem until you do something like try to use an s3a committer; #2 is where you need to put some effort in.
The libraries you need are at https://github.com/cdarlint/winutils
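Since the NativeIO$Windows frames in the trace indicate the driver is actually running on Windows, here is a sketch of pointing Hadoop at those binaries before the SparkSession (and thus the JVM) is created. The C:\hadoop path is a hypothetical unpack location; its bin subfolder must contain winutils.exe and hadoop.dll.

```python
import os

def point_hadoop_at_winutils(hadoop_home):
    """Export the variables Hadoop's native Windows code looks for.

    Must run before the SparkSession is created; hadoop_home/bin is
    assumed to hold winutils.exe and hadoop.dll from cdarlint/winutils.
    """
    os.environ["HADOOP_HOME"] = hadoop_home
    # Prepend the bin folder so the JVM can load hadoop.dll from it
    os.environ["PATH"] = (
        os.path.join(hadoop_home, "bin") + os.pathsep + os.environ.get("PATH", "")
    )
    return os.environ["HADOOP_HOME"]

# Hypothetical location where the winutils release was unpacked:
point_hadoop_at_winutils(r"C:\hadoop")
```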