
"java.lang.UnsatisfiedLinkError" when writing single CSV on S3

I'm facing the error above when trying to write a single CSV file from my on-prem Spark cluster to an S3 bucket.

Here's my code:

from pyspark.sql import SparkSession

# note: setting the same config key twice keeps only the last value, so the
# timezone and enableV4 JVM options are combined into one string below
spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName('my-app') \
    .config("spark.driver.host", my_host_ip) \
    .config("spark.driver.port", "12345") \
    .config("spark.network.timeout", 10000000) \
    .config("spark.executor.heartbeatInterval", 10000000) \
    .config("spark.storage.blockManagerSlaveTimeoutMs", 10000000) \
    .config("spark.sql.debug.maxToStringFields", 2000) \
    .config("spark.driver.maxResultSize", "100g") \
    .config("spark.cores.max", str(processador)) \
    .config("spark.executor.memory", f'{mem}g') \
    .config("spark.driver.memory", f'{mem}g') \
    .config("spark.sql.session.timeZone", "UTC") \
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=GMT -Dcom.amazonaws.services.s3.enableV4=true") \
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=GMT -Dcom.amazonaws.services.s3.enableV4=true") \
    .getOrCreate()
            
spark.sparkContext.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'my-key')
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'the-secret')

# creating a DataFrame with 2 records
df = spark.createDataFrame([(1, 'val-col-1'), (2, 'val-col-2')], ['col1', 'col2'])

# writing to the bucket
df.repartition(1).write \
    .save(path='s3a://folder1/folder2/csv_output/', mode='overwrite', format='csv', header=True)

Here are the things I tried:

  1. Persisting on my machine (it worked);
  2. Changing the Hadoop version (and the dependencies as well);
  3. (Because my tests run on Windows): adding and deleting "winutils" and "hadoop.dll" (see the sketch after this list);
  4. Reading instead of writing.
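
For context on item 3, here is a minimal sketch of how the native binaries are usually wired up on Windows; the install path is an assumption, and these variables must be set before the JVM (i.e. the SparkSession) starts:

import os

# assumed install location for the winutils binaries; adjust to your machine
os.environ["HADOOP_HOME"] = r"C:\hadoop"
# hadoop.dll must be on PATH (and match the Hadoop version), otherwise
# NativeIO$Windows.access0 fails with the UnsatisfiedLinkError above
os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]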

Here are the JARs and versions:

  • Spark version = 3.0.1
  • Hadoop version = 2.7.4
  • jets3t-0.9.0
  • aws-java-sdk-1.7.4
  • hadoop-aws-2.7.4
  • hadoop-client-2.7.4
  • hadoop-common-2.7.4

Here's the stack trace of the error:

ERROR Executor: Exception in task 35.0 in stage 0.0 (TID 35)
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
    at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:187)
    at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
    at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
    at org.apache.hadoop.fs.s3a.S3AOutputStream.<init>(S3AOutputStream.java:87)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:410)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
    at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:38)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:84)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

In the end, is it even possible to write a single CSV from Spark (PySpark) to S3, and is it substantially faster than using a traditional approach (technically, boto3)?
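
For comparison, a minimal sketch of that traditional approach, assuming pandas is available and the DataFrame is small enough to collect on the driver; the bucket and key names are placeholders:

import boto3

# collect the small DataFrame on the driver and upload it as one CSV object
csv_bytes = df.toPandas().to_csv(index=False).encode("utf-8")
boto3.client("s3").put_object(
    Bucket="my-bucket",                  # placeholder bucket name
    Key="folder2/csv_output/data.csv",   # placeholder object key
    Body=csv_bytes,
)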

Did I misunderstand anything?

Thanks in advance.

  1. Use a consistent set of Hadoop 3.2 JARs.
  2. It looks like one of the native DLLs isn't on PATH, or is a different version.
  3. Set fs.s3a.fast.upload.buffer to bytebuffer (see the sketch below).

Action #3 will dodge the problem until you do something like try to use an S3A committer; #2 is where you need to put some effort in.
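
A minimal sketch of action #3, applied to the same Hadoop configuration the question already uses (this assumes the consistent Hadoop 3.x JARs from point #1, where the fast upload path is on by default):

# buffer uploads in memory instead of staging blocks on local disk, so S3A
# skips the DiskChecker/NativeIO path that raises the UnsatisfiedLinkError
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.fast.upload.buffer", "bytebuffer")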

The libraries you need are at https://github.com/cdarlint/winutils
