
"java.lang.UnsatisfiedLinkError" when writing single CSV on S3

I'm facing the error above when trying to write a single CSV file from my on-prem Spark cluster to an S3 bucket.

Here's my code:

from pyspark.sql import SparkSession

# note: setting the same config key twice keeps only the last value, so the
# timezone and enableV4 JVM options are combined into one string below
spark = SparkSession \
    .builder \
    .master('local[*]') \
    .appName('my-app') \
    .config("spark.driver.host", my_host_ip) \
    .config("spark.driver.port", "12345") \
    .config("spark.network.timeout", 10000000) \
    .config("spark.executor.heartbeatInterval", 10000000) \
    .config("spark.storage.blockManagerSlaveTimeoutMs", 10000000) \
    .config("spark.sql.debug.maxToStringFields", 2000) \
    .config("spark.driver.maxResultSize", "100g") \
    .config("spark.cores.max", str(processador)) \
    .config("spark.executor.memory", f'{mem}g') \
    .config("spark.driver.memory", f'{mem}g') \
    .config("spark.sql.session.timeZone", "UTC") \
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=GMT -Dcom.amazonaws.services.s3.enableV4=true") \
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=GMT -Dcom.amazonaws.services.s3.enableV4=true") \
    .getOrCreate()
            
spark.sparkContext.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")

spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.us-east-2.amazonaws.com")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.BasicAWSCredentialsProvider")
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'my-key')
spark.sparkContext._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'the-secret')

# creating a DataFrame with 2 records
df = spark.createDataFrame([(1, 'val-col-1'), (2, 'val-col-2')], ['col1', 'col2'])

# writing to the bucket
df.repartition(1).write \
    .save(path='s3a://folder1/folder2/csv_output/', mode='overwrite', format='csv', header=True)

Here are the things I tried:

  1. Persisting on my machine (it worked);
  2. Changing the Hadoop version (and the dependencies as well);
  3. (Because my tests run on Windows): adding and deleting "winutils" and "hadoop.dll" (see the sketch after this list);
  4. Reading instead of writing.
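
For context on item 3, here is a minimal sketch of how the native binaries are usually wired up on Windows; the install path is an assumption, and these variables must be set before the JVM (i.e. the SparkSession) starts:

import os

# assumed install location for the winutils binaries; adjust to your machine
os.environ["HADOOP_HOME"] = r"C:\hadoop"
# hadoop.dll must be on PATH (and match the Hadoop version), otherwise
# NativeIO$Windows.access0 fails with the UnsatisfiedLinkError above
os.environ["PATH"] = os.environ["HADOOP_HOME"] + r"\bin;" + os.environ["PATH"]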

Here are the JARs and versions:

  • Spark version = 3.0.1
  • Hadoop version = 2.7.4
  • jets3t-0.9.0
  • aws-java-sdk-1.7.4
  • hadoop-aws-2.7.4
  • hadoop-client-2.7.4
  • hadoop-common-2.7.4

Here's the stack trace of the error:

ERROR Executor: Exception in task 35.0 in stage 0.0 (TID 35)
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
    at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:609)
    at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
    at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:187)
    at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
    at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
    at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
    at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
    at org.apache.hadoop.fs.s3a.S3AOutputStream.<init>(S3AOutputStream.java:87)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:410)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:911)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:892)
    at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:789)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
    at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
    at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:38)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:84)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

In the end, is it even possible to write a single CSV from Spark (PySpark) to S3, and is it substantially faster than using a traditional approach (technically, boto3)?
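
For comparison, a minimal sketch of that traditional approach, assuming pandas is available and the DataFrame is small enough to collect on the driver; the bucket and key names are placeholders:

import boto3

# collect the small DataFrame on the driver and upload it as one CSV object
csv_bytes = df.toPandas().to_csv(index=False).encode("utf-8")
boto3.client("s3").put_object(
    Bucket="my-bucket",                  # placeholder bucket name
    Key="folder2/csv_output/data.csv",   # placeholder object key
    Body=csv_bytes,
)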

Did I misunderstand anything?

Thanks in advance.

  1. Use a consistent set of Hadoop 3.2 JARs.
  2. It looks like one of the native DLLs isn't on PATH, or is a different version.
  3. Set fs.s3a.fast.upload.buffer to bytebuffer (see the sketch below).

Action #3 will dodge the problem until you do something like try to use an S3A committer; #2 is where you need to put some effort in.
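
A minimal sketch of action #3, applied to the same Hadoop configuration the question already uses (this assumes the consistent Hadoop 3.x JARs from point #1, where the fast upload path is on by default):

# buffer uploads in memory instead of staging blocks on local disk, so S3A
# skips the DiskChecker/NativeIO path that raises the UnsatisfiedLinkError
hconf = spark.sparkContext._jsc.hadoopConfiguration()
hconf.set("fs.s3a.fast.upload.buffer", "bytebuffer")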

The libraries you need are at https://github.com/cdarlint/winutils
