
Spark Streaming - Parquet file upload to S3 error

I'm completely new to Spark Streaming.
In my streaming application I create Parquet files of about 2.5 MB each and store them either on S3 or in a local directory.

The method I'm using is as follows:

data.write.parquet(destination)

where "data" is a DataFrame 其中“数据”是一个DataFrame

If the destination is a local path, everything works like a charm, but as soon as I point it at S3 with a path like "s3n://bucket/directory/filename", I get the following exception:

    15/12/17 10:47:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
    java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:557)
        at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
        at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:187)
        at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
        at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.newBackupFile(NativeS3FileSystem.java:263)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.<init>(NativeS3FileSystem.java:245)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:412)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
        at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:176)
        at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:160)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:234)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Reading from the bucket works fine. Despite the error, something does get stored in the bucket: it creates folder entries (like "directory&folder") for the given path, but in the end, instead of the actual file, there is only a "filename&folder" entry.

Tech Details:

  • S3 Browser
  • Windows 8.1
  • IntelliJ CE 14.1.5
  • Spark Streaming Application
  • Spark 1.5 for Hadoop 2.6.0

The problem was in the Hadoop libs. I had to rebuild winutils (winutils.exe) and the native lib (hadoop.dll) with Windows SDK 7, then move them to %HADOOP_HOME%\bin and add %HADOOP_HOME%\bin to the Path variable. The projects to rebuild can be found under hadoop-2.7.1-src\hadoop-common-project\hadoop-common\target. For winutils I recommend using the Windows-optimized branch http://svn.apache.org/repos/asf/hadoop/common/branches/branch-trunk-win/
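
As a complement, here is a minimal sketch of the programmatic equivalent, assuming the rebuilt winutils.exe and hadoop.dll end up under C:\hadoop\bin (that path is an assumption; setting HADOOP_HOME and Path as described above achieves the same thing):

    object HadoopNativeSetup {
      // Call this before the SparkContext is created.
      def configure(): Unit = {
        // Tells Hadoop's Shell utilities where to find bin\winutils.exe,
        // equivalent to setting the HADOOP_HOME environment variable.
        System.setProperty("hadoop.home.dir", "C:\\hadoop")
        // hadoop.dll must additionally be loadable by the JVM, which is why
        // the directory containing it also has to be on the Path variable.
      }
    }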
