
Spark Streaming - Parquet file upload to S3 error

I'm completely new to Spark Streaming.
In my streaming application I create Parquet files of about 2.5 MB each and store them either on S3 or in a local directory.

The method I'm using is as follows:

data.write.parquet(destination)

where "data" is a DataFrame 其中“数据”是一个DataFrame

If the destination is a local path, everything works like a charm, but as soon as I point it at S3 with a path like "s3n://bucket/directory/filename", I get the following exception:

    15/12/17 10:47:06 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker-3,5,main]
    java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method)
        at org.apache.hadoop.io.nativeio.NativeIO$Windows.access(NativeIO.java:557)
        at org.apache.hadoop.fs.FileUtil.canRead(FileUtil.java:977)
        at org.apache.hadoop.util.DiskChecker.checkAccessByFileMethods(DiskChecker.java:187)
        at org.apache.hadoop.util.DiskChecker.checkDirAccess(DiskChecker.java:174)
        at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:108)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.confChanged(LocalDirAllocator.java:285)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:344)
        at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.createTmpFileForWrite(LocalDirAllocator.java:416)
        at org.apache.hadoop.fs.LocalDirAllocator.createTmpFileForWrite(LocalDirAllocator.java:198)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.newBackupFile(NativeS3FileSystem.java:263)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsOutputStream.<init>(NativeS3FileSystem.java:245)
        at org.apache.hadoop.fs.s3native.NativeS3FileSystem.create(NativeS3FileSystem.java:412)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:889)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:786)
        at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:176)
        at org.apache.parquet.hadoop.ParquetFileWriter.<init>(ParquetFileWriter.java:160)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:289)
        at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:262)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetRelation.scala:94)
        at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation$$anon$3.newInstance(ParquetRelation.scala:272)
        at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:234)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
        at org.apache.spark.scheduler.Task.run(Task.scala:88)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

Reading from the bucket works fine. Despite the error, something does get stored in the bucket: it creates folder entries (like "directory&folder") for the given path, but in the end, instead of the actual file, there is only a "filename&folder" entry.

Tech Details:

  • S3 Browser
  • Windows 8.1
  • IntelliJ CE 14.1.5
  • Spark Streaming Application
  • Spark 1.5 for Hadoop 2.6.0

The problem was in the Hadoop libs. I had to rebuild winutils (winutils.exe) and the native lib (hadoop.dll) with Windows SDK 7, then move them to %HADOOP_HOME%\bin and add %HADOOP_HOME%\bin to the Path variable. The projects to rebuild can be found under hadoop-2.7.1-src\hadoop-common-project\hadoop-common\target. For winutils I recommend using the Windows-optimized branch http://svn.apache.org/repos/asf/hadoop/common/branches/branch-trunk-win/
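
As a complement, here is a minimal sketch of the programmatic equivalent, assuming the rebuilt winutils.exe and hadoop.dll end up under C:\hadoop\bin (that path is an assumption; setting HADOOP_HOME and Path as described above achieves the same thing):

    object HadoopNativeSetup {
      // Call this before the SparkContext is created.
      def configure(): Unit = {
        // Tells Hadoop's Shell utilities where to find bin\winutils.exe,
        // equivalent to setting the HADOOP_HOME environment variable.
        System.setProperty("hadoop.home.dir", "C:\\hadoop")
        // hadoop.dll must additionally be loadable by the JVM, which is why
        // the directory containing it also has to be on the Path variable.
      }
    }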
