
brew installed apache-spark unable to access s3 files

After brew install apache-spark, sc.textFile("s3n://...") in spark-shell fails with java.io.IOException: No FileSystem for scheme: s3n. This is not the case in a spark-shell accessed through an EC2 machine launched with spark-ec2. The Homebrew formula appears to build against a sufficiently recent version of Hadoop, and this error is thrown whether or not brew install hadoop has been run first.

How can I install Spark with Homebrew such that it will be able to read s3n:// files?

S3 filesystems are not enabled in Hadoop 2.6 by default, so Spark builds that bundle Hadoop 2.6 have no S3-based filesystem available either. Possible solutions:

  • Solution 1. Use Spark built with Hadoop 2.4 (just change the file name in the formula to "spark-1.5.1-bin-hadoop2.4.tgz" and update the sha256) and the s3n:// filesystem will work.

  • Solution 2. Enable the s3n:// filesystem. Specify the --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem option when you start spark-shell.

    You should also set the path to the required libraries: --conf spark.driver.extraClassPath=<path>/* --conf spark.executor.extraClassPath=<path>/*, where <path> is the directory containing the hadoop-aws, aws-java-sdk-1.7.4 and guava-11.0.2 jars. Example launch commands are shown after this list.

  • Solution 3. Use the newer s3a:// filesystem. It is enabled by default, but the path to the required libraries still has to be set.
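
As a concrete illustration of Solutions 2 and 3, here is a sketch of the spark-shell launch commands. <path> is left as a placeholder for the directory that actually holds the hadoop-aws, aws-java-sdk and guava jars (see Note 2 below):

    # Solution 2: enable the s3n:// filesystem explicitly
    spark-shell \
      --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
      --conf spark.driver.extraClassPath=<path>/* \
      --conf spark.executor.extraClassPath=<path>/*

    # Solution 3: s3a:// is enabled by default, only the classpath is needed
    spark-shell \
      --conf spark.driver.extraClassPath=<path>/* \
      --conf spark.executor.extraClassPath=<path>/*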

Note 1: Options can also be set in the conf/spark-defaults.conf file so you don't need to pass them with --conf every time; read the guide.
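
For example, the s3n settings from Solution 2 could be written into conf/spark-defaults.conf roughly like this (a sketch; <path>/* is still a placeholder for your own jar directory):

    spark.hadoop.fs.s3n.impl        org.apache.hadoop.fs.s3native.NativeS3FileSystem
    spark.driver.extraClassPath     <path>/*
    spark.executor.extraClassPath   <path>/*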

Note 2: You can point <path> to the share/hadoop/tools/lib directory of a Hadoop 2.6+ distribution (s3a requires libraries from Hadoop 2.7+), or get the required libraries from Maven Central ( 1 , 2 , 3 ).

Note 3: Provide credentials for s3n in environment variables, in the ~/.aws/config file, or via --conf spark.hadoop.fs.s3n.awsAccessKeyId= --conf spark.hadoop.fs.s3n.awsSecretAccessKey= .

s3a requires the --conf spark.hadoop.fs.s3a.access.key= --conf spark.hadoop.fs.s3a.secret.key= options (environment variables and the .aws file are not used).
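
A minimal sketch of both credential styles; the key values are placeholders, and for s3n the standard AWS variable names are assumed:

    # s3n: export credentials before starting spark-shell
    export AWS_ACCESS_KEY_ID=<your access key>
    export AWS_SECRET_ACCESS_KEY=<your secret key>

    # s3a: pass the keys as Spark options (add the extraClassPath options as well)
    spark-shell \
      --conf spark.hadoop.fs.s3a.access.key=<your access key> \
      --conf spark.hadoop.fs.s3a.secret.key=<your secret key>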

Note 4: s3:// can be set as an alias for either s3n ( --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem ) or s3a ( --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem ).
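
Putting the pieces together, an end-to-end check with the s3a variant and the s3:// alias might look like the following sketch; the bucket, key and <path> values are placeholders:

    spark-shell \
      --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
      --conf spark.hadoop.fs.s3a.access.key=<your access key> \
      --conf spark.hadoop.fs.s3a.secret.key=<your secret key> \
      --conf spark.driver.extraClassPath=<path>/* \
      --conf spark.executor.extraClassPath=<path>/*

    scala> sc.textFile("s3://<bucket>/<key>").count()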
