
brew installed apache-spark unable to access s3 files

After brew install apache-spark, sc.textFile("s3n://...") in spark-shell fails with java.io.IOException: No FileSystem for scheme: s3n. This is not the case in a spark-shell accessed through an EC2 machine launched with spark-ec2. The Homebrew formula appears to build against a sufficiently recent version of Hadoop, and this error is thrown whether or not brew install hadoop has been run first.

How can I install Spark with Homebrew such that it will be able to read s3n:// files?

S3 filesystems are not enabled in Hadoop 2.6 by default, so Spark builds that bundle Hadoop 2.6 have no S3-based filesystem available either. Possible solutions:

  • Solution 1. Use Spark built with Hadoop 2.4 (just change the file name in the formula to "spark-1.5.1-bin-hadoop2.4.tgz" and update the sha256) and the s3n:// filesystem will work.

  • Solution 2. Enable the s3n:// filesystem. Specify the --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem option when you start spark-shell.

    You should also set the path to the required libraries: --conf spark.driver.extraClassPath=<path>/* --conf spark.executor.extraClassPath=<path>/*, where <path> is the directory containing the hadoop-aws, aws-java-sdk-1.7.4 and guava-11.0.2 jars. Example launch commands are shown after this list.

  • Solution 3. Use the newer s3a:// filesystem. It is enabled by default, but the path to the required libraries still has to be set.
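
As a concrete illustration of Solutions 2 and 3, here is a sketch of the spark-shell launch commands. <path> is left as a placeholder for the directory that actually holds the hadoop-aws, aws-java-sdk and guava jars (see Note 2 below):

    # Solution 2: enable the s3n:// filesystem explicitly
    spark-shell \
      --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
      --conf spark.driver.extraClassPath=<path>/* \
      --conf spark.executor.extraClassPath=<path>/*

    # Solution 3: s3a:// is enabled by default, only the classpath is needed
    spark-shell \
      --conf spark.driver.extraClassPath=<path>/* \
      --conf spark.executor.extraClassPath=<path>/*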

Note 1: Options can also be set in the conf/spark-defaults.conf file so you don't need to pass them with --conf every time; read the guide.
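
For example, the s3n settings from Solution 2 could be written into conf/spark-defaults.conf roughly like this (a sketch; <path>/* is still a placeholder for your own jar directory):

    spark.hadoop.fs.s3n.impl        org.apache.hadoop.fs.s3native.NativeS3FileSystem
    spark.driver.extraClassPath     <path>/*
    spark.executor.extraClassPath   <path>/*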

Note 2: You can point <path> to the share/hadoop/tools/lib directory of a Hadoop 2.6+ distribution (s3a requires libraries from Hadoop 2.7+), or get the required libraries from Maven Central ( 1 , 2 , 3 ).

Note 3: Provide credentials for s3n in environment variables, in the ~/.aws/config file, or via --conf spark.hadoop.fs.s3n.awsAccessKeyId= --conf spark.hadoop.fs.s3n.awsSecretAccessKey= .

s3a requires the --conf spark.hadoop.fs.s3a.access.key= --conf spark.hadoop.fs.s3a.secret.key= options (environment variables and the .aws file are not used).
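
A minimal sketch of both credential styles; the key values are placeholders, and for s3n the standard AWS variable names are assumed:

    # s3n: export credentials before starting spark-shell
    export AWS_ACCESS_KEY_ID=<your access key>
    export AWS_SECRET_ACCESS_KEY=<your secret key>

    # s3a: pass the keys as Spark options (add the extraClassPath options as well)
    spark-shell \
      --conf spark.hadoop.fs.s3a.access.key=<your access key> \
      --conf spark.hadoop.fs.s3a.secret.key=<your secret key>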

Note 4: s3:// can be set as an alias for either s3n ( --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem ) or s3a ( --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem ).
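
Putting the pieces together, an end-to-end check with the s3a variant and the s3:// alias might look like the following sketch; the bucket, key and <path> values are placeholders:

    spark-shell \
      --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
      --conf spark.hadoop.fs.s3a.access.key=<your access key> \
      --conf spark.hadoop.fs.s3a.secret.key=<your secret key> \
      --conf spark.driver.extraClassPath=<path>/* \
      --conf spark.executor.extraClassPath=<path>/*

    scala> sc.textFile("s3://<bucket>/<key>").count()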
