brew installed apache-spark unable to access s3 files
After brew install apache-spark, sc.textFile("s3n://...") in spark-shell fails with java.io.IOException: No FileSystem for scheme: s3n. This is not the case in a spark-shell accessed through an EC2 machine launched with spark-ec2. The Homebrew formula appears to build with a sufficiently recent version of Hadoop, and this error is thrown whether or not brew install hadoop has been run first.

How can I install Spark with Homebrew such that it will be able to read s3n:// files?
S3 filesystems aren't enabled in Hadoop 2.6 by default, so Spark versions built against Hadoop 2.6 have no S3-based filesystems available either. Possible solutions:
Solution 1. Use Spark built with Hadoop 2.4 (just change the file name in the formula to "spark-1.5.1-bin-hadoop2.4.tgz" and update the sha256) and the s3n:// fs will work.
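A minimal sketch of that change, assuming the formula still uses plain url/sha256 fields (the URL follows the standard Apache archive layout, and the checksum placeholder must be replaced with the real value for the hadoop2.4 tarball):

```shell
# Open the apache-spark formula for editing
brew edit apache-spark

# In the formula, point the download at the hadoop2.4 build, e.g.:
#   url "https://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.4.tgz"
#   sha256 "<sha256 of spark-1.5.1-bin-hadoop2.4.tgz>"

# Reinstall from the modified formula
brew reinstall apache-spark
```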
Solution 2. Enable the s3n:// filesystem. Specify the --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem option when you start spark-shell.
You should also set the path to the required libraries: --conf spark.driver.extraClassPath=<path>/* --conf spark.executor.extraClassPath=<path>/* where <path> is the directory containing the hadoop-aws, aws-java-sdk-1.7.4 and guava-11.0.2 jars.
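Putting Solution 2 together, a spark-shell invocation might look like this (the jar directory is an assumption — adjust it to wherever you placed the jars):

```shell
# Directory holding the hadoop-aws, aws-java-sdk-1.7.4 and guava-11.0.2 jars
# (illustrative path, not a standard location)
JARS=$HOME/spark-s3-jars

spark-shell \
  --conf spark.hadoop.fs.s3n.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem \
  --conf spark.driver.extraClassPath="$JARS/*" \
  --conf spark.executor.extraClassPath="$JARS/*"

# Inside the shell, s3n:// paths should then resolve:
#   sc.textFile("s3n://bucket/key").count()
```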
Solution 3. Use the newer s3a:// filesystem. It's enabled by default. The path to the required libraries should be set too.
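For example (the tools/lib path below is an assumption based on a typical Hadoop distribution layout; adjust it to yours):

```shell
# s3a:// is registered by default; only the library classpath is needed.
# Illustrative path into a Hadoop 2.7+ distribution's tools/lib directory:
JARS=/usr/local/hadoop-2.7.1/share/hadoop/tools/lib

spark-shell \
  --conf spark.driver.extraClassPath="$JARS/*" \
  --conf spark.executor.extraClassPath="$JARS/*"

# Then read with the s3a scheme:
#   sc.textFile("s3a://bucket/key").count()
```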
Note 1: Options can also be set in the conf/spark-defaults.conf file so you don't need to provide them every time with --conf; read the Spark configuration guide.
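For instance, the s3n settings from Solution 2 could live in conf/spark-defaults.conf (the jar path is illustrative):

```
# conf/spark-defaults.conf
spark.hadoop.fs.s3n.impl       org.apache.hadoop.fs.s3native.NativeS3FileSystem
spark.driver.extraClassPath    /path/to/s3-jars/*
spark.executor.extraClassPath  /path/to/s3-jars/*
```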
Note 2: You can point <path> to the share/hadoop/tools/lib directory in a Hadoop 2.6+ distribution (s3a requires libraries from Hadoop 2.7+) or get the required libraries from Maven Central.
Note 3: Provide credentials for s3n in environment variables, in the ~/.aws/config file, or via --conf spark.hadoop.fs.s3n.awsAccessKeyId= --conf spark.hadoop.fs.s3n.awsSecretAccessKey=.
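A sketch of the environment-variable route, assuming the standard AWS variable names (credential values elided on purpose):

```shell
# Standard AWS credential variables; Spark copies these into the
# Hadoop s3n configuration when the shell starts (values elided)
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...

spark-shell
```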
s3a requires the --conf spark.hadoop.fs.s3a.access.key= --conf spark.hadoop.fs.s3a.secret.key= options (environment variables and the .aws file don't work for it).
Note 4: s3:// can be set as an alias for either s3n (--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3native.NativeS3FileSystem) or s3a (--conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem).
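For example, to route plain s3:// URIs through the s3a implementation (the classpath setup from Solution 3 is still required):

```shell
# Alias the s3:// scheme to the s3a filesystem implementation
spark-shell --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem

# Now s3:// paths are handled by S3AFileSystem:
#   sc.textFile("s3://bucket/key").count()
```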