What is the correct way to get a Hadoop FileSystem object that can be used for reading from/writing to HDFS?

What is the correct way to create a FileSystem object that can be used for reading from/writing to HDFS? In some examples I've found, they do something like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/hdfs-site.xml"));

// FileSystem.get returns the FileSystem named by fs.defaultFS (throws IOException)
final FileSystem fs = FileSystem.get(conf);

From looking at the documentation for the Configuration class, it looks like the properties from core-site.xml are automatically loaded when the object is created if that file is on the classpath, so there is no need to set it again.
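For example, the following minimal sketch should work when core-site.xml is on the classpath (assuming it defines fs.defaultFS for the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// core-site.xml is loaded automatically as a default resource from the
// classpath, so no explicit addResource call is needed here.
final Configuration conf = new Configuration();
final FileSystem fs = FileSystem.get(conf);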

I haven't found anything that says why adding hdfs-site.xml would be required, and it seems to work fine without it.

Would it be safe to just put core-site.xml on the classpath and skip hdfs-site.xml, or should I be setting both like I've seen in the examples? In what cases would the properties from hdfs-site.xml be required?

FileSystem needs only one configuration key to successfully connect to HDFS. Previously it was fs.default.name; from Hadoop 2.x (the YARN era) onward it was deprecated in favor of fs.defaultFS. So the following snippet is sufficient for the connection.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// Use "fs.default.name" on older clusters, "fs.defaultFS" from Hadoop 2.x onward
conf.set("fs.defaultFS", "hdfs://host:port");

FileSystem fs = FileSystem.get(conf);

Tip: Check which of the two keys exists in core-site.xml and set the same value associated with it in conf. If the machine from which you are running the code doesn't have the host name mapping, put its IP instead. On a MapR cluster the value will have a prefix like maprfs://.
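For reference, the entry to look for in core-site.xml looks something like this (the host and port are placeholders; 8020 is a common NameNode RPC port):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:8020</value>
</property>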

For the question:

Would it be safe to just put core-site.xml on the classpath and skip hdfs-site.xml, or should I be setting both like I've seen in the examples? In what cases would the properties from hdfs-site.xml be required?

I did an experiment: if you are using CDH (Cloudera's Distribution Including Apache Hadoop; my version is Hadoop 2.6.0-cdh5.11.1), it's not safe to use core-site.xml only. It will throw an exception:

Request processing failed; nested exception is java.lang.IllegalArgumentException: java.net.UnknownHostException

And if you add hdfs-site.xml, it works. The likely reason is that such clusters are set up with HDFS high availability: fs.defaultFS then points to a logical nameservice name rather than a real host, and the NameNode addresses that resolve that nameservice live in hdfs-site.xml, so without it the client fails with an UnknownHostException.
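A minimal sketch of the hdfs-site.xml entries involved in such an HA setup (the nameservice name nameservice1 and the host names are placeholders, not taken from the original answer):

<property>
  <name>dfs.nameservices</name>
  <value>nameservice1</value>
</property>
<property>
  <name>dfs.ha.namenodes.nameservice1</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameservice1.nn1</name>
  <value>namenode1-host:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameservice1.nn2</name>
  <value>namenode2-host:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.nameservice1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>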

Here's a block of code from one of my projects for building a Configuration usable for HBase, HDFS, and MapReduce. Notice that addResource will search the active classpath for the resource entries you name.

// config is an org.apache.hadoop.conf.Configuration instance;
// addHbaseResources layers hbase-default.xml and hbase-site.xml onto it
HBaseConfiguration.addHbaseResources(config);
config.addResource("mapred-default.xml");
config.addResource("mapred-site.xml");

My classpath definitely includes the directories housing core-site.xml, hdfs-site.xml, mapred-site.xml, and hbase-site.xml.
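One quick sanity check (a hypothetical addition, not part of the original project) is to print a key after loading the resources:

// Should print the cluster's fs.defaultFS if core-site.xml was found on the classpath
System.out.println(config.get("fs.defaultFS"));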
