What is the correct way to get a Hadoop FileSystem object that can be used for reading from/writing to HDFS?

What is the correct way to create a FileSystem object that can be used for reading from/writing to HDFS? In some examples I've found, they do something like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

final Configuration conf = new Configuration();
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/core-site.xml"));
conf.addResource(new Path("/usr/local/hadoop/etc/hadoop/hdfs-site.xml"));

// FileSystem.get returns the FileSystem named by fs.defaultFS (throws IOException)
final FileSystem fs = FileSystem.get(conf);

From looking at the documentation for the Configuration class, it looks like the properties from core-site.xml are automatically loaded when the object is created if that file is on the classpath, so there is no need to set it again.
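For example, the following minimal sketch should work when core-site.xml is on the classpath (assuming it defines fs.defaultFS for the cluster):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// core-site.xml is loaded automatically as a default resource from the
// classpath, so no explicit addResource call is needed here.
final Configuration conf = new Configuration();
final FileSystem fs = FileSystem.get(conf);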

I haven't found anything that says why adding hdfs-site.xml would be required, and it seems to work fine without it.

Would it be safe to just put core-site.xml on the classpath and skip hdfs-site.xml, or should I be setting both like I've seen in the examples? In what cases would the properties from hdfs-site.xml be required?

FileSystem needs only one configuration key to successfully connect to HDFS. Previously it was fs.default.name; from Hadoop 2.x (the YARN era) onward it was deprecated in favor of fs.defaultFS. So the following snippet is sufficient for the connection.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
// Use "fs.default.name" on older clusters, "fs.defaultFS" from Hadoop 2.x onward
conf.set("fs.defaultFS", "hdfs://host:port");

FileSystem fs = FileSystem.get(conf);

Tip: Check which of the two keys exists in core-site.xml and set the same value associated with it in conf. If the machine from which you are running the code doesn't have the host name mapping, put its IP instead. On a MapR cluster the value will have a prefix like maprfs://.
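For reference, the entry to look for in core-site.xml looks something like this (the host and port are placeholders; 8020 is a common NameNode RPC port):

<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:8020</value>
</property>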

For the question:

Would it be safe to just put core-site.xml on the classpath and skip hdfs-site.xml, or should I be setting both like I've seen in the examples? In what cases would the properties from hdfs-site.xml be required?

I did an experiment: if you are using CDH (Cloudera's Distribution Including Apache Hadoop; my version is Hadoop 2.6.0-cdh5.11.1), it's not safe to use core-site.xml only. It will throw an exception:

Request processing failed; nested exception is java.lang.IllegalArgumentException: java.net.UnknownHostException

And if you add hdfs-site.xml, it works. The likely reason is that such clusters are set up with HDFS high availability: fs.defaultFS then points to a logical nameservice name rather than a real host, and the NameNode addresses that resolve that nameservice live in hdfs-site.xml, so without it the client fails with an UnknownHostException.
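A minimal sketch of the hdfs-site.xml entries involved in such an HA setup (the nameservice name nameservice1 and the host names are placeholders, not taken from the original answer):

<property>
  <name>dfs.nameservices</name>
  <value>nameservice1</value>
</property>
<property>
  <name>dfs.ha.namenodes.nameservice1</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameservice1.nn1</name>
  <value>namenode1-host:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameservice1.nn2</name>
  <value>namenode2-host:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.nameservice1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>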

Here's a block of code from one of my projects for building a Configuration usable for HBase, HDFS, and MapReduce. Notice that addResource will search the active classpath for the resource entries you name.

// config is an org.apache.hadoop.conf.Configuration instance;
// addHbaseResources layers hbase-default.xml and hbase-site.xml onto it
HBaseConfiguration.addHbaseResources(config);
config.addResource("mapred-default.xml");
config.addResource("mapred-site.xml");

My classpath definitely includes the directories housing core-site.xml, hdfs-site.xml, mapred-site.xml, and hbase-site.xml.
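One quick sanity check (a hypothetical addition, not part of the original project) is to print a key after loading the resources:

// Should print the cluster's fs.defaultFS if core-site.xml was found on the classpath
System.out.println(config.get("fs.defaultFS"));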
