
HDFS file access in Spark

I am developing an application where I read a file from Hadoop, process it, and store the data back to Hadoop. I am confused about what the proper HDFS file path format should be. When reading an HDFS file from the Spark shell like

val file=sc.textFile("hdfs:///datastore/events.txt")

it works fine and I am able to read it.

But when I submit the jar containing the same code to YARN, it gives the following error:

org.apache.hadoop.HadoopIllegalArgumentException: Uri without authority: hdfs:/datastore/events.txt

When I add the name node address, as in hdfs://namenodeserver/datastore/events.txt, everything works.

I am a bit confused about this behaviour and need some guidance.

Note: I am using an AWS EMR setup and all the configurations are default.
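The error message itself points at the cause: a URI of the form hdfs:///... has an empty authority component, so there is no name node to contact unless something fills it in from fs.defaultFS. A quick sketch with the standard java.net.URI class (plain JDK, not Spark-specific; shown only to illustrate the parsing) makes this visible:

```scala
import java.net.URI

// "hdfs:///..." parses with no authority component at all:
val noAuthority = new URI("hdfs:///datastore/events.txt")
println(noAuthority.getAuthority)   // null -> "Uri without authority"

// With the name node included, the authority is present:
val withAuthority = new URI("hdfs://namenodeserver:8020/datastore/events.txt")
println(withAuthority.getAuthority) // namenodeserver:8020
```

In the Spark shell the empty authority is harmless because fs.defaultFS from core-site.xml supplies it; on the YARN submission path in the question it evidently was not applied.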

If you want to use sc.textFile("hdfs://..."), you need to give the full (absolute) path; in your example that would be "nn1home:8020/..".

If you want to keep it simple, just use sc.textFile("hdfs:/input/war-and-peace.txt").

That's only one /.

I think it will work.

Problem solved. As I debugged further, the fs.defaultFS property from core-site.xml was not being used when I passed the path as hdfs:///path/to/file, even though all the Hadoop config properties were loaded (I logged the sparkContext.hadoopConfiguration object).

As a workaround, I manually read the property with sparkContext.hadoopConfiguration().get("fs.defaultFS") and prepended it to the path.

I don't know whether this is the correct way of doing it.
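The workaround above can be sketched as follows. The helper name qualify and the use of java.net.URI are my own illustration, not code from the original post; the Spark-specific part is only the hadoopConfiguration lookup mentioned in the text:

```scala
import java.net.URI

// Build a fully qualified HDFS URI from fs.defaultFS plus an absolute path.
// (Hypothetical helper: URI.resolve keeps the scheme and authority of
// defaultFS and replaces the path component.)
def qualify(defaultFS: String, path: String): String =
  new URI(defaultFS).resolve(path).toString

// In a Spark job you would obtain defaultFS from the loaded Hadoop config:
//   val defaultFS = sc.hadoopConfiguration.get("fs.defaultFS")
//   val file = sc.textFile(qualify(defaultFS, "/datastore/events.txt"))
val full = qualify("hdfs://namenodeserver:8020", "/datastore/events.txt")
// full == "hdfs://namenodeserver:8020/datastore/events.txt"
```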

