
Hadoop - How to get a Path object of an HDFS file

I'm trying to figure out the various ways to write content and files to HDFS in a Hadoop cluster.

I know there are org.apache.hadoop.fs.FileSystem.get() and org.apache.hadoop.fs.FileSystem.getLocal(), which let you create an output stream and write byte by byte. If you are using OutputCollector.collect(), this doesn't seem to be the intended way to write to HDFS. I believe you have to use OutputCollector.collect() when implementing Mappers and Reducers; correct me if I'm wrong.

I know you can call FileOutputFormat.setOutputPath() before even running the job, but it looks like it only accepts objects of type Path.

Looking at the Path class in org.apache.hadoop.fs, I do not see anything that lets you specify remote or local. And looking at org.apache.hadoop.fs.FileSystem, I do not see anything that returns an object of type Path.

  1. Does FileOutputFormat.setOutputPath() always have to write to the local file system? I don't think so; I vaguely remember reading that one job's output can be used as another job's input. This leads me to believe there is also a way to point it at HDFS.
  2. Is using a data stream as described the only way to write to HDFS?

org.apache.hadoop.fs.FileSystem.get and org.apache.hadoop.fs.FileSystem.getLocal return a FileSystem object. FileSystem is a generic abstraction that can be implemented as either a local file system or a distributed file system.
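The difference between the two lookups can be sketched roughly as follows. This is a minimal illustration, assuming the Hadoop client libraries are on the classpath; the class name FileSystemLookup and its helper methods are hypothetical and not part of Hadoop.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocalFileSystem;

// Hypothetical demo class; only the FileSystem calls come from Hadoop.
public class FileSystemLookup {
    /** Returns the URI of whichever file system the configuration names as default. */
    static String defaultFsUri(Configuration conf) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        return fs.getUri().toString();
    }

    /** Returns the URI of the local file system, regardless of the configured default. */
    static String localFsUri(Configuration conf) throws IOException {
        LocalFileSystem local = FileSystem.getLocal(conf);
        return local.getUri().toString();
    }

    public static void main(String[] args) throws IOException {
        // Picks up core-site.xml from the classpath if one is present.
        Configuration conf = new Configuration();
        System.out.println("default: " + defaultFsUri(conf));
        System.out.println("local:   " + localFsUri(conf));
    }
}
```

On a machine configured for a cluster, the first line would be expected to show an hdfs:// URI while the second stays on the file: scheme.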
OutputCollector doesn't write to HDFS itself. It just provides a collect method that mappers and reducers use to emit their output (both intermediate and final). By the way, it is deprecated in favor of the Context object.
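For comparison, here is a rough sketch of a mapper written against the newer org.apache.hadoop.mapreduce API, where Context.write takes the place of OutputCollector.collect. The word-counting logic, the class name TokenCountMapper, and the tokenize helper are assumptions made for illustration only.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: emits (token, 1) for every whitespace-separated token.
public class TokenCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    /** Splits a line on whitespace and drops empty tokens. */
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        for (String token : line.split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : tokenize(line.toString())) {
            word.set(token);
            // The mapper only hands records to the framework here; where they
            // end up is decided by the job's output format, not by this class.
            context.write(word, ONE);
        }
    }
}
```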
FileOutputFormat.setOutputPath sets the final output directory by setting mapred.output.dir, which can be on your local file system or on a distributed one.
About remote or local: fs.default.name determines that. If you set it to file:///, the local file system is used; if you set it to hdfs://, HDFS is used, and so on.
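As a small sketch of how the default file system affects a Path that carries no scheme of its own: the example below sets the property programmatically instead of through core-site.xml. It uses fs.defaultFS, the newer name for fs.default.name in Hadoop 2 and later; the class name DefaultFsDemo, the qualify helper, and the path /data/out are hypothetical.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical demo class showing how a scheme-less path gets its file system.
public class DefaultFsDemo {
    /** Qualifies rawPath against the file system named by defaultFs. */
    static Path qualify(String defaultFs, String rawPath) throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", defaultFs); // normally set in core-site.xml
        FileSystem fs = FileSystem.get(conf);
        return fs.makeQualified(new Path(rawPath));
    }

    public static void main(String[] args) throws IOException {
        // file:/// keeps the demo runnable without a cluster; an hdfs://host:port
        // value would resolve the same path against HDFS instead.
        Path qualified = qualify("file:///", "/data/out");
        System.out.println(qualified.toUri().getScheme() + " -> " + qualified);
    }
}
```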
And about writing to HDFS: whatever method you use to write files in Hadoop, it will be using FSDataOutputStream underneath. FSDataOutputStream is a wrapper around java.io.OutputStream. By the way, whenever you want to write to a file system in Java, you have to create a stream object for it.
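Writing through the stream directly might look like the sketch below. It uses the local file system so that it runs without a cluster; the same writeText call should work against a FileSystem obtained for an hdfs:// URI. The class name StreamWriteDemo, the writeText helper, and the file name are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical demo class; FileSystem.create returns the FSDataOutputStream.
public class StreamWriteDemo {
    /** Writes text to target, overwriting any existing file, and returns the file length. */
    static long writeText(FileSystem fs, Path target, String text) throws IOException {
        try (FSDataOutputStream out = fs.create(target, true)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return fs.getFileStatus(target).getLen();
    }

    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.getLocal(new Configuration());
        String tmpDir = Files.createTempDirectory("stream-demo").toString();
        Path target = new Path(tmpDir, "hello.txt");
        System.out.println(writeText(fs, target, "hello") + " bytes written to " + target);
    }
}
```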
FileOutputFormat has the method FileOutputFormat.setOutputPath(job, output_path), where output_path lets you specify whether to use the local file system or HDFS, overriding the settings in core-site.xml. For example, FileOutputFormat.setOutputPath(job, new Path("hdfs://localhost:9000/path_to_file")) sets up the output to be written to HDFS; change the URI to file:/// and you can write to the local file system. Change localhost and the port number to match your settings. In the same way, the input can also be overridden per job.
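Put together, a per-job override could be sketched like this, using the newer org.apache.hadoop.mapreduce classes. The class name OutputPathDemo, the configure helper, the job name, and the /tmp paths are hypothetical; file:/// URIs are used so the sketch runs without a cluster, and an hdfs://localhost:9000/... URI would be passed the same way.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical demo class: only wires up paths, it does not submit the job.
public class OutputPathDemo {
    /** Builds a job whose input and output locations are given as full URIs. */
    static Job configure(String inputUri, String outputUri) throws IOException {
        Job job = Job.getInstance(new Configuration(), "output-path-demo");
        // The scheme in each URI decides the file system for that path,
        // independently of the default set in core-site.xml.
        FileInputFormat.addInputPath(job, new Path(inputUri));
        FileOutputFormat.setOutputPath(job, new Path(outputUri));
        return job;
    }

    public static void main(String[] args) throws IOException {
        Job job = configure("file:///tmp/demo-input", "file:///tmp/demo-output");
        System.out.println("output goes to " + FileOutputFormat.getOutputPath(job));
    }
}
```

Mapper, reducer, and key/value classes would still need to be set before calling job.waitForCompletion(true).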


 