How to read files in an HDFS directory using Python
I am trying to read files inside a directory in HDFS using Python. I used the code below, but I am getting an error.
Code:
from subprocess import Popen, PIPE

cat = Popen(["hadoop", "fs", "-cat", "/user/cloudera/CCMD"], stdout=PIPE)
Error:
cat: `/user/cloudera/CCMD': Is a directory
Traceback (most recent call last):
File "hrkpat.py", line 6, in <module>
tree = ET.parse(cat.stdout)
File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 862, in parse
tree.parse(source, parser)
File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 587, in parse
self._root = parser.close()
File "/usr/lib64/python2.6/xml/etree/ElementTree.py", line 1254, in close
self._parser.Parse("", 1) # end of data
xml.parsers.expat.ExpatError: no element found: line 1, column 0
Update:
I have 10-15 XML files in my HDFS directory that I want to parse. I am able to parse the XML when only one file is present in the directory, but as soon as there are multiple files I am no longer able to parse them. For this use case I want to write Python code that parses one file from the directory and, once it is done, moves on to the next one.
You can use the wildcard character * to read all files in the directory:

hadoop fs -cat /user/cloudera/CCMD/*

Or read just the XML files:

hadoop fs -cat /user/cloudera/CCMD/*.xml
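For reference, a minimal sketch of how that wildcard could be used from the original Popen call; hadoop fs expands the glob itself, so no shell is needed. One caveat: this concatenates every matching file into a single stream, which is generally not one well-formed XML document, so it suits raw reading more than ET.parse.

from subprocess import Popen, PIPE

# hadoop fs expands the * glob itself; no shell=True required
cat = Popen(["hadoop", "fs", "-cat", "/user/cloudera/CCMD/*.xml"],
            stdout=PIPE)
data = cat.stdout.read()  # concatenated contents of all matching files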
The exception is:

cat: `/user/cloudera/CCMD': Is a directory
You are trying to perform a file operation on a directory. Pass the path of a file to the command.
Use this command in subprocess instead:

hadoop fs -cat /user/cloudera/CCMD/filename
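To cover the one-file-at-a-time use case from the update, here is a minimal sketch that lists the directory with hadoop fs -ls and parses each file in turn. It assumes the path is the last whitespace-separated field of each ls output line and that the files end in .xml; neither assumption comes from the original answer.

from subprocess import Popen, PIPE
import xml.etree.ElementTree as ET

# List the directory; skip the "Found N items" header by keeping
# only lines whose last field ends in .xml
ls = Popen(["hadoop", "fs", "-ls", "/user/cloudera/CCMD"], stdout=PIPE)
paths = [line.split()[-1] for line in ls.stdout
         if line.strip().endswith(".xml")]

for path in paths:
    # Cat one file at a time so each stream is a single XML document
    cat = Popen(["hadoop", "fs", "-cat", path], stdout=PIPE)
    tree = ET.parse(cat.stdout)
    print(tree.getroot().tag)  # placeholder: process the parsed tree here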