
Hadoop commands from python

I am trying to get some stats for a directory in HDFS: the number of files/subdirectories and the size of each. I started out thinking that I could do this in bash.

#!/bin/bash
# Capture the listing, then count its lines (one per entry)
OP=$(hadoop fs -ls hdfs://mydirectory)
echo "$OP" | wc -l

I only have this much so far, and I quickly realised that Python might be a better option. However, I am not able to figure out how to execute Hadoop commands like hadoop fs -ls from Python.

Try the following snippet:

import subprocess

# List the HDFS directory and print each line of the command's output
output = subprocess.Popen(["hadoop", "fs", "-ls", "/user"],
                          stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for line in output.stdout:
    print(line)

Additionally, you can refer to this subprocess example, where you can get the return status, output and error message separately.
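For instance, here is a minimal sketch of that approach, using communicate() to collect everything at once (the path /user is just a placeholder):

import subprocess

# Run the command and gather exit status, stdout and stderr separately
process = subprocess.Popen(["hadoop", "fs", "-ls", "/user"],
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = process.communicate()  # waits for the command to finish
print("status:", process.returncode)
print("output:", out)
print("error:", err)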

See https://docs.python.org/2/library/commands.html for your options, including how to get the return status (in case of an error). The basic code you're missing is

import commands

# getoutput returns the command's output (stdout and stderr) as one string
hdir_list = commands.getoutput('hadoop fs -ls hdfs://mydirectory')
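To tie this back to the question's goal, a minimal sketch of counting the listed entries from that string (assuming the usual "Found N items" header line that hadoop fs -ls prints first):

import commands

# Count the listed entries; the first line is the "Found N items" header
hdir_list = commands.getoutput('hadoop fs -ls hdfs://mydirectory')
num_entries = len(hdir_list.splitlines()) - 1
print(num_entries)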

Yes: deprecated in 2.6, still useful in 2.7, but removed from Python 3. If that bothers you, switch to

os.system(<command string>)

... or better yet, use subprocess.call (introduced in 2.4).
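For example, a brief sketch of the subprocess.call route (the directory path is a placeholder):

import subprocess

# subprocess.call runs the command and returns its exit status
status = subprocess.call(['hadoop', 'fs', '-ls', 'hdfs://mydirectory'])
if status != 0:
    print('hadoop fs -ls failed with exit status', status)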
