
Get list of files from hdfs (hadoop) directory using python script

How can I get a list of files from an hdfs (hadoop) directory using a python script?

I have tried the following line:

dir = sc.textFile("hdfs://127.0.0.1:1900/directory").collect()

The directory contains the files "file1, file2, file3 ... fileN". Using that line I only get the contents of the files, but I need the list of file names.

Can anyone please help me figure this out?

Thanks in advance.

Use subprocess

import subprocess

# List the directory and keep only the 8th column of the output (the file path).
p = subprocess.Popen("hdfs dfs -ls <HDFS Location> | awk '{print $8}'",
                     shell=True,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)

for line in p.stdout.readlines():
    print(line)

EDIT: An answer without python. The first option can also be used to recursively print all the sub-directories. The last redirect can be omitted or changed to suit your requirements.

hdfs dfs -ls -R <HDFS LOCATION> | awk '{print $8}' > output.txt
hdfs dfs -ls <HDFS LOCATION> | awk '{print $8}' > output.txt

EDIT: Corrected a missing quote in the awk command.

import subprocess

path = "/data"
args = "hdfs dfs -ls "+path+" | awk '{print $8}'"
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)

s_output, s_err = proc.communicate()
all_dart_dirs = s_output.split() #stores list of files and sub-directories in 'path'
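
Under Python 3, communicate() returns bytes, so the entries in all_dart_dirs above are bytes objects. A minimal sketch of the same approach that decodes the output first (the /data path is just the placeholder used above):

import subprocess

path = "/data"  # placeholder HDFS directory, as above

# Same pipeline as above: list the directory and keep only the path column.
args = "hdfs dfs -ls " + path + " | awk '{print $8}'"
out = subprocess.check_output(args, shell=True)

# Decode before splitting so the result is a list of str paths, not bytes.
all_dart_dirs = out.decode("utf-8").split()
print(all_dart_dirs)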

Why not have the HDFS client do the hard work by using the -C flag instead of relying on awk or python to print the specific columns of interest?

i.e. Popen(['hdfs', 'dfs', '-ls', '-C', dirname])

Afterwards, split the output on new lines and then you will have your list of paths.
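
A minimal sketch of that idea, assuming the hdfs CLI is on the PATH and using a placeholder directory name:

from subprocess import Popen, PIPE

dirname = "/user/data"  # placeholder HDFS directory

# -C prints only the paths, one per line, so no awk/column handling is needed.
out, err = Popen(['hdfs', 'dfs', '-ls', '-C', dirname],
                 stdout=PIPE, stderr=PIPE).communicate()
paths = out.decode().splitlines()
print(paths)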

Here's an example along with logging and error handling (including for when the directory/file doesn't exist):

from subprocess import Popen, PIPE
import logging
logger = logging.getLogger(__name__)

FAILED_TO_LIST_DIRECTORY_MSG = 'No such file or directory'

class HdfsException(Exception):
    pass

def hdfs_ls(dirname):
    """Returns list of HDFS directory entries."""
    logger.info('Listing HDFS directory ' + dirname)
    proc = Popen(['hdfs', 'dfs', '-ls', '-C', dirname], stdout=PIPE, stderr=PIPE)
    (out, err) = proc.communicate()
    if out:
        logger.debug('stdout:\n' + out)
    if proc.returncode != 0:
        errmsg = 'Failed to list HDFS directory "' + dirname + '", return code ' + str(proc.returncode)
        logger.error(errmsg)
        logger.error(err)
        if not FAILED_TO_LIST_DIRECTORY_MSG in err:
            raise HdfsException(errmsg)
        return []
    elif err:
        logger.debug('stderr:\n' + err)
    return out.splitlines()
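
Typical usage of the function above might look like this (the directory name is only an illustration):

# Print every entry under a hypothetical HDFS directory; hdfs_ls() returns []
# when the path does not exist and raises HdfsException on other failures.
for path in hdfs_ls('/user/data'):
    print(path)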

For Python 3:

from subprocess import Popen, PIPE

hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()

# Skip the leading "Found N items" line, then take the last space-separated
# field of each remaining line (the full HDFS path).
lines = std_out.decode().splitlines()[1:]
list_of_file_names = [line.split(' ')[-1].split('/')[-1] for line in lines]
list_of_file_names_with_full_address = [line.split(' ')[-1] for line in lines]

Use the following:

hdfsdir = r"hdfs://VPS-DATA1:9000/dir/"
filepaths = [ line.rsplit(None,1)[-1] for line in sh.hdfs('dfs','-ls',hdfsdir).split('\n') if len(line.rsplit(None,1))][1:]

for path in filepaths:
    print(path)

To get the list of hdfs files in a directory:

import json
import sh  # third-party "sh" package that wraps shell commands
import pydoop.hdfs as hdfs  # assumption: pydoop supplies the hdfs.open() used below

hdfsdir = "/path/to/hdfs/directory"
filelist = [line.rsplit(None, 1)[-1] for line in sh.hdfs('dfs', '-ls', hdfsdir).split('\n') if len(line.rsplit(None, 1))][1:]

for path in filelist:
    # read a data file from HDFS
    with hdfs.open(path, "r") as read_file:
        # do whatever you need with the contents
        data = json.load(read_file)

This list contains all the files in the hdfs directory:

filelist = [ line.rsplit(None,1)[-1] for line in sh.hdfs('dfs','-ls',hdfsdir).split('\n') if len(line.rsplit(None,1))][1:]
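
Note that this answer relies on two third-party pieces: the sh package (pip install sh) to invoke the hdfs CLI, and an hdfs client module providing open() (pydoop is one assumption that fits). A rough standard-library equivalent for just the listing step, as a sketch with a placeholder path:

import subprocess

hdfsdir = "/path/to/hdfs/directory"  # placeholder path

out = subprocess.check_output(['hdfs', 'dfs', '-ls', hdfsdir]).decode()
# Drop the leading "Found N items" line and keep the last column (the path).
filelist = [line.rsplit(None, 1)[-1] for line in out.splitlines()
            if line.rsplit(None, 1)][1:]
print(filelist)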

