Get a list of files from an HDFS (Hadoop) directory using a Python script

Question

How can I get a list of files from an HDFS (Hadoop) directory using a Python script?

I have tried the following line:

dir = sc.textFile("hdfs://127.0.0.1:1900/directory").collect()

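The directory contains the files "file1, file2, file3 ... fileN". With this line I only get the combined contents of all the files, but what I need is the list of file names.

Can anyone help me figure this out?

Thanks in advance.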

Answer 1

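Use subprocess:

import subprocess
p = subprocess.Popen("hdfs dfs -ls <HDFS Location> | awk '{print $8}'",
    shell=True,
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    universal_newlines=True)  # return str instead of bytes on Python 3

for line in p.stdout.readlines():
    # print() as a function works on Python 3; rstrip() drops the trailing newline
    print(line.rstrip())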

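Edit: an answer without Python. The first command also prints all subdirectories recursively. The trailing redirection can be omitted or changed to suit your needs.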

hdfs dfs -ls -R <HDFS LOCATION> | awk '{print $8}' > output.txt
hdfs dfs -ls <HDFS LOCATION> | awk '{print $8}' > output.txt

Edit: corrected the missing quotes in the awk command.
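The awk step above only picks the eighth column, so the same thing can be done in Python once the raw `hdfs dfs -ls` output has been captured. The following is a minimal sketch, assuming the usual eight-column long format; the function name `parse_hdfs_ls` and the sample listing are made up for illustration.

```python
def parse_hdfs_ls(listing):
    """Return the path column from `hdfs dfs -ls` long-format output."""
    paths = []
    for line in listing.splitlines():
        # Entry lines have 8 columns: permissions, replication, owner,
        # group, size, date, time, path. Splitting at most 7 times keeps
        # a path that contains spaces in one piece.
        fields = line.split(None, 7)
        if len(fields) == 8:  # skips the "Found N items" header and blanks
            paths.append(fields[7])
    return paths


# Hypothetical sample output, invented for this example.
sample = """Found 2 items
-rw-r--r--   3 alice supergroup       1024 2017-04-04 18:48 /directory/file1
drwxr-xr-x   - alice supergroup          0 2017-04-04 18:50 /directory/sub dir
"""
print(parse_hdfs_ls(sample))
```

Unlike `awk '{print $8}'`, this keeps the whole path when a name contains spaces.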

Answer 2

import subprocess

path = "/data"
args = "hdfs dfs -ls "+path+" | awk '{print $8}'"
proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)

s_output, s_err = proc.communicate()
all_dart_dirs = s_output.split() #stores list of files and sub-directories in 'path'
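The list built above mixes files and subdirectories. If they need to be told apart, one option is to keep the full `hdfs dfs -ls` output (without the awk step) and look at the permission string, which starts with `d` for directories. This is only a sketch under that assumption; `split_files_and_dirs` and the sample text are hypothetical.

```python
def split_files_and_dirs(listing):
    """Split `hdfs dfs -ls` long-format output into (files, dirs)."""
    files, dirs = [], []
    for line in listing.splitlines():
        fields = line.split(None, 7)
        if len(fields) != 8:
            continue  # header line such as "Found N items", or a blank line
        # The first column is the permission string, e.g. "drwxr-xr-x".
        target = dirs if fields[0].startswith('d') else files
        target.append(fields[7])
    return files, dirs


# Hypothetical sample output, invented for this example.
sample = """Found 2 items
-rw-r--r--   3 alice supergroup       2048 2019-04-17 11:15 /data/part-0000
drwxr-xr-x   - alice supergroup          0 2019-04-17 11:16 /data/logs
"""
files, dirs = split_files_and_dirs(sample)
print(files, dirs)
```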

Answer 3

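Why not let the HDFS client do the hard work by using the -C flag, instead of relying on awk or Python to print the specific column of interest?

That is, Popen(['hdfs', 'dfs', '-ls', '-C', dirname])

Afterwards, split the output on newlines and you have the list of paths.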
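As a rough illustration of the "run, then split on newlines" step, the sketch below runs a command given as an argument list (no shell) and returns its output lines. The helper name `list_lines` is hypothetical, and a small Python one-liner stands in for the `hdfs` client so the snippet can run on a machine without Hadoop.

```python
import subprocess
import sys


def list_lines(command):
    """Run `command` (an argument list, no shell) and return its stdout lines."""
    result = subprocess.run(command, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, universal_newlines=True)
    result.check_returncode()  # raises CalledProcessError on a non-zero exit
    return result.stdout.splitlines()


# Stand-in command that prints two paths, used here instead of the hdfs client.
stand_in = [sys.executable, '-c', "print('/data/file1'); print('/data/file2')"]
print(list_lines(stand_in))

# With a Hadoop client on PATH, the call would look like:
# paths = list_lines(['hdfs', 'dfs', '-ls', '-C', '/data'])
```

Passing a list rather than a shell string also avoids quoting problems when the directory name contains spaces.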

Here is an example with logging and error handling (including the case where the directory or file does not exist):

from subprocess import Popen, PIPE
import logging
logger = logging.getLogger(__name__)

FAILED_TO_LIST_DIRECTORY_MSG = 'No such file or directory'

class HdfsException(Exception):
    pass

def hdfs_ls(dirname):
    """Returns list of HDFS directory entries."""
    logger.info('Listing HDFS directory ' + dirname)
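    # universal_newlines=True makes out/err str rather than bytes on Python 3,
    # so the string concatenations below work.
    proc = Popen(['hdfs', 'dfs', '-ls', '-C', dirname], stdout=PIPE, stderr=PIPE,
                 universal_newlines=True)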
    (out, err) = proc.communicate()
    if out:
        logger.debug('stdout:\n' + out)
    if proc.returncode != 0:
        errmsg = 'Failed to list HDFS directory "' + dirname + '", return code ' + str(proc.returncode)
        logger.error(errmsg)
        logger.error(err)
        if FAILED_TO_LIST_DIRECTORY_MSG not in err:
            raise HdfsException(errmsg)
        return []
    elif err:
        logger.debug('stderr:\n' + err)
    return out.splitlines()

Answer 4

For Python 3:

from subprocess import Popen, PIPE

hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
# A str has no readlines(); splitlines() gives the lines without a trailing
# empty entry. [1:] skips the first line ("Found N items").
lines = std_out.decode().splitlines()[1:]
# The path is the last space-separated field of each line.
list_of_file_names = [fn.split(' ')[-1].split('/')[-1] for fn in lines]
list_of_file_names_with_full_address = [fn.split(' ')[-1] for fn in lines]
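The bare file names above come from splitting each path on '/'. A small alternative sketch uses `posixpath.basename`, which applies POSIX-style rules on any local operating system and, combined with `rstrip('/')`, copes with a trailing slash. The helper name `base_names` and the example paths are hypothetical.

```python
import posixpath


def base_names(paths):
    """Return the last path component of each HDFS path."""
    # rstrip('/') so that a directory written as ".../sub/" still yields "sub".
    return [posixpath.basename(p.rstrip('/')) for p in paths]


# Hypothetical example paths.
example_paths = ['/path/to/file1.txt', 'hdfs://host:9000/dir/sub/']
print(base_names(example_paths))
```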

Answer 5

Use the following:

import sh  # third-party "sh" package; the snippet needs it for sh.hdfs

hdfsdir = r"hdfs://VPS-DATA1:9000/dir/"
filepaths = [ line.rsplit(None,1)[-1] for line in sh.hdfs('dfs','-ls',hdfsdir).split('\n') if len(line.rsplit(None,1))][1:]

for path in filepaths:
    print(path)

Answer 6

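Get the list of HDFS files in a directory:

import json
import sh  # third-party "sh" package
import pydoop.hdfs as hdfs  # assumption: the answer does not say which module `hdfs` is

hdfsdir = '/path/to/hdfs/directory'
filelist = [ line.rsplit(None,1)[-1] for line in sh.hdfs('dfs','-ls',hdfsdir).split('\n') if len(line.rsplit(None,1))][1:]
for path in filelist:
    # read each data file from HDFS
    with hdfs.open(path, "r") as read_file:
        # do whatever you need with the file
        data = json.load(read_file)

This expression builds the list of all entries in the HDFS directory: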

filelist = [ line.rsplit(None,1)[-1] for line in sh.hdfs('dfs','-ls',hdfsdir).split('\n') if len(line.rsplit(None,1))][1:]
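Since the loop above passes every entry to `json.load`, it may help to filter the listing to JSON files first. The sketch below does that with a glob-style pattern on the last path component; `filter_paths`, the `*.json` pattern, and the sample paths are assumptions made for illustration, not part of the original answer.

```python
import fnmatch
import posixpath


def filter_paths(paths, pattern):
    """Keep only the paths whose last component matches a glob pattern."""
    return [p for p in paths
            if fnmatch.fnmatchcase(posixpath.basename(p), pattern)]


# Hypothetical listing, shaped like the `filelist` built above.
sample_list = ['/dir/a.json', '/dir/b.txt', '/dir/sub', '/dir/c.json']
print(filter_paths(sample_list, '*.json'))
```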

Answer 1: score 9, 2017-04-04 18:48:17
Answer 2: score 2, 2019-04-17 11:15:57
Answer 3: score 1, 2020-04-11 14:57:59
Answer 4: score 0, 2020-03-30 21:38:42
Answer 5: score 0, 2022-09-30 09:09:17
Answer 6: score 0, 2022-09-30 13:39:13
