Python: How to import a list of files in a directory from HDFS
I am trying to import a list of files from HDFS in Python. On a local filesystem I do the following; how can I do the same from HDFS?
import glob
import pandas as pd

path = r'/my_path'
allFiles = glob.glob(path + "/*.csv")

df_list = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0, sep=';')
    df_list.append(df)
I think subprocess.Popen does the trick, but how do I extract only the filename?
import subprocess

p = subprocess.Popen("hdfs dfs -ls /my_path/",
                     shell=True,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)

for line in p.stdout.readlines():
    print(line)
The output looks like this:
b'Found 32 items\n'
b'-rw------- 3 user hdfs 42202621 2019-01-21 10:05 /my_path/file1.csv\n'
b'-rw------- 3 user hdfs 99320020 2019-01-21 10:05 /my_path/file2.csv\n'
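(One way to extract just the file paths from that output is to split each line on whitespace and keep the last column; a minimal sketch, assuming the eight-column line format shown above, where the "Found 32 items" header has fewer columns and is skipped by the length check:)
import subprocess

out = subprocess.check_output("hdfs dfs -ls /my_path/", shell=True)

file_names = []
for line in out.splitlines():
    fields = line.decode().strip().split()
    # a file line has 8 whitespace-separated columns; the last one is the full path
    if len(fields) == 8 and fields[-1].endswith('.csv'):
        file_names.append(fields[-1])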
Disclaimer: This will be long and tedious. But given the circumstances, I'll try to make it as general and reproducible as possible.
Given the requirement of no external libraries (except for pandas?), there isn't much of a choice. I suggest utilizing WebHDFS as much as possible.
AFAIK, an installation of HDFS includes, by default, an installation of WebHDFS. The following solution relies heavily on WebHDFS.
To begin with, you must be aware of WebHDFS URLs. WebHDFS is installed on the HDFS Namenode(s), and the default port is 50070.
Therefore, we start with http://[namenode_ip]:50070/webhdfs/v1/, where /webhdfs/v1/ is the common prefix of every request.
For the sake of example, let's assume it is http://192.168.10.1:50070/webhdfs/v1.
Ordinarily, one can use curl to list the contents of an HDFS directory. For a detailed explanation, refer to WebHDFS REST API: List a Directory.
If you were to use curl, the following provides the FileStatuses of all the files inside a given directory.
curl "http://192.168.10.1:50070/webhdfs/v1/<PATH>?op=LISTSTATUS"
^^^^^^^^^^^^ ^^^^^ ^^^^ ^^^^^^^^^^^^^
Namenode IP Port Path Operation
As mentioned, this returns the FileStatuses in a JSON object:
{
    "FileStatuses":
    {
        "FileStatus":
        [
            {
                "accessTime"      : 1320171722771,
                "blockSize"       : 33554432,
                "group"           : "supergroup",
                "length"          : 24930,
                "modificationTime": 1320171722771,
                "owner"           : "webuser",
                "pathSuffix"      : "a.patch",
                "permission"      : "644",
                "replication"     : 1,
                "type"            : "FILE"
            },
            {
                "accessTime"      : 0,
                "blockSize"       : 0,
                "group"           : "supergroup",
                "length"          : 0,
                "modificationTime": 1320895981256,
                "owner"           : "szetszwo",
                "pathSuffix"      : "bar",
                "permission"      : "711",
                "replication"     : 0,
                "type"            : "DIRECTORY"
            },
            ...
        ]
    }
}
The same result can be achieved in Python with the requests package:
import requests

my_path = '/my_path/'
# no slash after v1, because my_path already starts with '/'
curl = requests.get('http://192.168.10.1:50070/webhdfs/v1%s?op=LISTSTATUS' % my_path)
As shown above, the actual status of each file sits two levels down in the result JSON. In other words, to get the FileStatus of each file:
curl.json()['FileStatuses']['FileStatus']
[
    {
        "accessTime"      : 1320171722771,
        "blockSize"       : 33554432,
        "group"           : "supergroup",
        "length"          : 24930,
        "modificationTime": 1320171722771,
        "owner"           : "webuser",
        "pathSuffix"      : "a.patch",
        "permission"      : "644",
        "replication"     : 1,
        "type"            : "FILE"
    },
    {
        "accessTime"      : 0,
        "blockSize"       : 0,
        "group"           : "supergroup",
        "length"          : 0,
        "modificationTime": 1320895981256,
        "owner"           : "szetszwo",
        "pathSuffix"      : "bar",
        "permission"      : "711",
        "replication"     : 0,
        "type"            : "DIRECTORY"
    },
    ...
]
Since you now have all the information you need, all that's left to do is parse it.
import os

file_paths = []
for file_status in curl.json()['FileStatuses']['FileStatus']:
    file_name = file_status['pathSuffix']
    # this is the file name within the queried directory
    if file_name.endswith('.csv'):
        # the if statement is only needed if the directory contains unwanted (i.e. non-csv) files
        file_paths.append(os.path.join(my_path, file_name))
        # os.path.join ensures the result is an absolute path
file_paths
['/my_path/file1.csv',
'/my_path/file2.csv',
...]
Now that you know the paths of the files and the WebHDFS links, pandas.read_csv can handle the rest of the work.
import pandas as pd

dfs = []
web_url = "http://192.168.10.1:50070/webhdfs/v1%s?op=OPEN"
#                                                ^^^^^^^^
#                                                Operation is now OPEN
# note: no slash after v1, because each file_path already begins with '/'
for file_path in file_paths:
    file_url = web_url % file_path
    # http://192.168.10.1:50070/webhdfs/v1/my_path/file1.csv?op=OPEN
    dfs.append(pd.read_csv(file_url))
And there you go: all the .csvs are imported and assigned to dfs.
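If, like df_list in your original snippet, you ultimately want a single combined DataFrame, the list can be concatenated:
df = pd.concat(dfs, ignore_index=True)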
If your HDFS is configured for HA (High Availability), there will be multiple namenodes, and your namenode_ip must be set accordingly: it must be the IP of the active node.
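A minimal sketch of picking the active node by probing, assuming two hypothetical candidate IPs and assuming that a standby namenode does not answer a LISTSTATUS request with HTTP 200 (it typically rejects read operations):
import requests

# hypothetical candidates; substitute the IPs of your own namenodes
candidates = ['192.168.10.1', '192.168.10.2']

namenode_ip = None
for ip in candidates:
    try:
        r = requests.get('http://%s:50070/webhdfs/v1/?op=LISTSTATUS' % ip,
                         timeout=5)
        if r.status_code == 200:
            namenode_ip = ip  # this node answered the read, so treat it as active
            break
    except requests.exceptions.ConnectionError:
        pass  # node unreachable; try the next candidate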