Python: How to import a list of files in a directory from HDFS
I am trying to import a list of files from HDFS in Python. On a local filesystem I do the following; how can I do the same from HDFS?
import glob
import pandas as pd

path = r'/my_path'
allFiles = glob.glob(path + "/*.csv")

df_list = []
for file_ in allFiles:
    df = pd.read_csv(file_, index_col=None, header=0, sep=';')
    df_list.append(df)
I think subprocess.Popen does the trick, but how do I extract only the filename?
import subprocess

p = subprocess.Popen("hdfs dfs -ls /my_path/",
                     shell=True,
                     stdout=subprocess.PIPE,
                     stderr=subprocess.STDOUT)

for line in p.stdout.readlines():
    print(line)
The output looks like this:
b'Found 32 items\n'
b'-rw------- 3 user hdfs 42202621 2019-01-21 10:05 /my_path/file1.csv\n'
b'-rw------- 3 user hdfs 99320020 2019-01-21 10:05 /my_path/file2.csv\n'
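(One way to extract just the file paths from that output is to split each line on whitespace and keep the last column; a minimal sketch, assuming the eight-column line format shown above, where the "Found 32 items" header has fewer columns and is skipped by the length check:)
import subprocess

out = subprocess.check_output("hdfs dfs -ls /my_path/", shell=True)

file_names = []
for line in out.splitlines():
    fields = line.decode().strip().split()
    # a file line has 8 whitespace-separated columns; the last one is the full path
    if len(fields) == 8 and fields[-1].endswith('.csv'):
        file_names.append(fields[-1])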
Disclaimer: This will be long and tedious. But given the circumstances, I'll try to make it as general and reproducible as possible.
Given the requirement of no external libraries (except for pandas?), there isn't much of a choice. I suggest utilizing WebHDFS as much as possible.
AFAIK, an installation of HDFS includes, by default, an installation of WebHDFS. The following solution relies heavily on WebHDFS.
To begin with, you must be aware of WebHDFS URLs. WebHDFS is installed on the HDFS Namenode(s), and the default port is 50070.
Therefore, we start with http://[namenode_ip]:50070/webhdfs/v1/, where /webhdfs/v1/ is the common prefix of every request.
For the sake of example, let's assume it is http://192.168.10.1:50070/webhdfs/v1.
Ordinarily, one can use curl to list the contents of an HDFS directory. For a detailed explanation, refer to WebHDFS REST API: List a Directory.
If you were to use curl, the following provides the FileStatuses of all the files inside a given directory.
curl "http://192.168.10.1:50070/webhdfs/v1/<PATH>?op=LISTSTATUS"
^^^^^^^^^^^^ ^^^^^ ^^^^ ^^^^^^^^^^^^^
Namenode IP Port Path Operation
As mentioned, this returns the FileStatuses in a JSON object:
{
    "FileStatuses":
    {
        "FileStatus":
        [
            {
                "accessTime"      : 1320171722771,
                "blockSize"       : 33554432,
                "group"           : "supergroup",
                "length"          : 24930,
                "modificationTime": 1320171722771,
                "owner"           : "webuser",
                "pathSuffix"      : "a.patch",
                "permission"      : "644",
                "replication"     : 1,
                "type"            : "FILE"
            },
            {
                "accessTime"      : 0,
                "blockSize"       : 0,
                "group"           : "supergroup",
                "length"          : 0,
                "modificationTime": 1320895981256,
                "owner"           : "szetszwo",
                "pathSuffix"      : "bar",
                "permission"      : "711",
                "replication"     : 0,
                "type"            : "DIRECTORY"
            },
            ...
        ]
    }
}
The same result can be achieved in Python with the requests package:
import requests

my_path = '/my_path/'
# no slash after v1, because my_path already starts with '/'
curl = requests.get('http://192.168.10.1:50070/webhdfs/v1%s?op=LISTSTATUS' % my_path)
As shown above, the actual status of each file sits two levels down in the result JSON. In other words, to get the FileStatus of each file:
curl.json()['FileStatuses']['FileStatus']
[
    {
        "accessTime"      : 1320171722771,
        "blockSize"       : 33554432,
        "group"           : "supergroup",
        "length"          : 24930,
        "modificationTime": 1320171722771,
        "owner"           : "webuser",
        "pathSuffix"      : "a.patch",
        "permission"      : "644",
        "replication"     : 1,
        "type"            : "FILE"
    },
    {
        "accessTime"      : 0,
        "blockSize"       : 0,
        "group"           : "supergroup",
        "length"          : 0,
        "modificationTime": 1320895981256,
        "owner"           : "szetszwo",
        "pathSuffix"      : "bar",
        "permission"      : "711",
        "replication"     : 0,
        "type"            : "DIRECTORY"
    },
    ...
]
Since you now have all the information you need, all that's left to do is parse it.
import os

file_paths = []
for file_status in curl.json()['FileStatuses']['FileStatus']:
    file_name = file_status['pathSuffix']
    # this is the file name within the queried directory
    if file_name.endswith('.csv'):
        # the if statement is only needed if the directory contains unwanted (i.e. non-csv) files
        file_paths.append(os.path.join(my_path, file_name))
        # os.path.join ensures the result is an absolute path
file_paths
['/my_path/file1.csv',
'/my_path/file2.csv',
...]
Now that you know the paths of the files and the WebHDFS links, pandas.read_csv can handle the rest of the work.
import pandas as pd

dfs = []
web_url = "http://192.168.10.1:50070/webhdfs/v1%s?op=OPEN"
#                                                ^^^^^^^^
#                                                Operation is now OPEN
# note: no slash after v1, because each file_path already begins with '/'
for file_path in file_paths:
    file_url = web_url % file_path
    # http://192.168.10.1:50070/webhdfs/v1/my_path/file1.csv?op=OPEN
    dfs.append(pd.read_csv(file_url))
And there you go: all the .csvs are imported and assigned to dfs.
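If, like df_list in your original snippet, you ultimately want a single combined DataFrame, the list can be concatenated:
df = pd.concat(dfs, ignore_index=True)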
If your HDFS is configured for HA (High Availability), there will be multiple namenodes, and your namenode_ip must be set accordingly: it must be the IP of the active node.
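A minimal sketch of picking the active node by probing, assuming two hypothetical candidate IPs and assuming that a standby namenode does not answer a LISTSTATUS request with HTTP 200 (it typically rejects read operations):
import requests

# hypothetical candidates; substitute the IPs of your own namenodes
candidates = ['192.168.10.1', '192.168.10.2']

namenode_ip = None
for ip in candidates:
    try:
        r = requests.get('http://%s:50070/webhdfs/v1/?op=LISTSTATUS' % ip,
                         timeout=5)
        if r.status_code == 200:
            namenode_ip = ip  # this node answered the read, so treat it as active
            break
    except requests.exceptions.ConnectionError:
        pass  # node unreachable; try the next candidate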