
Using Python how to get list of all files in a HDFS folder?

I would like to return a listing of all files in an HDFS folder using Python, preferably as a Pandas data frame. I have looked at subprocess.Popen, and that may be the best way, but if so, is there a way to parse out all the noise and return only the file names?

The hdfs module is out, as I can't get the config options to work. I tried subprocess.Popen, but it returns so much extraneous stuff.
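Since the question mentions subprocess.Popen returning noisy output, here is a minimal sketch of one way to filter it down to just the paths. It assumes the typical output format of `hdfs dfs -ls` (a "Found N items" header followed by eight-column listing rows with the path last); the function names are illustrative, not from any library.

```python
import subprocess

def parse_hdfs_ls(output):
    """Keep only listing rows (8 whitespace-separated columns) and take the path."""
    names = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 8:  # skips the "Found N items" header line
            names.append(fields[7])
    return names

def hdfs_ls(path):
    """Run `hdfs dfs -ls` and return only the file/directory paths."""
    proc = subprocess.run(
        ["hdfs", "dfs", "-ls", path],
        capture_output=True, text=True, check=True,
    )
    return parse_hdfs_ls(proc.stdout)
```

The resulting list can then be wrapped in a data frame with something like `pd.DataFrame(names, columns=["file"])`.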

Once you've named the path:

from pathlib import Path

folder = Path("/tmp/favorite_folder/")

then it's just a matter of globbing some pattern, like folder.glob("*.csv"). Use a wildcard to get all names at a single level:

print(list(folder.glob("*")))  # glob() returns a generator, so materialize it before printing

To recurse through all levels, you might wish to rely on os.walk().
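As a minimal sketch of the os.walk() approach on a local path (the helper name is illustrative):

```python
import os

def list_all_files(root):
    """Recursively collect every file path beneath `root` using os.walk()."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)
```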

https://docs.python.org/3/library/os.html#os.walk

Or, use a recursive glob pattern: folder.glob("**/*.csv")
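The recursive pattern can be sketched the same way; "**/" matches zero or more directory levels, so top-level files are included too (the function name here is illustrative):

```python
from pathlib import Path

def find_csvs(root):
    """Return every *.csv under `root`, at any depth, via a recursive glob."""
    return sorted(Path(root).glob("**/*.csv"))
```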
