
Get a list of file names from HDFS using Python

Hadoop noob here.

I've searched for some tutorials on getting started with Hadoop and Python without much success. I don't need to do any work with mappers and reducers yet; it's more of an access issue.

As part of the Hadoop cluster, there are a bunch of .dat files on HDFS.

In order to access those files on my client (local computer) using Python,

what do I need to have on my computer?

How do I query for filenames on HDFS?

Any links would be helpful too.

As far as I've been able to tell there is no out-of-the-box solution for this, and most answers I've found have resorted to using calls to the hdfs command. I'm running on Linux, and have the same challenge. I've found the sh package to be useful. This handles running o/s commands for you and managing stdin/out/err.

See here for more info on it: https://amoffat.github.io/sh/

Not the neatest solution, but it's one line (ish) and uses standard packages.

Here's my cut-down code to grab an HDFS directory listing. It will list files and folders alike, so you might need to modify it if you need to differentiate between them.

import sh
hdfsdir = '/somedirectory'
filelist = [ line.rsplit(None,1)[-1] for line in sh.hdfs('dfs','-ls',hdfsdir).split('\n') if len(line.rsplit(None,1))][1:]

My output - In this case these are all directories:

[u'/somedirectory/transaction_basket_fct/date_id=2015-01-01',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-02',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-03',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-04',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-05',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-06',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-07',
 u'/somedirectory/transaction_basket_fct/date_id=2015-01-08']

Let's break it down:

To run the hdfs dfs -ls /somedirectory command we can use the sh package like this:

import sh
sh.hdfs('dfs','-ls',hdfsdir)

sh allows you to call o/s commands seamlessly as if they were functions on the module. You pass command parameters as function parameters. Really neat.

For me this returns something like:

Found 366 items
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-01
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-02
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-03
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-04
drwxrwx---+  - impala hive          0 2016-05-10 13:52 /somedirectory/transaction_basket_fct/date_id=2015-01-05

Split that into lines based on newline characters using .split('\n').

Obtain the last 'word' in each line using line.rsplit(None,1)[-1].

To prevent issues with empty elements in the list, use if len(line.rsplit(None,1)).

Finally, remove the first element of the list (the Found 366 items line) using [1:].
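If you prefer the same logic spelled out step by step, an equivalent (untested) expansion of that one-liner might look like this:

import sh

hdfsdir = '/somedirectory'
filelist = []
for line in sh.hdfs('dfs', '-ls', hdfsdir).split('\n'):  # one line per entry
    words = line.rsplit(None, 1)       # split off the last 'word' (the path)
    if len(words):                     # skip empty lines
        filelist.append(words[-1])
filelist = filelist[1:]                # drop the 'Found 366 items' header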

what do I need to have on my computer?

You need Hadoop installed and running, and of course, Python.
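For example, a quick (hypothetical) sanity check from Python that the HDFS client is actually available on your machine:

import shutil

# Minimal sketch: make sure the hdfs command-line client is on the PATH
# before trying to call it from Python.
if shutil.which('hdfs') is None:
    raise RuntimeError('hdfs client not found; install and configure the Hadoop client first')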

How do I query for filenames on HDFS?

You can try something like this. I haven't tested the code, so don't rely on it blindly.

from subprocess import Popen, PIPE

process = Popen('hdfs dfs -cat filename.dat',shell=True,stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()

# Check the return code (and std_err) before trusting the output
if process.returncode == 0:
    # everything is OK, do whatever with std_out
    print(std_out.decode())
else:
    # handle the error case, e.g. log or raise
    print('hdfs command failed:', std_err.decode())

You can also look at Pydoop which is a Python API for Hadoop.
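A minimal, untested sketch of what the Pydoop route might look like (assuming Pydoop is installed and configured to reach your cluster; /somedirectory is a placeholder):

import pydoop.hdfs as hdfs

# List the entries of an HDFS directory as a Python list of paths.
print(hdfs.ls('/somedirectory'))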

Although the subprocess example above includes shell=True, you can try running without it, since shell=True is a security risk. See: Why you shouldn't use shell=True?
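For reference, the same call without shell=True passes the command as a list of arguments, something like this (filename.dat is the placeholder from the example above):

from subprocess import Popen, PIPE

# Same call, with the command passed as a list so no shell is involved.
process = Popen(['hdfs', 'dfs', '-cat', 'filename.dat'], stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()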

for the "query for filenames on HDFS" using just raw subprocess library for python 3:

from subprocess import Popen, PIPE
hdfs_path = '/path/to/the/designated/folder'
process = Popen(f'hdfs dfs -ls -h {hdfs_path}', shell=True, stdout=PIPE, stderr=PIPE)
std_out, std_err = process.communicate()
list_of_file_names = [fn.split(' ')[-1].split('/')[-1] for fn in std_out.decode().split('\n')[1:]][:-1]
list_of_file_names_with_full_address = [fn.split(' ')[-1] for fn in std_out.decode().split('\n')[1:]][:-1]

You should have login access to a node in the cluster. Let the cluster administrator pick the node, set up the account, and tell you how to access the node securely. If you are the administrator, let me know whether the cluster is local or remote, and if remote, whether it is hosted on your computer, inside a corporation, or on a third-party cloud (and if so, whose), and I can provide more relevant information.

To query file names in HDFS, login to a cluster node and run hadoop fs -ls [path] . Path is optional and if not provided, the files in your home directory are listed. If -R is provided as an option, then it lists all the files in path recursively. There are additional options for this command. For more information about this and other Hadoop file system shell commands see http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html .
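As a sketch, the same command run from Python on a cluster node (assumes the hadoop CLI is on the PATH; /somedirectory is a placeholder and -R is the recursive option mentioned above):

from subprocess import run, PIPE

result = run(['hadoop', 'fs', '-ls', '-R', '/somedirectory'], stdout=PIPE, stderr=PIPE)
print(result.stdout.decode())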

An easy way to query HDFS file names in Python is to use esutil.hdfs.ls(hdfs_url='', recurse=False, full=False) , which executes hadoop fs -ls hdfs_url in a subprocess, plus it has functions for a number of other Hadoop file system shell commands (see the source at http://code.google.com/p/esutil/source/browse/trunk/esutil/hdfs.py ). esutil can be installed with pip install esutil . It is on PyPI at https://pypi.python.org/pypi/esutil , documentation for it is at http://code.google.com/p/esutil/ and its GitHub site is https://github.com/esheldon/esutil .
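A short usage sketch based on that signature (untested; assumes esutil is installed and the hadoop CLI is on the PATH):

import esutil.hdfs

# Runs `hadoop fs -ls /somedirectory` in a subprocess and returns the listing
# ('/somedirectory' is a placeholder; see the signature above for recurse/full).
entries = esutil.hdfs.ls('/somedirectory')
print(entries)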

As JGC stated, the most straightforward thing you could do is start by logging onto (via ssh) one of the nodes (a server that is participating in a Hadoop cluster) and verifying that you have the correct access controls and privileges to:

  • List your home directory using the HDFS client ie hdfs dfs -ls
  • List the directory of interest that lives in HDFS ie hdfs dfs -ls <absolute or relative path to HDFS directory>

Then, in Python, you should use subprocesses and the HDFS client to access the paths of interest, and use the -C flag to exclude unnecessary metadata (to avoid doing ugly post-processing later).

ie Popen(['hdfs', 'dfs', '-ls', '-C', dirname])

Afterwards, split the output on new lines and then you will have your list of paths.

Here's an example along with logging and error handling (including for when the directory/file doesn't exist):

from subprocess import Popen, PIPE
import logging
logger = logging.getLogger(__name__)

FAILED_TO_LIST_DIRECTORY_MSG = 'No such file or directory'

class HdfsException(Exception):
    pass

def hdfs_ls(dirname):
    """Returns list of HDFS directory entries."""
    logger.info('Listing HDFS directory ' + dirname)
    proc = Popen(['hdfs', 'dfs', '-ls', '-C', dirname], stdout=PIPE, stderr=PIPE)
    (out, err) = proc.communicate()
    out, err = out.decode(), err.decode()  # communicate() returns bytes on Python 3
    if out:
        logger.debug('stdout:\n' + out)
    if proc.returncode != 0:
        errmsg = 'Failed to list HDFS directory "' + dirname + '", return code ' + str(proc.returncode)
        logger.error(errmsg)
        logger.error(err)
        if FAILED_TO_LIST_DIRECTORY_MSG not in err:
            raise HdfsException(errmsg)
        return []
    elif err:
        logger.debug('stderr:\n' + err)
    return out.splitlines()

# dat_files will contain a proper Python list of the paths to the '.dat' files you mentioned above.
dat_files = hdfs_ls('/hdfs-dir-with-dat-files/')

The answer by @JGC was a big help. I wanted a version that was a more transparent function instead of a harder-to-read one-liner; I also swapped the string parsing to use a regex so that it is both more transparent and less brittle to changes in the hdfs output format. This version, following the same general approach as JGC's, looks like this:

import re
import sh
def get_hdfs_files(directory:str):
    '''
    Params: 
    directory: an HDFS directory e.g. /my/hdfs/location
    '''
    output = sh.hdfs('dfs','-ls',directory).split('\n')
    files = []
    for line in output:
        match = re.search(f'({re.escape(directory)}.*$)', line)
        if match:
            files.append(match.group(0))

    return files
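For the original question (the .dat files), you could then filter the returned paths, e.g. something like:

# Hypothetical usage: keep only the .dat files under the given HDFS directory.
dat_files = [path for path in get_hdfs_files('/my/hdfs/location') if path.endswith('.dat')]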
