
Python - Regex to Get directory name from HDFS

I'm trying to extract a folder name from the result of a subprocess command. The result is:

Found 1 items
drwxr-xr-x   - user user          0 2017-05-04 17:19 /user/oozie/share/lib/lib_20170406204755

I want to extract lib_20170406204755. I was able to do it using:

process = subprocess.check_output(['hdfs','dfs','-ls','/user/oozie/share/lib'])
print process.split(' ')[-1].rstrip().split('/')[-1]

The folder name is always lib_timestamp.

How can I do this using regex?

No regex is required here; you may as well use split():

string = "drwxr-xr-x   - user user          0 2017-05-04 17:19 /user/oozie/share/lib/lib_20170406204755"

folder = string.split('/')[-1]
print(folder)
# lib_20170406204755

But if you insist:

[^/]+$


In Python:

import re

string = "drwxr-xr-x   - user user          0 2017-05-04 17:19 /user/oozie/share/lib/lib_20170406204755"
rx = re.compile(r'[^/]+$')

folder = rx.search(string).group(0)
print(folder)
# lib_20170406204755

See a demo on regex101.com.
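The pattern above is applied to a single line. As a rough sketch of how it might be adapted to the whole multi-line `hdfs dfs -ls` output, the variant below anchors on the last `/` of each line using `re.MULTILINE`. The `sample_output` string and the `last_components` helper are hypothetical names used only for illustration, and the pattern assumes that paths contain no whitespace.

```python
import re

# Hypothetical captured output of `hdfs dfs -ls`, including the header line.
sample_output = (
    "Found 2 items\n"
    "drwxr-xr-x   - user user          0 2017-05-04 17:19 /user/oozie/share/lib/lib_20170406204755\n"
    "drwxr-xr-x   - user user          0 2017-05-04 17:19 /user/oozie/share/lib/lib_20110523212454\n"
)

# With re.MULTILINE, `$` matches at the end of every line, so the pattern
# captures whatever follows the last slash on each line.
LAST_COMPONENT = re.compile(r'/([^/\s]+)$', re.MULTILINE)


def last_components(text):
    """Return the final path component of every line that ends in a path."""
    return LAST_COMPONENT.findall(text)


folders = last_components(sample_output)
print(folders)
```

The header line contains no slash, so it produces no match and does not need to be filtered out separately.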

This should do the trick:

(?<=/)(lib_\d+)

This regex searches for something that starts with lib_ followed by one or more digits, which should be enough as long as no similarly named folders appear in the result.

(?<=/) is a lookbehind that makes sure the folder name is preceded by a /.

Example
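A minimal sketch of the `lib_` pattern in Python, assuming a lookbehind `(?<=/)` is used to require the preceding slash and `\d+` to require at least one digit. The `listing` string is made-up sample data in the format shown in the question.

```python
import re

# Made-up sample data in the same layout as the `hdfs dfs -ls` output.
listing = (
    "drwxr-xr-x   - user user          0 2017-05-04 17:19 /user/oozie/share/lib/lib_20170406204755\n"
    "drwxr-xr-x   - user user          0 2017-05-04 17:19 /user/oozie/share/lib/lib_20110523212454\n"
)

# (?<=/) is a lookbehind: the match must be directly preceded by a slash.
# lib_\d+ then matches the literal prefix followed by one or more digits.
pattern = re.compile(r'(?<=/)(lib_\d+)')

matches = pattern.findall(listing)
print(matches)
```

The parent directory `lib/` is not matched, because it has no underscore and digits after `lib`.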

A clean approach would be to use the os.path module to pick apart paths.

import os
import subprocess

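# universal_newlines=True makes check_output return str instead of bytes
output = subprocess.check_output(['hdfs', 'dfs', '-ls', '/user/oozie/share/lib'],
                                 universal_newlines=True)

# there are 8 columns in the output, i.e. we need a maximum of 7 splits per line
output_table = [line.split(maxsplit=7) for line in output.splitlines()]

# we are interested in the basename of that path; the "Found N items"
# header line has fewer than 8 columns and is skipped
filenames = [os.path.basename(row[7]) for row in output_table if len(row) == 8]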

With this test input:

drwxr-xr-x   - user user          0 2017-05-04 17:19 /user/oozie/share/lib/lib_20170406204755
drwxr-xr-x   - user user          0 2017-05-04 17:19 /user/oozie/share/lib/lib_20110523212454

filenames will be ['lib_20170406204755', 'lib_20110523212454'].
