RegEx to exclude the directory, capture the comma-separated parts of the filename, and exclude "(number)" and the extension

I've been trying for the last three days (yeah) to make an image/short video tagging system for my own use, but this has proven a challenge beyond me.

These are the strings:

d:\images\tagging 1\GIFs\kung fu panda, fight.webm
d:\images\tagging 1\GIFs\kung fu panda, fight (2).webm
d:\images\tagging 1\GIFs\kung fu panda 2, fight.webm
d:\images\tagging 1\GIFs\kung fu panda 2, fight (2).webm
d:\images\tagging 1\GIFs\pulp fiction, samuel l. jackson, angry, funny.webm

I have four patterns that I've tried modifying to achieve what I want, with no success:

(?<=d:\\images\\tagging\s1\\GIFs\\)([\w\s])+

([a-z0-9]\s?)+

(?<=\\)[^\\]*?(?=\..*$)

[^\\/:*?"<>|\r\n]+$

1 Almost there, but it doesn't extend past the first comma.

2 This does almost everything, but I haven't found a way to exclude the directory, the (#) and the extension.

3 Taken from the internet; it captures up to the "l." and stops there, treats the filename as a whole so I can't split on commas as I want, and captures the (#).

4 Taken from RegexBuddy (yes, I actually bought it in my desperation); it captures the (#) and the extension.
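To make the first and fourth observations concrete, here is a small sketch in Python (the language the answers below use). The variable names are made up for illustration; only the two patterns come from the list above.

```python
import re

path = r'd:\images\tagging 1\GIFs\kung fu panda, fight (2).webm'

# Attempt 1: the character class [\w\s] contains no comma,
# so the match ends right before the first comma.
attempt_1 = re.search(r'(?<=d:\\images\\tagging\s1\\GIFs\\)([\w\s])+', path)
first = attempt_1.group(0)

# Attempt 4: matches every character that is legal in a Windows file name,
# up to the end of the string, so "(2)" and the extension are included.
attempt_4 = re.search(r'[^\\/:*?"<>|\r\n]+$', path)
last = attempt_4.group(0)

print(repr(first))
print(repr(last))
```

Neither pattern on its own separates the tags; both only decide where the match starts and ends.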

@timgeb: The intention is to get the parts of the filename without the commas, the (#) and the extension, so:

"kung fu panda" "fight"
"kung fu panda" "fight"
"kung fu panda 2" "fight"
"kung fu panda 2" "fight"
"pulp fiction" "samuel l. jackson" "angry" "funny"

Your question isn't very clear, but I think you want to parse filenames. If that's the case, I wouldn't recommend using re as your primary tool.

Instead, have a look at os.path:

import os.path  # On a non-Windows system, use `import ntpath` and call ntpath.split / ntpath.splitext instead

# Raw string, so that sequences such as \t are not treated as escapes
directory, file_name = os.path.split(r'd:\images\tagging 1\GIFs\kung fu panda, fight (2).webm')
# directory = 'd:\images\tagging 1\GIFs'
# file_name = 'kung fu panda, fight (2).webm'

root, ext = os.path.splitext(file_name)
# root = 'kung fu panda, fight (2)'
# ext = '.webm'

Now you have a much simpler problem: removing the numbers in parentheses.
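One possible way to finish the job from here is sketched below. It assumes the counter always appears as "(digits)" at the very end of the name, and it uses ntpath so that it also runs on non-Windows systems; the function name tags_from_path is made up for this example.

```python
import ntpath
import re


def tags_from_path(path):
    """Return the comma-separated tags encoded in a Windows file path."""
    file_name = ntpath.basename(path)
    root, _ext = ntpath.splitext(file_name)
    # Drop a trailing duplicate counter such as " (2)" (assumed format)
    root = re.sub(r'\s*\(\d+\)\s*$', '', root)
    # Split on commas and discard surrounding whitespace and empty pieces
    return [tag.strip() for tag in root.split(',') if tag.strip()]


print(tags_from_path(r'd:\images\tagging 1\GIFs\kung fu panda, fight (2).webm'))
```

Because splitext only cuts at the last dot, a name like "samuel l. jackson" keeps its inner dot.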

Get the basename, replace the integers in parentheses and the extension with the empty string, split on commas, and strip off the whitespace.

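from ntpath import basename
import re

# s is the full path; a list comprehension is used because map() returns an iterator in Python 3
[tag.strip() for tag in re.sub(r'\(\d+\)|\.\w+$', '', basename(s)).split(',')]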

Demo:

>>> s = r'd:\images\tagging 1\GIFs\kung fu panda, fight.webm'
>>> [tag.strip() for tag in re.sub(r'\(\d+\)|\.\w+$', '', basename(s)).split(',')]
['kung fu panda', 'fight']
>>> s = r'd:\images\tagging 1\GIFs\kung fu panda, fight (2).webm'
>>> [tag.strip() for tag in re.sub(r'\(\d+\)|\.\w+$', '', basename(s)).split(',')]
['kung fu panda', 'fight']
>>> s = r'd:\images\tagging 1\GIFs\kung fu panda 2, fight.webm'
>>> [tag.strip() for tag in re.sub(r'\(\d+\)|\.\w+$', '', basename(s)).split(',')]
['kung fu panda 2', 'fight']
>>> s = r'd:\images\tagging 1\GIFs\kung fu panda 2, fight (2).webm'
>>> [tag.strip() for tag in re.sub(r'\(\d+\)|\.\w+$', '', basename(s)).split(',')]
['kung fu panda 2', 'fight']
>>> s = r'd:\images\tagging 1\GIFs\pulp fiction, samuel l. jackson, angry, funny.webm'
>>> [tag.strip() for tag in re.sub(r'\(\d+\)|\.\w+$', '', basename(s)).split(',')]
['pulp fiction', 'samuel l. jackson', 'angry', 'funny']

If I understand you correctly, you want the last part of each path (for example kung fu panda, fight.webm), that is, everything after 1\GIFs\. If you add more sample strings, I can adapt the code for you. This code just extracts the tags and builds a regular list for each line.

import re

# Raw string, so that the backslashes in the paths are kept literally
s = r"""d:\images\tagging 1\GIFs\kung fu panda, fight.webm
d:\images\tagging 1\GIFs\kung fu panda, fight (2).webm
d:\images\tagging 1\GIFs\kung fu panda 2, fight.webm
d:\images\tagging 1\GIFs\kung fu panda 2, fight (2).webm
d:\images\tagging 1\GIFs\pulp fiction, samuel l. jackson, angry, funny.webm"""

lines = s.split('\n')  # Just generate a list of lines
for t in lines:
    # Capture everything that follows "1\GIFs\" up to the end of the line
    data = re.search(r'1\\GIFs\\(.+$)', t)
    print(data.group(1).split(','))

Output:

['kung fu panda', ' fight.webm']
['kung fu panda', ' fight (2).webm']
['kung fu panda 2', ' fight.webm']
['kung fu panda 2', ' fight (2).webm']
['pulp fiction', ' samuel l. jackson', ' angry', ' funny.webm']

The expression 1\\GIFs\\(.+$) captures everything that comes after 1\GIFs\, i.e. the file name that holds the tags.
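The output above still carries the leading spaces, the "(2)" counter and the extension. A possible extension of the same idea is sketched below; the adjusted pattern and the names TAGS_RE and extract_tags are assumptions made for illustration, not part of the code above.

```python
import re

# Hypothetical variant of the pattern: the lazy group (.+?) stops before an
# optional "(number)" counter and the final extension.
TAGS_RE = re.compile(r'1\\GIFs\\(.+?)(?:\s*\(\d+\))?\.\w+$')


def extract_tags(path):
    """Return the list of tags, or an empty list if the path does not match."""
    match = TAGS_RE.search(path)
    if match is None:
        return []
    # Split on commas and swallow the whitespace around them
    return re.split(r'\s*,\s*', match.group(1))


print(extract_tags(r'd:\images\tagging 1\GIFs\kung fu panda 2, fight (2).webm'))
```

Since the extension is anchored with $, the dot in "samuel l. jackson" is not mistaken for the start of the extension.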

