RegEx to exclude the directory, capture the comma-separated parts of a filename, and exclude "(number)" and the extension

Question

I've been trying for the last three days (yeah) to make an image/short video tagging system for my own use, but this has proven a challenge beyond me.

These are the strings:

d:\images\tagging 1\GIFs\kung fu panda, fight.webm
d:\images\tagging 1\GIFs\kung fu panda, fight (2).webm
d:\images\tagging 1\GIFs\kung fu panda 2, fight.webm
d:\images\tagging 1\GIFs\kung fu panda 2, fight (2).webm
d:\images\tagging 1\GIFs\pulp fiction, samuel l. jackson, angry, funny.webm

I have four expressions that I've tried modifying to achieve what I want, with no success:

(?<=d:\\images\\tagging\s1\\GIFs\\)([\w\s])+

([a-z0-9]\s?)+

(?<=\\)[^\\]*?(?=\..*$)

[^\\/:*?"<>|\r\n]+$

1 Almost there, but it doesn't extend past the first comma.

2 This does almost everything, but I haven't found a way to exclude the directory, the (#) and the extension.

3 Taken from the internet. It captures up to the "l." and stops there, takes the whole filename so I can't split on commas as I want, and captures the (#).

4 Taken from RegexBuddy (yes, I actually bought it in my desperation). It captures the (#) and the extension.
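The behaviour described in notes 1, 3 and 4 can be reproduced with a short script. This is only an illustrative sketch: it runs three of the expressions above against the last sample string, and the variable names are made up for the example.

```python
import re

path = r'd:\images\tagging 1\GIFs\pulp fiction, samuel l. jackson, angry, funny.webm'

# Attempt 1: the class [\w\s] contains no comma, so the match ends at the first comma.
attempt_1 = re.search(r'(?<=d:\\images\\tagging\s1\\GIFs\\)([\w\s])+', path).group(0)

# Attempt 3: the lazy [^\\]*? stops before the first dot in the file name (the one in "l.").
attempt_3 = re.search(r'(?<=\\)[^\\]*?(?=\..*$)', path).group(0)

# Attempt 4: matches everything after the last backslash, extension included.
attempt_4 = re.search(r'[^\\/:*?"<>|\r\n]+$', path).group(0)

print(attempt_1)  # pulp fiction
print(attempt_3)  # pulp fiction, samuel l
print(attempt_4)  # pulp fiction, samuel l. jackson, angry, funny.webm
```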

@timgeb

The intention is to get the filenames without the commas, the (#) and the extension, so:

"kung fu panda" "fight"
"kung fu panda" "fight"
"kung fu panda 2" "fight"
"kung fu panda 2" "fight"
"pulp fiction" "samuel l. jackson" "angry" "funny"

Answer 1

Your question isn't very clear, but I think you want to parse filenames. If that's the case, I wouldn't recommend using re as your primary tool.

Instead, have a look at os.path:

import os.path  # Or `import ntpath` for Windows paths on non-Windows systems

# Raw string, so that sequences such as \t are not read as escape characters
directory, file_name = os.path.split(r'd:\images\tagging 1\GIFs\kung fu panda, fight (2).webm')
# directory = r'd:\images\tagging 1\GIFs'
# file_name = 'kung fu panda, fight (2).webm'

root, ext = os.path.splitext(file_name)
# root = 'kung fu panda, fight (2)'
# ext = '.webm'

Now you have a much simpler problem: removing the numbers in parentheses.
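One way that remaining step might look is sketched below. The function name tags_from_path is hypothetical, the example uses ntpath so that it behaves the same on any operating system, and it assumes the counter always sits at the very end of the name, as in the sample strings.

```python
import ntpath
import re

def tags_from_path(path):
    """Return the comma-separated tags in a Windows path's file name."""
    file_name = ntpath.basename(path)         # drop the directory
    root, _ext = ntpath.splitext(file_name)   # drop the extension
    root = re.sub(r'\s*\(\d+\)$', '', root)   # drop a trailing " (2)"-style counter
    return [tag.strip() for tag in root.split(',')]

print(tags_from_path(r'd:\images\tagging 1\GIFs\kung fu panda, fight (2).webm'))
# ['kung fu panda', 'fight']
```

Because splitext only splits at the last dot, the dot in "samuel l." is left alone.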

Answer 2

Get the basename, replace the integer in parentheses and the extension with the empty string, split on commas, and strip the whitespace from each part.

from ntpath import basename
import re
list(map(str.strip, re.sub(r'\(\d+\)|\.\w+$', '', basename(s)).split(',')))

Demo:

>>> s = r'd:\images\tagging 1\GIFs\kung fu panda, fight.webm'
>>> list(map(str.strip, re.sub(r'\(\d+\)|\.\w+$', '', basename(s)).split(',')))
['kung fu panda', 'fight']
>>> s = r'd:\images\tagging 1\GIFs\kung fu panda, fight (2).webm'
>>> list(map(str.strip, re.sub(r'\(\d+\)|\.\w+$', '', basename(s)).split(',')))
['kung fu panda', 'fight']
>>> s = r'd:\images\tagging 1\GIFs\kung fu panda 2, fight.webm'
>>> list(map(str.strip, re.sub(r'\(\d+\)|\.\w+$', '', basename(s)).split(',')))
['kung fu panda 2', 'fight']
>>> s = r'd:\images\tagging 1\GIFs\kung fu panda 2, fight (2).webm'
>>> list(map(str.strip, re.sub(r'\(\d+\)|\.\w+$', '', basename(s)).split(',')))
['kung fu panda 2', 'fight']
>>> s = r'd:\images\tagging 1\GIFs\pulp fiction, samuel l. jackson, angry, funny.webm'
>>> list(map(str.strip, re.sub(r'\(\d+\)|\.\w+$', '', basename(s)).split(',')))
['pulp fiction', 'samuel l. jackson', 'angry', 'funny']
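Since the stated goal is a tagging system, the same substitution could feed a small tag-to-files index. The following is only a sketch under that assumption: tags_of and build_tag_index are hypothetical names, and the two paths are taken from the question.

```python
import re
from collections import defaultdict
from ntpath import basename

def tags_of(path):
    # Same substitution as above: drop "(n)" and the extension, then split on commas.
    stem = re.sub(r'\(\d+\)|\.\w+$', '', basename(path))
    return [tag.strip() for tag in stem.split(',')]

def build_tag_index(paths):
    # Map every tag to the list of paths whose file name contains it.
    index = defaultdict(list)
    for path in paths:
        for tag in tags_of(path):
            index[tag].append(path)
    return dict(index)

paths = [
    r'd:\images\tagging 1\GIFs\kung fu panda, fight.webm',
    r'd:\images\tagging 1\GIFs\kung fu panda 2, fight (2).webm',
]
index = build_tag_index(paths)
print(sorted(index))        # ['fight', 'kung fu panda', 'kung fu panda 2']
print(len(index['fight']))  # 2
```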

Answer 3

If I understand you correctly, you want the trailing tags (kung fu panda, fight.webm) that come after 1\\GIFs\\. If you add more sample strings, I can generalize the code for you. This code just extracts the tags and generates a regular list.

import re

s = r"""d:\images\tagging 1\GIFs\kung fu panda, fight.webm
d:\images\tagging 1\GIFs\kung fu panda, fight (2).webm
d:\images\tagging 1\GIFs\kung fu panda 2, fight.webm
d:\images\tagging 1\GIFs\kung fu panda 2, fight (2).webm
d:\images\tagging 1\GIFs\pulp fiction, samuel l. jackson, angry, funny.webm"""

lines = s.split('\n')  # Just generate a list of lines
for t in lines:
    # Capture everything that follows "1\GIFs\" up to the end of the line
    data = re.search(r'1\\GIFs\\(.+$)', t)
    print(data.group(1).split(','))

Output:

['kung fu panda', ' fight.webm']
['kung fu panda', ' fight (2).webm']
['kung fu panda 2', ' fight.webm']
['kung fu panda 2', ' fight (2).webm']
['pulp fiction', ' samuel l. jackson', ' angry', ' funny.webm']

The expression 1\\GIFs\\(.+$) captures everything that follows 1\GIFs\, that is, the trailing tags. As the output shows, the last tag still carries the (#) and the extension, and the tags keep their leading space.
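A variation on this idea, offered only as a sketch, avoids hard-coding the directory and leaves out the (#) and the extension in the same pass. The names STEM and extract_tags are hypothetical, and the pattern assumes every file name ends in an extension made of word characters.

```python
import re

s = r"""d:\images\tagging 1\GIFs\kung fu panda, fight.webm
d:\images\tagging 1\GIFs\kung fu panda, fight (2).webm
d:\images\tagging 1\GIFs\kung fu panda 2, fight.webm
d:\images\tagging 1\GIFs\kung fu panda 2, fight (2).webm
d:\images\tagging 1\GIFs\pulp fiction, samuel l. jackson, angry, funny.webm"""

# Group 1: everything after the last backslash, stopping before an
# optional " (n)" counter and the final extension.
STEM = re.compile(r'([^\\]+?)(?:\s*\(\d+\))?\.\w+$')

def extract_tags(line):
    match = STEM.search(line)
    if match is None:
        return []
    return [tag.strip() for tag in match.group(1).split(',')]

for line in s.splitlines():
    print(extract_tags(line))
# ['kung fu panda', 'fight']
# ['kung fu panda', 'fight']
# ['kung fu panda 2', 'fight']
# ['kung fu panda 2', 'fight']
# ['pulp fiction', 'samuel l. jackson', 'angry', 'funny']
```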


Answer 1
3 2016-01-24 16:39:43

Answer 2
1 2016-01-24 20:35:47

Answer 3
0 2016-01-24 16:38:05
