RegEx: Is there a way to ignore a specific set of characters in a string and still match?

Let's say I have some music files organized (poorly) by artist name, for example:

/data/myfolder/Jay Z/some_file1.mp3
/data/myfolder/Jay-Z/some_file2.mp3
/data/myfolder/JayZ/some_file3.mp3
/data/myfolder/Destiny's Child/some_file4.mp3
/data/myfolder/Destinys Child/some_file5.mp3

I want to run some batch operations using regex matching. However, I want to ignore the special characters within the artists' names when finding my matches. I could programmatically replace special characters with Python, but I'm wondering if it's possible to do it completely with the regex pattern.

For example, the following code would only work on some_file1.mp3 and some_file4.mp3 as it is currently written:

import os
import re

artists = ["Jay Z", "Destiny's Child"]
root = "/data/myfolder/"

for filepath in os.listdir(root):
    for artist in artists:
        pattern = r"\/data\/myfolder\/{}\/.*.mp3".format(artist)
        match = re.search(pattern, filepath)

        if match:
            ...do some stuff...
           

Is there some way to modify my regex pattern from r"\/data\/myfolder\/{}\/.*.mp3".format(artist) so that it would successfully match even when there is a dash, single quote, or other specified special character within the string? Basically, I'm trying to ignore the presence of certain characters anywhere in a string when looking for a match.

First things first: for filepath in os.listdir(root) only gives you the names of the entries directly inside root (here, the artist subfolders), not the files inside them, and not full paths. You need to use os.walk:

for dirpath, dirnames, filenames in os.walk(root):
    if not dirnames:
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
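
To make the difference concrete, here is a minimal sketch that builds a throwaway folder tree shaped like the one in the question and compares the two calls. The helper name list_mp3_files and the temporary directory are assumptions made for illustration, and unlike the snippet above it does not skip folders that contain subfolders.

```python
import os
import tempfile


def list_mp3_files(root):
    """Return the sorted full paths of every .mp3 file anywhere below root."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.lower().endswith(".mp3"):
                found.append(os.path.join(dirpath, filename))
    return sorted(found)


# Build a temporary tree that mimics the layout from the question.
with tempfile.TemporaryDirectory() as root:
    for artist, name in [("Jay Z", "some_file1.mp3"), ("Jay-Z", "some_file2.mp3")]:
        os.makedirs(os.path.join(root, artist))
        open(os.path.join(root, artist, name), "w").close()

    top_level = sorted(os.listdir(root))  # names of the artist folders only
    mp3_paths = list_mp3_files(root)      # full paths to the files themselves
    relative = [os.path.relpath(p, root) for p in mp3_paths]
```

Here top_level holds only folder names, while relative holds one path per audio file, which is the kind of string the regex in the question expects to see.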

Now, if you want a regex that ignores characters of your choice inside a fixed string used as part of the pattern, one option is the fuzzy matching capability of the PyPI regex module. The idea is to remove all the ignored chars from the artists items, and then allow any number of insertions of these characters in the artist subfolder part.

See the Python code:

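import os
import regex  # third-party module from PyPI (pip install regex), not the stdlib re

artists = ["Jay Z", "Destiny's Child"]
# Prep step: strip the ignored characters from each artist name
artists = [regex.sub(r"[',. -]+", "", s) for s in artists]
root = r'/data/myfolder'

for dirpath, dirnames, filenames in os.walk(root):
    if not dirnames:  # only look at folders that have no subfolders
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            for artist in artists:
                # {i:[...]} permits insertions, restricted to the listed characters
                pattern = r"{}[\\/](?:{}){{i:[',. -]}}[\\/][^\\/]*\.mp3$".format(regex.escape(root), artist)
                match = regex.search(pattern, filepath)
                if match:
                    print(match.group())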

Note that [\\/] is used to match both Windows and Linux folder separators. I also added a space to the list of ignored chars.

The artists = [regex.sub(r"[',. -]+", "", s) for s in artists] line is the prep step that removes the ignored chars from the artist names.

The regex looks like /data/myfolder[\\/](?:DestinysChild){i:[',. -]}[\\/][^\\/]*\.mp3$ :

  • /data/myfolder - a literal root part
  • [\\/] - a / or \ char
  • (?:DestinysChild){i:[',. -]} - the DestinysChild string with any number of space, apostrophe, hyphen, dot or comma insertions
  • [\\/] - a / or \ char
  • [^\\/]* - zero or more chars other than / and \
  • \.mp3$ - .mp3 at the end of string.
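
If installing a third-party module is not an option, a similar effect can be approximated with the standard re module by placing an optional "ignored characters" class between every significant character of the artist name. This is a different technique from the fuzzy matching shown above, sketched here under the assumption that the ignored set is the same five characters; the names IGNORED, build_artist_pattern and build_path_pattern are made up for this example.

```python
import re

IGNORED = "',. -"  # assumed ignore list: apostrophe, comma, dot, space, hyphen


def build_artist_pattern(artist, ignored=IGNORED):
    """Return a regex fragment matching artist with ignored chars allowed anywhere."""
    gap = "[{}]*".format(re.escape(ignored))
    # Keep only the significant characters of the artist name.
    core = [ch for ch in artist if ch not in ignored]
    # Interleave the optional gap before, between, and after those characters.
    return gap + gap.join(re.escape(ch) for ch in core) + gap


def build_path_pattern(root, artist):
    """Compile a pattern for root/<artist variant>/<file>.mp3 with / or \\ separators."""
    return re.compile(
        r"{}[\\/]{}[\\/][^\\/]*\.mp3$".format(re.escape(root), build_artist_pattern(artist))
    )


root = "/data/myfolder"
artists = ["Jay Z", "Destiny's Child"]
paths = [
    "/data/myfolder/Jay Z/some_file1.mp3",
    "/data/myfolder/Jay-Z/some_file2.mp3",
    "/data/myfolder/JayZ/some_file3.mp3",
    "/data/myfolder/Destiny's Child/some_file4.mp3",
    "/data/myfolder/Destinys Child/some_file5.mp3",
]

# Map each artist to the sample paths its pattern accepts.
matched = {
    artist: [p for p in paths if build_path_pattern(root, artist).search(p)]
    for artist in artists
}
```

With the five sample paths from the question, the "Jay Z" pattern accepts the first three and the "Destiny's Child" pattern accepts the last two. The generated pattern grows with the length of the artist name, which is usually acceptable for folder names.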

pattern = re.compile(r"/data/myfolder/.*[^/]/.*\.mp3")

Try doing it like this.

Put it inside brackets, [{}]+ :

pattern = r"\/data\/myfolder\/[{}]+\/.*\.mp3".format(artist)
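
To see what this bracket form accepts, here is a quick check against the sample paths from the question. It assumes the artist name contains nothing that is special inside a character class (such as ], ^ or \), and the yaJ folder is a made-up extra case.

```python
import re

artist = "Jay Z"
# [Jay Z]+ means "one or more characters drawn from J, a, y, space, Z".
pattern = re.compile(r"/data/myfolder/[{}]+/.*\.mp3".format(artist))

candidates = [
    "/data/myfolder/Jay Z/some_file1.mp3",
    "/data/myfolder/Jay-Z/some_file2.mp3",
    "/data/myfolder/JayZ/some_file3.mp3",
    "/data/myfolder/yaJ/other.mp3",  # hypothetical folder, not in the question
]

# True where the pattern finds a match in the path.
results = {path: bool(pattern.search(path)) for path in candidates}
```

The class matches Jay Z and JayZ, but not Jay-Z, because the hyphen is not among the artist's own characters. It also accepts any reordering such as yaJ, so it is looser than an exact name match.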
