RegEx: Is there a way to ignore a specific set of characters in a string and still match?

Let's say I have some music files organized (poorly) by artist name, for example:

/data/myfolder/Jay Z/some_file1.mp3
/data/myfolder/Jay-Z/some_file2.mp3
/data/myfolder/JayZ/some_file3.mp3
/data/myfolder/Destiny's Child/some_file4.mp3
/data/myfolder/Destinys Child/some_file5.mp3

I want to run some batch operations using regex matching. However, I want to ignore the special characters within the artists' names when finding my matches. I could programmatically replace special characters with Python, but I'm wondering if it's possible to do it completely with the regex pattern.

For example, the following code would only work on some_file1.mp3 and some_file4.mp3 as it is currently written:

import os
import re

artists = ["Jay Z", "Destiny's Child"]
root = "/data/myfolder/"

for filepath in os.listdir(root):
    for artist in artists:
        pattern = r"\/data\/myfolder\/{}\/.*.mp3".format(artist)
        match = re.search(pattern, filepath)

        if match:
            ...do some stuff...
           

Is there some way to modify my regex pattern from r"\/data\/myfolder\/{}\/.*.mp3".format(artist) so that it would successfully match even when there is a dash, single quote, or other specified special character within the string? Basically, I'm trying to ignore the presence of certain characters anywhere in a string when looking for a match.

First things first: for filepath in os.listdir(root) only gives you the names of the subfolders inside root, not the files in them. To reach the files, use os.walk:

for dirpath, dirnames, filenames in os.walk(root):
    if not dirnames:
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
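
To make the difference concrete, here is a minimal sketch that builds a throwaway directory tree and compares the two calls. The helper name list_files_in_leaf_dirs and the temporary folder layout are illustrative assumptions, not part of the original question:

```python
import os
import tempfile


def list_files_in_leaf_dirs(root):
    """Return full paths of files found in directories that have no subfolders."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if not dirnames:  # leaf directory: nothing further to descend into
            for filename in filenames:
                found.append(os.path.join(dirpath, filename))
    return sorted(found)


# Hypothetical sample tree mimicking the layout from the question.
with tempfile.TemporaryDirectory() as root:
    for artist, name in [("Jay Z", "some_file1.mp3"), ("Jay-Z", "some_file2.mp3")]:
        os.makedirs(os.path.join(root, artist))
        open(os.path.join(root, artist, name), "w").close()

    # os.listdir only sees the artist folders directly under root.
    top_level = sorted(os.listdir(root))
    # os.walk reaches the files; paths are made relative for readability.
    walked = [os.path.relpath(p, root) for p in list_files_in_leaf_dirs(root)]

print(top_level)
print(walked)
```

Note that the "if not dirnames" check restricts the result to leaf folders, so files sitting directly in a folder that also contains subfolders are skipped.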

Now, if you want a regex that ignores any chars of your choice inside a fixed string used as part of the pattern, one option is the fuzzy matching capabilities of the PyPI regex module. The idea is to remove all the ignored chars from the artists items, and then allow any number of insertions of these chars in the artist subfolder part.

See the Python code:

import regex, os
artists = ["Jay Z", "Destiny's Child"]
artists = [regex.sub(r"[',. -]+", "", s) for s in artists]
root = r'/data/myfolder'

        
for dirpath, dirnames, filenames in os.walk(root):
    if not dirnames:
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            for artist in artists:
                pattern = r"{}[\\/](?:{}){{i:[',. -]}}[\\/][^\\/]*\.mp3$".format(regex.escape(root), artist)
                match = regex.search(pattern, filepath)
                if match:
                    print(match.group())

Note the [\\/] is used to match both Windows and Linux folder separators. I also added a space to the list of ignored chars.

The artists = [regex.sub(r"[',. -]+", "", s) for s in artists] is the prep step to remove ignored chars from the artists subfolder names.

The regex looks like /data/myfolder[\\/](?:DestinysChild){i:[',. -]}[\\/][^\\/]*\.mp3$ :

  • /data/myfolder - a literal root part
  • [\\/] - a / or \ char
  • (?:DestinysChild){i:[',. -]} - the DestinysChild string with any number of space, apostrophe, hyphen, dot or comma insertions
  • [\\/] - a / or \ char
  • [^\\/]* - zero or more chars other than / and \
  • \.mp3$ - .mp3 at the end of string.
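
If installing the third-party regex package is not an option, the same "strip the ignored chars, then allow them back anywhere" idea can be roughly approximated with the standard-library re module. This is only a sketch under that assumption and not the fuzzy matching feature described above. The names IGNORED and artist_pattern are made up for illustration:

```python
import re

IGNORED = "',. -"  # same set of ignored chars as in the answer above


def artist_pattern(root, artist):
    """Compile a regex for root/<artist, ignored chars allowed anywhere>/<file>.mp3."""
    ignored_class = "[" + re.escape(IGNORED) + "]*"
    # Drop the ignored chars from the artist name, then allow any number of
    # them before, between and after the remaining characters.
    core = [re.escape(c) for c in artist if c not in IGNORED]
    fuzzy_artist = ignored_class + "".join(c + ignored_class for c in core)
    return re.compile(
        re.escape(root) + r"[\\/]" + fuzzy_artist + r"[\\/][^\\/]*\.mp3$"
    )


paths = [
    "/data/myfolder/Jay Z/some_file1.mp3",
    "/data/myfolder/Jay-Z/some_file2.mp3",
    "/data/myfolder/JayZ/some_file3.mp3",
    "/data/myfolder/Destiny's Child/some_file4.mp3",
    "/data/myfolder/Destinys Child/some_file5.mp3",
]

for artist in ["Jay Z", "Destiny's Child"]:
    pattern = artist_pattern("/data/myfolder", artist)
    print(artist, [p for p in paths if pattern.search(p)])
```

The trade-off is that the pattern grows with the length of the artist name, whereas the fuzzy quantifier keeps the artist part readable.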

Try it like this (note that this pattern accepts any subfolder name under /data/myfolder, not a specific artist):

pattern = re.compile(r"/data/myfolder/.*[^/]/.*\.mp3")

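Put the artist name inside a character class, [{}]+. Note that this matches any run of characters taken from the artist name, so it accepts JayZ but not Jay-Z unless the hyphen is added to the class:

pattern = r"\/data\/myfolder\/[{}]+\/.*\.mp3".format(artist)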
