
Regex to split file paths into groups?

This is the structure of the files:

(number)/firstdirectory/unimportant/unimportant/lastdirectory.DAT

I need to write a regex that will place the number, the first directory, and the last directory in groups 1, 2, and 3 respectively.

Examples of other files (the ones I use to test):

(1)/Downloads/Maps/Map of Places.pdf
(25)/Publications/1995Publications.pdf
(31)/Table-of-Contents.pdf

This is what I have:

import re

reggie = r"^.* \(([0-9]*)\)(.*)\/([^\/]*)\.(.*)$"


with open('test2.txt') as f:
    lines = f.readlines()

for line in lines:
    match = re.search(reggie, line)
    if match:
        num = match.group(1)
        sub = match.group(2)
        file = match.group(3)
        print(num, sub, file)

What I hope to get is:

    1 Downloads Map of Places
    25 Publications 1995Publications
    31 Table-of-Contents (assumes there's no first directory and just takes the last)

What I end up getting is:

    1 /Downloads/Maps Map of Places
    25 /Publications 1995Publications
    31  Table-of-Contents

It's very close. The only problem is that when there are more than two directories, the middle ones are captured along with the first one, and there is an unnecessary forward slash before the first directory.

I've been thinking about this for a couple of hours, and I'm stumped. My best attempt was to require a forward slash after the number, to remove the leading slash from the output, and then add an optional slash after the first directory for cases where there are more than two directories.

Like this:

    reggie = r"^.*\(([0-9]*)\)\/(.*)\/*([^\/]*)\.(.*)$"

However, with this, all the directories merge into one and there is no last directory.

Any help would be appreciated. It seems like there should be a simple solution, but I must be looking at it all wrong.
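
To make the symptom concrete, here is a small sketch that runs the second pattern, copied unchanged, against the first sample path. The variable names `second_attempt` and `sample` are only for illustration. The greedy `(.*)` takes everything up to the last dot, and because both `\/*` and `([^\/]*)` are allowed to match nothing, the third group is left empty:

```python
import re

# The second attempt from the question, copied unchanged.
second_attempt = r"^.*\(([0-9]*)\)\/(.*)\/*([^\/]*)\.(.*)$"

sample = "(1)/Downloads/Maps/Map of Places.pdf"
groups = re.search(second_attempt, sample).groups()

# Group 2 holds the whole path up to the extension; group 3 is empty.
print(groups)  # ('1', 'Downloads/Maps/Map of Places', '', 'pdf')
```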

First of all, a regex is not the best tool for this job; the standard-library pathlib module is a better fit.

If you want to use a regex anyway, here is one that works:

import re
# Group 1: the number; group 2: the optional first directory; group 3: the file name without its extension.
regex = re.compile(r"\((\d+)\)(?:/([^/]+))?.*/([^\.]+)\..*$")
paths = [
    "(1)/Downloads/Maps/Map of Places.pdf",
    "(25)/Publications/1995Publications.pdf",
    "(31)/Table-of-Contents.pdf",
]
for path in paths:
    print(regex.match(path).groups())

Output:

('1', 'Downloads', 'Map of Places')
('25', 'Publications', '1995Publications')
('31', None, 'Table-of-Contents')
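
The pattern in the question starts with `^.* \(`, which suggests the real lines may carry some text before the number. As a hedged sketch under that assumption, the same pattern can be wrapped in a small helper that uses `re.search` instead of `re.match`, tolerates lines that do not match, and turns a missing first directory into an empty string. The names `PATH_RE` and `split_path` are hypothetical and not part of the original answer:

```python
import re

# Same pattern as above; re.search lets it match after an arbitrary prefix.
PATH_RE = re.compile(r"\((\d+)\)(?:/([^/]+))?.*/([^\.]+)\..*$")

def split_path(line):
    """Return (number, first_directory, file_name), or None if the line does not match."""
    match = PATH_RE.search(line.strip())
    if match is None:
        return None
    number, first_dir, name = match.groups()
    # The optional group is None when the file sits directly under the number.
    return number, first_dir or "", name

for line in ["(1)/Downloads/Maps/Map of Places.pdf\n", "(31)/Table-of-Contents.pdf\n"]:
    print(split_path(line))
```

Note that `[^\.]+` stops at the first dot, so a file name that itself contains dots (for example a version number) would be cut short at that point.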

Instead of a regex, though, I recommend pathlib. It is more reliable and handles the path conventions of different operating systems:

import pathlib
paths = ["(1)/Downloads/Maps/Map of Places.pdf","(25)/Publications/1995Publications.pdf","(31)/Table-of-Contents.pdf"]
for path in map(pathlib.PurePath, paths):  # Convert all paths to PurePaths
    path_parts = path.parts
    number = path_parts[0]
    filename = path.stem
    root_directory = path_parts[1] if len(path_parts) > 2 else None
    print((number, root_directory, filename))

Output (note that the number keeps its parentheses):

('(1)', 'Downloads', 'Map of Places')
('(25)', 'Publications', '1995Publications')
('(31)', None, 'Table-of-Contents')
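
The question asks for the bare number, while the first path component still includes the parentheses. A minimal variant, assuming the parentheses only ever wrap the number, strips them with `str.strip`. It also uses `PurePosixPath` so that `/` is treated as the separator on every platform. The helper name `split_with_pathlib` is made up for this sketch:

```python
from pathlib import PurePosixPath

def split_with_pathlib(raw):
    """Return (number, first_directory, file_name) for a '/'-separated path."""
    path = PurePosixPath(raw)
    parts = path.parts
    number = parts[0].strip("()")  # "(25)" -> "25"
    # There is a first directory only if something sits between the number and the file.
    first_dir = parts[1] if len(parts) > 2 else None
    return number, first_dir, path.stem

for raw in ["(1)/Downloads/Maps/Map of Places.pdf", "(31)/Table-of-Contents.pdf"]:
    print(split_with_pathlib(raw))
```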
