Regex to match the first file in a rar archive file set in Python

I need to uncompress all the files in a directory, and for this I need to find the first file in the set. I'm currently doing this with a bunch of if statements and loops. Can I do this using a regex?

Here's a list of files that I need to match:

yes.rar
yes.part1.rar
yes.part01.rar
yes.part001.rar
yes.r01
yes.r001

These should NOT be matched:

no.part2.rar
no.part02.rar
no.part002.rar
no.part011.rar
no.r002
no.r02

I found a similar regex in this thread, but it seems that Python doesn't support variable-length lookbehinds. A single-line regex would be complicated, but I'll document it well, so that's not a problem. It's just one of those problems you beat your head against.

Thanks in advance guys.

:)

Don't rely on the names of the files to determine which one is first. You're going to end up finding an edge case where you get the wrong file.

RAR's headers will tell you which file is the first one in the volume set, assuming the archives were created with a somewhat recent version of RAR.

HEAD_FLAGS Bit flags:
2 bytes

0x0100 - First volume (set only by RAR 3.0 and later)

So open up each file and examine the RAR headers, looking specifically for the flag that indicates which file is the first volume. This will never fail, as long as the archive isn't corrupt.
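A minimal sketch of that header check, assuming the older RAR 3.x/4.x layout: a 7-byte marker block, followed by a main header made of HEAD_CRC (2 bytes), HEAD_TYPE (1 byte, 0x73), HEAD_FLAGS (2 bytes) and HEAD_SIZE (2 bytes), all little-endian. The function and constant names are made up for illustration, and this does not handle the newer RAR 5 format or self-extracting archives.

```python
import struct

# Assumed layout (RAR 3.x/4.x): marker block, then the main archive header.
RAR4_MARKER = b"Rar!\x1a\x07\x00"  # 7-byte marker block
MAIN_HEAD_TYPE = 0x73              # HEAD_TYPE value of the main archive header
FIRST_VOLUME_FLAG = 0x0100         # "First volume" bit in HEAD_FLAGS


def has_first_volume_flag(data):
    """Return True if the leading bytes of a RAR file carry the 0x0100 flag."""
    if len(data) < 14 or not data.startswith(RAR4_MARKER):
        return False
    # HEAD_CRC (2), HEAD_TYPE (1), HEAD_FLAGS (2), HEAD_SIZE (2), little-endian
    _crc, head_type, head_flags, _size = struct.unpack_from("<HBHH", data, 7)
    if head_type != MAIN_HEAD_TYPE:
        return False
    return bool(head_flags & FIRST_VOLUME_FLAG)


def file_has_first_volume_flag(path):
    """Hypothetical helper: read only the first 14 bytes of the file."""
    with open(path, "rb") as f:
        return has_first_volume_flag(f.read(14))
```

A standalone archive that is not split into volumes may not carry this flag at all, so that case would probably need separate handling.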


Update: I've just confirmed this by taking a look at some spanning archives in a hex editor. The file headers are constructed exactly as the link above indicates. It's just a matter of opening the files and reading the header for that flag. The file with that flag is the first volume.

There's no need to use look behind assertions for this. Since you start looking from the beginning of the string, you can do everything with look-aheads that you can with look-behinds. This should work:

^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$
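As a quick check, here is a small sketch that runs this pattern over the file names from the question. The variable and function names are only illustrative.

```python
import re

# The pattern from above, compiled once.
FIRST_VOLUME = re.compile(r"^((?!\.part(?!0*1\.rar$)\d+\.rar$).)*\.(?:rar|r?0*1)$")

should_match = ["yes.rar", "yes.part1.rar", "yes.part01.rar",
                "yes.part001.rar", "yes.r01", "yes.r001"]
should_not_match = ["no.part2.rar", "no.part02.rar", "no.part002.rar",
                    "no.part011.rar", "no.r002", "no.r02"]


def is_first_volume(name):
    """Return True if the file name looks like the first file of a set."""
    return FIRST_VOLUME.match(name) is not None


for name in should_match + should_not_match:
    print(name, is_first_volume(name))
```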

To capture the first part of the filename as you requested, you could do this:

^((?:(?!\.part\d+\.rar$).)*)\.(?:(?:part0*1\.)?rar|r?0*1)$
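A short sketch of how the capturing version might be used in Python; group 1 holds the part of the name before the volume suffix. The helper name `archive_base_name` is made up for this example.

```python
import re

# The capturing pattern from above; group 1 is the base name.
FIRST_VOLUME_BASE = re.compile(
    r"^((?:(?!\.part\d+\.rar$).)*)\.(?:(?:part0*1\.)?rar|r?0*1)$"
)


def archive_base_name(name):
    """Return the base name if `name` is a first volume, otherwise None."""
    m = FIRST_VOLUME_BASE.match(name)
    return m.group(1) if m else None


for name in ["yes.rar", "yes.part001.rar", "yes.r01", "no.part2.rar"]:
    print(name, "->", archive_base_name(name))
```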

Are you sure you want to match these cases?

yes.r01

They are not the first archives: .rar always is.

The order is bla.rar, bla.r00, and only then bla.r01. You'll probably extract the files twice if you match both .r01 and .rar as the first archive.

yes.r001

.r001 doesn't exist. Do you mean the .001 files that WinRAR supports? After .r99, it's .s00. If it does exist, then somebody manually renamed the files.

In theory, matching on filename should be as reliable as matching on the 0x0100 flag to find the first archive.
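The old-style naming order described in this answer (.rar first, then .r00, .r01 and so on up to .r99, then .s00) can be sketched as a sort key. This is only an illustration: the function name is invented, and it assumes the letter keeps advancing the same way after .s99.

```python
import re

# Old-style volume extensions: ".rar", or a letter from r onwards plus two digits.
OLD_STYLE_EXT = re.compile(r"\.(?:rar|([r-z])(\d{2}))$", re.IGNORECASE)


def old_style_volume_index(filename):
    """Return 0 for .rar, 1 for .r00, 2 for .r01, ..., or None if not a volume."""
    m = OLD_STYLE_EXT.search(filename)
    if m is None:
        return None
    if m.group(1) is None:
        return 0  # plain .rar is always the first volume
    letter = m.group(1).lower()
    # Each letter covers 100 volumes: r00-r99, s00-s99, ...
    return 1 + (ord(letter) - ord("r")) * 100 + int(m.group(2))


names = ["bla.r01", "bla.s00", "bla.rar", "bla.r99", "bla.r00"]
print(sorted(names, key=old_style_volume_index))
```

With a key like this, the first archive of an old-style set is simply the one whose index is 0.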
