What's the way to extract file extension from file name in Python?

The file names are dynamic and I need to extract the file extension. The file names look like this:

parallels-workstation-parallels-en_US-6.0.13976.769982.run.sh
20090209.02s1.1_sequence.txt
SRR002321.fastq.bz2
hello.tar.gz
ok.txt

For 20090209.02s1.1_sequence.txt I want to extract txt, for SRR002321.fastq.bz2 I want to extract fastq.bz2, and for hello.tar.gz I want to extract tar.gz.

I am using the os module to get the file extension:

import os.path
extension = os.path.splitext('hello.tar.gz')[1][1:]

This gives me only gz. That behaviour is fine for a name like ok.txt, but for this one I want the extension to be tar.gz.

import os

def splitext(path):
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            return path[:-len(ext)], path[-len(ext):]
    return os.path.splitext(path)

assert splitext('20090209.02s1.1_sequence.txt')[1] == '.txt'
assert splitext('SRR002321.fastq.bz2')[1] == '.bz2'
assert splitext('hello.tar.gz')[1] == '.tar.gz'
assert splitext('ok.txt')[1] == '.txt'

Removing dot:

import os

def splitext(path):
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            path, ext = path[:-len(ext)], path[-len(ext):]
            break
    else:
        path, ext = os.path.splitext(path)
    return path, ext[1:]

assert splitext('20090209.02s1.1_sequence.txt')[1] == 'txt'
assert splitext('SRR002321.fastq.bz2')[1] == 'bz2'
assert splitext('hello.tar.gz')[1] == 'tar.gz'
assert splitext('ok.txt')[1] == 'txt'
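
If the set of compound extensions is known up front, the same idea can be driven by a table, which also covers the fastq.bz2 case from the question. This is only a sketch: KNOWN_COMPOUND_EXTS and split_known_ext are names made up for illustration, and the case-insensitive comparison is an assumption about what you want.

```python
import os

# Hypothetical table; extend it with whatever compound extensions your data uses.
KNOWN_COMPOUND_EXTS = ('.tar.gz', '.tar.bz2', '.fastq.bz2')

def split_known_ext(path, known=KNOWN_COMPOUND_EXTS):
    """Split off a known compound extension, else fall back to os.path.splitext."""
    # Try longer entries first so the most specific suffix wins.
    for ext in sorted(known, key=len, reverse=True):
        if path[-len(ext):].lower() == ext:
            return path[:-len(ext)], path[-len(ext):]
    return os.path.splitext(path)

print(split_known_ext('SRR002321.fastq.bz2'))  # ('SRR002321', '.fastq.bz2')
print(split_known_ext('hello.tar.gz'))         # ('hello', '.tar.gz')
print(split_known_ext('ok.txt'))               # ('ok', '.txt')
```

Anything not in the table falls through to the standard single-extension split, so the table only needs the exceptions.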

Your rules are arbitrary. How is the computer supposed to guess when it's OK for the extension to have a . in it?

At best you'll have to keep a set of exceptional extensions, e.g. {'.bz2', '.gz'}, and add some extra logic yourself:

>>> paths = """20090209.02s1.1_sequence.txt
... SRR002321.fastq.bz2
... hello.tar.gz
... ok.txt""".splitlines()
>>> import os
>>> def my_split_ext(path):
...     name, ext = os.path.splitext(path)
...     if ext in {'.bz2', '.gz'}:
...         name, ext2 = os.path.splitext(name)
...         ext = ext2 + ext
...     return name, ext
... 
>>> list(map(my_split_ext, paths))
[('20090209.02s1.1_sequence', '.txt'), ('SRR002321', '.fastq.bz2'), ('hello', '.tar.gz'), ('ok', '.txt')]

>>> import re
>>> re.search(r'\.(.*)', 'hello.tar.gz').groups()[0]
'tar.gz'

Obviously the above assumes there's a . in the name, but it doesn't look like os.path will do what you want here.
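
To make that caveat concrete: the pattern captures everything after the first dot, so it only gives the intended answer when the base name itself contains no dots. A small check, with first_dot_ext as a hypothetical helper name:

```python
import re

def first_dot_ext(filename):
    """Return everything after the first dot, or None if there is no dot."""
    match = re.search(r'\.(.*)', filename)
    return match.group(1) if match else None

print(first_dot_ext('hello.tar.gz'))                  # tar.gz
print(first_dot_ext('20090209.02s1.1_sequence.txt'))  # 02s1.1_sequence.txt
print(first_dot_ext('README'))                        # None
```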

Well, you could keep iterating on root until ext is empty. In other words:

import os

filename = "hello.tar.gz"
extensions = []
root, ext = os.path.splitext(filename)
while ext:
    # Extensions are collected right to left: ['.gz', '.tar']
    extensions.append(ext)
    root, ext = os.path.splitext(root)

# do something if extensions length is greater than 1
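
On Python 3, pathlib exposes the same information without the loop: PurePath.suffixes lists every dotted component of the final path segment, left to right. A short sketch using file names from the question:

```python
from pathlib import PurePath

for name in ['hello.tar.gz', 'ok.txt', '20090209.02s1.1_sequence.txt']:
    # suffixes is a list such as ['.tar', '.gz']
    print(name, PurePath(name).suffixes)

# hello.tar.gz ['.tar', '.gz']
# ok.txt ['.txt']
# 20090209.02s1.1_sequence.txt ['.02s1', '.1_sequence', '.txt']
```

As with the loop, dots inside the base name are reported as extensions too, so you still need your own rule for which components count.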

I know this is a very old topic, but for others coming across this topic I want to share my solution (I agree it depends on your program logic).

I only needed the base name without the extension. You can apply splitext as often as you want: it returns (base, ext), where ext is empty when no extension is found. So for files with a single or double extension (.txt and .tar.gz, for instance), the following always returns the base name:

base = os.path.splitext(os.path.splitext(filename)[0])[0]
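
A quick check against file names from the question shows where this works and where it strips too much. strip_two_exts is just an illustrative name for the one-liner above:

```python
import os

def strip_two_exts(filename):
    """Apply os.path.splitext twice and keep only the remaining base name."""
    return os.path.splitext(os.path.splitext(filename)[0])[0]

print(strip_two_exts('hello.tar.gz'))                  # hello
print(strip_two_exts('ok.txt'))                        # ok
print(strip_two_exts('20090209.02s1.1_sequence.txt'))  # 20090209.02s1
```

The last case loses .1_sequence, which is part of the name rather than an extension, so this approach only suits names whose base contains no dots.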

splitext is usually not a good option if you expect your filenames to contain dots. Instead I prefer:

>>> import re
>>> re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$").search("blabla.blublu.tmp").groupdict()
{'name': 'blabla.blublu', 'extension': 'tmp'}
>>> re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$").search("blabla.blublu.tmpmoreblabla").groupdict()
{'name': 'blabla.blublu.tmpmoreblabla', 'extension': None}
>>> re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$").search("blabla.blublu.tmpmoreblabla.ext").groupdict()
{'name': 'blabla.blublu.tmpmoreblabla', 'extension': 'ext'}

Just check the second case, "blabla.blublu.tmpmoreblabla": if that is a filename without an extension, splitext still returns .tmpmoreblabla as the extension. The only assumptions this code makes are:

  1. The input is always a non-empty string
  2. The filename and extension may contain any character
  3. The file extension is between 1 and 4 characters long (anything longer is treated as part of the name, not as an extension)
  4. The string ends with the file extension

Of course you can use unnamed groups by just removing the ?P<...> parts, but I prefer named groups in this case.
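
The pattern can be compiled once and wrapped in a function. The sketch below assumes the four conditions listed above; split_short_ext and EXT_PATTERN are names made up here. Note that because the extension group accepts any character, a short tail such as b.c is treated as a single extension.

```python
import re

# Same pattern as above, written as a raw string and compiled once.
EXT_PATTERN = re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$")

def split_short_ext(filename):
    """Return (name, extension); extension is None if no 1-4 character tail is found."""
    match = EXT_PATTERN.search(filename)
    if match is None:  # happens for an empty string
        raise ValueError("expected a non-empty file name")
    return match.group('name'), match.group('extension')

print(split_short_ext('hello.tar.gz'))         # ('hello.tar', 'gz')
print(split_short_ext('SRR002321.fastq.bz2'))  # ('SRR002321.fastq', 'bz2')
print(split_short_ext('a.b.c'))                # ('a', 'b.c')
```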
