What's the way to extract file extension from file name in Python?

The file names are dynamic and I need to extract the file extension. The file names look like this: parallels-workstation-parallels-en_US-6.0.13976.769982.run.sh

20090209.02s1.1_sequence.txt
SRR002321.fastq.bz2
hello.tar.gz
ok.txt

For the first one I want to extract txt, for the second one I want to extract fastq.bz2, and for the third one I want to extract tar.gz.

I am using the os module to get the file extension:

import os.path
extension = os.path.splitext('hello.tar.gz')[1][1:]

This gives me only gz, which is fine when the file name is ok.txt, but for this one I want the extension to be tar.gz.

import os

def splitext(path):
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            return path[:-len(ext)], path[-len(ext):]
    return os.path.splitext(path)

assert splitext('20090209.02s1.1_sequence.txt')[1] == '.txt'
assert splitext('SRR002321.fastq.bz2')[1] == '.bz2'
assert splitext('hello.tar.gz')[1] == '.tar.gz'
assert splitext('ok.txt')[1] == '.txt'

Removing the dot:

import os

def splitext(path):
    for ext in ['.tar.gz', '.tar.bz2']:
        if path.endswith(ext):
            path, ext = path[:-len(ext)], path[-len(ext):]
            break
    else:
        path, ext = os.path.splitext(path)
    return path, ext[1:]

assert splitext('20090209.02s1.1_sequence.txt')[1] == 'txt'
assert splitext('SRR002321.fastq.bz2')[1] == 'bz2'
assert splitext('hello.tar.gz')[1] == 'tar.gz'
assert splitext('ok.txt')[1] == 'txt'

Your rules are arbitrary: how is the computer supposed to guess when it's OK for the extension to have a . in it?

At best you'll need a set of exceptional extensions, e.g. {'.bz2', '.gz'}, and add some extra logic yourself:

>>> paths = """20090209.02s1.1_sequence.txt
... SRR002321.fastq.bz2
... hello.tar.gz
... ok.txt""".splitlines()
>>> import os
>>> def my_split_ext(path):
...     name, ext = os.path.splitext(path)
...     if ext in {'.bz2', '.gz'}:
...         name, ext2 = os.path.splitext(name)
...         ext = ext2 + ext
...     return name, ext
... 
>>> list(map(my_split_ext, paths))  # list() is needed on Python 3, where map() is lazy
[('20090209.02s1.1_sequence', '.txt'), ('SRR002321', '.fastq.bz2'), ('hello', '.tar.gz'), ('ok', '.txt')]

>>> import re
>>> re.search(r'\.(.*)', 'hello.tar.gz').groups()[0]
'tar.gz'

Obviously the above assumes there's a . in the file name, but it doesn't look like os.path will do what you want here.
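
As a rough sketch of how the no-dot case could be handled without a regex, str.partition splits at the first dot and returns empty strings when there is none. The helper name ext_after_first_dot is hypothetical, not part of the answer above. Like the regex, it treats everything after the first dot as the extension, so it over-matches on names such as 20090209.02s1.1_sequence.txt.

```python
def ext_after_first_dot(filename):
    """Return everything after the first dot, or '' if there is no dot."""
    # partition() returns (head, separator, tail); tail is '' when no dot exists
    _, _, ext = filename.partition('.')
    return ext


print(ext_after_first_dot('hello.tar.gz'))
print(ext_after_first_dot('README'))
```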

Well, you could keep iterating on root until ext is empty. In other words:

import os

filename = "hello.tar.gz"
extensions = []
root, ext = os.path.splitext(filename)
while ext:
    extensions.append(ext)  # collected right to left: '.gz', then '.tar'
    root, ext = os.path.splitext(root)

# do something if extensions length is greater than 1
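
One way the loop could be wrapped up, as a sketch only: the function name all_extensions is made up for illustration, and joining the pieces back into a single string is an assumption about what "do something" might mean. Note that it peels off every dot-separated piece, so a name like 20090209.02s1.1_sequence.txt ends up with a much longer "extension" than the question wants.

```python
import os


def all_extensions(filename):
    """Apply os.path.splitext repeatedly and return (root, joined_extensions)."""
    extensions = []
    root, ext = os.path.splitext(filename)
    while ext:
        extensions.append(ext)
        root, ext = os.path.splitext(root)
    # splitext peels suffixes off from the right, so restore left-to-right order
    extensions.reverse()
    return root, ''.join(extensions)


print(all_extensions('hello.tar.gz'))
print(all_extensions('20090209.02s1.1_sequence.txt'))
```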

I know this is a very old topic, but for others coming across it I want to share my solution (I agree it depends on your program logic).

I only needed the base name without the extension. You can call splitext as often as you want: it returns (base, ext), where base is always the base name and ext contains an extension only if one was found. So for files with a single or double period (.txt and .tar.gz, for instance) the following always returns the base name:

base = os.path.splitext(os.path.splitext(filename)[0])[0]
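
A quick sketch of how that one-liner behaves on some of the names from the question. The wrapper name base_name is hypothetical. It works for ok.txt and hello.tar.gz, but a name with more dots loses two pieces, which may or may not be what you want.

```python
import os


def base_name(filename):
    """Strip up to two trailing dot-suffixes from filename."""
    # The inner call removes the last suffix; the outer call removes one more,
    # or leaves the name unchanged when no suffix remains.
    return os.path.splitext(os.path.splitext(filename)[0])[0]


for name in ['ok.txt', 'hello.tar.gz', '20090209.02s1.1_sequence.txt']:
    print(name, '->', base_name(name))
```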

splitext usually is not a good option if you expect your file names to contain dots. Instead I prefer:

>>> import re
>>> re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$").search("blabla.blublu.tmp").groupdict()
{'name': 'blabla.blublu', 'extension': 'tmp'}
>>> re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$").search("blabla.blublu.tmpmoreblabla").groupdict()
{'name': 'blabla.blublu.tmpmoreblabla', 'extension': None}
>>> re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$").search("blabla.blublu.tmpmoreblabla.ext").groupdict()
{'name': 'blabla.blublu.tmpmoreblabla', 'extension': 'ext'}

Just check the second case, "blabla.blublu.tmpmoreblabla": if that is a file name without an extension, splitext still returns tmpmoreblabla as the extension. The only assumptions this code makes are:

  1. You always have a non-empty string as input
  2. Your file name and extension can contain any character
  3. Your file extension is between 1 and 4 characters long (if it is longer, it is not considered an extension but part of the name)
  4. Your string ends with the file extension

Of course you can use unnamed groups by just removing ?P<>, but I prefer named groups in this case.
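
For repeated use, the same pattern could be compiled once and wrapped in a function. This is only a sketch: EXT_RE and split_name are names made up here, and the 1 to 4 character limit is the assumption listed above. On the file names from the question it yields only the last suffix, so hello.tar.gz gives gz rather than tar.gz.

```python
import re

# Same pattern as above, compiled once. An "extension" is at most 4 characters.
EXT_RE = re.compile(r"(?P<name>.+?)(\.(?P<extension>.{1,4}))?$")


def split_name(filename):
    """Return (name, extension); extension is None when nothing qualifies."""
    # Assumes a non-empty string: search() returns None for ''.
    match = EXT_RE.search(filename)
    return match.group('name'), match.group('extension')


print(split_name('hello.tar.gz'))
print(split_name('SRR002321.fastq.bz2'))
print(split_name('README'))
```

Because . in the extension group also matches a literal dot, a very short compound suffix still fits inside the 4-character window: split_name('a.b.c') gives ('a', 'b.c').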
