
Python - Check for exact string in file name

I have a folder where each file is named after a number (e.g. img 1, img 2, img-3, 4-img, etc.). I want to get files by exact string, so if I enter '4' as input, it should only return files with '4' and not files containing '14' or '40', for example. My problem is that the program returns every file whose name contains the string. Note that the numbers aren't always in the same spot (for some files it's at the end, for others it's in the middle).

For instance, if my folder has the files ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep'] and I want only files with the exact number 4 in them, then I would only want to return ['ep 4', 'img4', '4xxx', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep'].

Here is what I have (in this case I only want to return mp4 files):

for (root, dirs, file) in os.walk(source_folder):
    for f in file:
        if '.mp4' and ('4') in f:
            print(f)

I tried == instead of in.

Judging by your inputs, your desired regular expression needs to meet the following criteria:

  1. Match the number provided, exactly
  2. Ignore number matches in the file extension, if present
  3. Handle file names that include spaces

I think this will meet all these requirements:

import re

def generate(n):
    return re.compile(r'^[^.\d]*' + str(n) + r'[^.\d]*(\..*)?$')

def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(f)]

Usage:

>>> check_files(4, ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4'])
['ep 4', 'img4', '4xxx', 'file 4.mp4']

Note that this solution involves creating a Pattern object and using that object to check each file. This strategy offers a performance benefit over calling re.fullmatch with the pattern and filename directly, as the pattern does not have to be compiled (or fetched from the re module's cache) on each call.

This solution does have one drawback: it assumes that filenames are formatted as name.extension and that the value you're searching for is in the name part. Because of the greedy nature of regular expressions, if you allow file names containing . then you won't be able to exclude extensions from the search. Ergo, modifying this to match ep.4 would also cause it to match file.mp4. That being said, there is a workaround for this, which is to strip extensions from the file name before doing the match:

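import re

def generate(n):
    # '.' is no longer excluded: the extension is removed before matching
    return re.compile(r'^[^\d]*' + str(n) + r'[^\d]*$')

def strip_extension(f):
    # str.removesuffix requires Python 3.9 or later
    return f.removesuffix('.mp4')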

def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(strip_extension(f))]

Note that this solution now allows . in the match condition and does not exclude an extension. Instead, it relies on preprocessing (the strip_extension function, which here only handles .mp4) to remove the file extension from the filename before matching.
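The strip_extension above only knows about .mp4. As a rough sketch of one way to generalize it, the version below uses os.path.splitext and drops the suffix only when it appears in a set of known extensions. KNOWN_EXTENSIONS and its contents are assumptions made for illustration, not part of the original answer. The membership check matters because splitext alone would treat the '.4 ' in 'ep.4 ' as an extension and throw the number away.

```python
import os
import re

# Hypothetical set of extensions to ignore; adjust it for your own files.
KNOWN_EXTENSIONS = {'.mp4', '.mkv', '.avi'}

def generate(n):
    # Non-digits, then the number, then non-digits
    return re.compile(r'^[^\d]*' + str(n) + r'[^\d]*$')

def strip_extension(f):
    # Remove the suffix only if it is a recognized extension
    name, ext = os.path.splitext(f)
    return name if ext.lower() in KNOWN_EXTENSIONS else f

def check_files(n, files):
    regex = generate(n)
    return [f for f in files if regex.fullmatch(strip_extension(f))]

files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4',
         'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ', '404ep']
print(check_files(4, files))
# ['ep 4', 'img4', '4xxx', 'file 4.mp4', 'ep.4.', 'ep.4 ', 'ep. 4. ', 'ep4xxx', 'ep 4 ']
```

Note that '404ep' is not returned, since both of its 4s sit next to another digit.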

As an addendum, occasionally you'll get files that have the number prefixed with zeroes (e.g. 004, 0001, etc.). You can modify the regular expression to handle this case as well:

def generate(n):
    return re.compile(r'^[^\d]*0*' + str(n) + r'[^\d]*$')
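To see what the added 0* does, here is a small check of the zero-padded pattern. The sample names are made up for illustration.

```python
import re

def generate(n):
    # 0* permits any number of leading zeroes directly before the number
    return re.compile(r'^[^\d]*0*' + str(n) + r'[^\d]*$')

# Hypothetical file names, with and without zero padding
names = ['ep 004', 'ep 4', 'ep 040', 'ep 400', '0004 ep']
regex = generate(4)
matched = [name for name in names if regex.fullmatch(name)]
print(matched)  # ['ep 004', 'ep 4', '0004 ep']
```

Only zeroes in front of the number are tolerated, so 'ep 040' and 'ep 400' are still rejected.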

We can use re.search along with a list comprehension for a regex option:

import re

files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
num = 4
regex = r'(?<!\d)' + str(num) + r'(?!\d)'
output = [f for f in files if re.search(regex, f)]
print(output)  # ['ep 4', 'img4', '4xxx', 'file.mp4', 'file 4.mp4']
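It may not be obvious why digit lookarounds are used here rather than the more familiar \b word boundary. The comparison below is a sketch of the difference: letters and digits are both word characters, so \b sees no boundary between the g and the 4 in 'img4', whereas (?<!\d) and (?!\d) only care whether the neighbouring character is a digit.

```python
import re

files = ['ep 4', 'xxx 3 ', 'img4', '4xxx', 'ep-40', 'file.mp4', 'file 4.mp4']
num = 4

# \b requires a word/non-word transition on both sides of the number
boundary = re.compile(r'\b' + str(num) + r'\b')
# The lookarounds only require that no digit touches the number
lookaround = re.compile(r'(?<!\d)' + str(num) + r'(?!\d)')

with_boundary = [f for f in files if boundary.search(f)]
with_lookaround = [f for f in files if lookaround.search(f)]

print(with_boundary)    # ['ep 4', 'file 4.mp4']
print(with_lookaround)  # ['ep 4', 'img4', '4xxx', 'file.mp4', 'file 4.mp4']
```

The lookaround version still matches the 4 in the .mp4 extension, so 'file.mp4' is included unless the extension is removed before searching.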

This can be accomplished with the following function:

import os


files = ["ep 4", "xxx 3 ", "img4", "4xxx", "ep-40", "file.mp4", "file 4.mp4"]
desired_output = ["ep 4", "img4", "4xxx", "file 4.mp4"]


def number_filter(files, number):
    filtered_files = []
    for file_name in files:

        # if the number is not present, we can skip this file
        if file_name.count(str(number)) == 0:
            continue

        # if the number is present in the extension, but not in the file name, we can skip this file
        name, ext = os.path.splitext(file_name)

        if (
            isinstance(ext, str)
            and ext.count(str(number)) > 0
            and isinstance(name, str)
            and name.count(str(number)) == 0
        ):
            continue

        # if the number is present in the file name, we must determine if it's part of a different number
        # (only the first occurrence is examined)
        num_index = file_name.index(str(number))

        # if the number is at the beginning of the file name
        if num_index == 0:
            # check if the next character exists and is a digit
            # (slicing avoids an IndexError when the name is just the number)
            next_char = file_name[num_index + len(str(number)):][:1]
            if next_char.isdigit():
                continue

        # if the number is at the end of the file name
        elif num_index == len(file_name) - len(str(number)):
            # check if the previous character is a digit
            if file_name[num_index - 1].isdigit():
                continue

        # if it's somewhere in the middle
        else:
            # check if the previous and next characters are digits
            if (
                file_name[num_index - 1].isdigit()
                or file_name[num_index + len(str(number))].isdigit()
            ):
                continue

        print(file_name)
        filtered_files.append(file_name)

    return filtered_files


output = number_filter(files, 4)

for file in output:
    assert file in desired_output

for file in desired_output:
    assert file in output
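The function above inspects only the first occurrence of the number, so a name such as '14 ep 4' would be skipped even though it contains a standalone 4. One possible alternative, sketched below under the assumption that comparing whole digit runs is acceptable, extracts every run of digits and compares each as an integer. The function name number_filter_by_runs and the extensions parameter are hypothetical and not part of the original answer.

```python
import os
import re

def number_filter_by_runs(files, number, extensions=(".mp4",)):
    """Keep names in which some complete run of digits equals `number`."""
    kept = []
    for file_name in files:
        name, ext = os.path.splitext(file_name)
        # Ignore the suffix only when it is one of the assumed extensions
        if ext.lower() not in extensions:
            name = file_name
        # Every maximal run of digits, e.g. '14 ep 4' -> ['14', '4']
        digit_runs = re.findall(r"\d+", name)
        # Comparing as integers also accepts zero-padded runs such as '004'
        if any(int(run) == number for run in digit_runs):
            kept.append(file_name)
    return kept

files = ["ep 4", "xxx 3 ", "img4", "4xxx", "ep-40", "file.mp4", "file 4.mp4", "14 ep 4"]
print(number_filter_by_runs(files, 4))
# ['ep 4', 'img4', '4xxx', 'file 4.mp4', '14 ep 4']
```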


 