How do I list all directories that do not contain certain file types?

Question

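I'm trying to return a unique list (a set) of all directories that do not contain certain file types. If the file type is not found, the directory name is added to a list for further auditing.

The function below finds all valid folders and adds them to a set for further comparison. I'd like to extend it to return only those directories that do not contain files with the extensions in out_list. These directories can contain subdirectories that hold files with extensions in out_list. In that case, I only want the path of the folder name of the valid directory.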

# directory = r'w:\workorder'
#
# Example:
# w:\workorder\region1\12345678\hi.pdf
# w:\workorder\region2\23456789\test\bye.pdf
# w:\workorder\region3\34567891\<empty>
# w:\workorder\region4\45678912\Final.doc
# 
# Results:
# ['34567891', '45678912']

import os
import re

job_folders = set()  # a set keeps its entries unique
out_list = [".pdf", ".ppt", ".txt"]

def get_filepaths(directory):
    """
    This function will generate the file names in a directory
    tree by walking the tree either top-down or bottom-up. For each
    directory in the tree rooted at directory top (including top itself),
    it yields a 3-tuple (dirpath, dirnames, filenames).
    """

    folder_paths = []  # List which will store all of the full filepaths.

    # Walk the tree.

    for item in os.listdir(directory):
        if os.path.isdir(os.path.join(directory, item)):
            folderpath = os.path.join(directory, item) # Join the two strings in order to form the full folderpath.
            if re.search('^[0-9]', item):
                job_folders.add(item[:8])
            folder_paths.append(folderpath)  # Add it to the list.
    return folder_paths

Answer 1

Is this what you want?

import os

def main():
    exts = {'.pdf', '.ppt', '.txt'}
    for directory in get_directories_without_exts('W:\\workorder', exts):
        print(directory)

def get_directories_without_exts(root, exts):
    for root, dirs, files in os.walk(root):
        for file in files:
            if os.path.splitext(file)[1] in exts:
                break
        else:
            yield root

if __name__ == '__main__':
    main()

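EDIT: After reviewing your requirements, I decided to create a tree object to analyze your directory structure. Once it is built, a cached recursive query makes it easy to find out whether a directory "is okay". From there, writing a generator that finds only the top-level directories that are "not okay" is fairly simple. There may be a better way to do this, but the code should at least work.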

import os

def main():
    exts = {'.pdf', '.ppt', '.txt'}
    for directory in Tree('W:\\workorder', exts).not_okay:
        print(directory)

class Tree:

    def __init__(self, root, exts):
        if not os.path.isdir(root):
            raise ValueError('root must be a directory')
        self.name = root
        self.exts = exts
        self.files = set()
        self.directories = []
        try:
            names = os.listdir(root)
        except OSError:
            pass
        else:
            for child in names:
                path = os.path.join(root, child)
                if os.path.isfile(path):
                    self.files.add(os.path.splitext(child)[1])
                elif os.path.isdir(path):
                    self.directories.append(self.__class__(path, exts))
        self._is_okay = None

    @property
    def is_okay(self):
        if self._is_okay is None:
            self._is_okay = any(c.is_okay for c in self.directories) or \
                            any(c in self.exts for c in self.files)
        return self._is_okay

    @property
    def not_okay(self):
        if self.is_okay:
            for child in self.directories:
                for not_okay in child.not_okay:
                    yield not_okay
        else:
            yield self.name

if __name__ == '__main__':
    main()
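
The same "top-most directories with no matching file anywhere below them" query can also be sketched without a class, by walking the tree bottom-up so that each child is classified before its parent. This is only an illustrative alternative under that reading of the requirement; the function name dirs_without_exts is hypothetical and not part of the answer above.

```python
import os


def dirs_without_exts(root, exts):
    """Return the top-most directories under root whose whole subtree
    holds no file with an extension in exts."""
    has_match = {}  # directory path -> True if its subtree holds a match
    children = {}   # directory path -> list of child directory paths

    # Bottom-up walk: every child is classified before its parent.
    for path, dirnames, filenames in os.walk(root, topdown=False):
        kids = [os.path.join(path, d) for d in dirnames]
        children[path] = kids
        own = any(os.path.splitext(f)[1] in exts for f in filenames)
        has_match[path] = own or any(has_match.get(k, False) for k in kids)

    result, stack = [], [root]
    while stack:
        path = stack.pop()
        if has_match.get(path, False):
            # Something below matches, so look deeper for bad subtrees.
            stack.extend(children.get(path, []))
        else:
            # Nothing in this subtree matches; report it and stop descending.
            result.append(path)
    return sorted(result)
```

Calling dirs_without_exts('W:\\workorder', {'.pdf', '.ppt', '.txt'}) is intended to report the same directories as Tree(...).not_okay, returned as a sorted list rather than a generator.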

Answer 2

To get the file extension:

name,ext = os.path.splitext(os.path.join(directory,item))
if ext not in out_list:
    job_folders.add(item[:8])
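
The snippet above tests the extension of a single name. To decide about a whole folder, every file in it has to be checked. A minimal sketch of that idea follows; the helper names are hypothetical, and lowercasing the extension (so that "hi.PDF" counts as a ".pdf") is an assumption the question does not state.

```python
import os


def has_listed_ext(filename, exts):
    """True if the filename's extension, lowercased, is in exts."""
    return os.path.splitext(filename)[1].lower() in exts


def folder_has_listed_files(folder, exts):
    """True if folder directly contains a regular file with a listed extension."""
    for name in os.listdir(folder):
        path = os.path.join(folder, name)
        # Skip subdirectories; only regular files are compared.
        if os.path.isfile(path) and has_listed_ext(name, exts):
            return True
    return False
```

A folder would then be added to job_folders only when folder_has_listed_files returns False for it.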

Answer 3

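Did you copy and paste the existing code from somewhere else? Because the docstring appears to be that of os.walk...

Your question is unclear on several points:

1. You state that the goal of the code is to "return a unique list (set) of all directories if they do not contain certain file types".
   - First, list and set are different data structures.
   - Second, your code creates one of each: job_folders is a set of folder names that contain digits, while folder_paths is a list of full paths to folders, whether or not they contain digits.
   - What do you actually want as the output here?
2. Should "those directories that do not contain files in the out_list" be defined recursively, or should it cover only the first-level contents of those directories? My solution assumes the latter.
   - Your example is contradictory on this point: it shows 34567891 in the results, but not region3. Whether or not the definition is recursive, region3 should be included in the results, because region3 does not contain any files with the listed extensions.
3. Should job_folders be populated only with directories that satisfy the criterion about their contents, or with all folder names that contain digits? My solution assumes the latter.

One poor practice I'd highlight is your use of the global variables out_list and job_folders. I've changed the former to a second parameter of get_filepaths and the latter to a second return value.

Anyway, here is the solution...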

import os, re

ext_list = [".pdf", ".ppt", ".txt"]

def get_filepaths(directory, ext_list):
    folder_paths = []  # List which will store all of the full filepaths.
    job_folders = set([])

    # Walk the tree.

    for dir, subdirs, files in os.walk(directory):
        _, lastlevel = os.path.split(dir)
        if re.search('^[0-9]', lastlevel):
            job_folders.add(lastlevel[:8])

        for item in files:
            root, ext = os.path.splitext(item)
            if ext in ext_list:
                break
        else:
            # Since none of the file extensions matched ext_list, add it to the list of folder_paths
            folder_paths.append(os.path.relpath(dir, directory))

    return folder_paths, job_folders

I created a directory structure identical to yours under /tmp and ran the following:

folder_paths, job_folders = get_filepaths( os.path.expandvars(r"%TEMP%\workorder"), ext_list )

print "folder_paths =", folder_paths
print "job_folders =", job_folders

This is the output:

folder_paths = ['.', 'region1', 'region2', 'region2\\23456789', 'region3', 'region3\\34567891', 'region4', 'region4\\456789123']
job_folders = set(['12345678', '23456789', '34567891', '45678912'])

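As you can see, region1\\12345678 and region2\\23456789\\test are not included in the output folder_paths, because they directly contain files with the specified extensions. All other subdirectories are included in the output, because they do not directly contain files with the specified extensions.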
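
To make the difference between the two readings concrete, here is a rough sketch that computes both in one walk: directories with no matching file directly inside them (the first-level reading used in this answer) and directories with no matching file anywhere in their subtree (the recursive reading). The function name dirs_without_matches is hypothetical, and the code is an illustration rather than part of the solution above.

```python
import os


def dirs_without_matches(root, exts):
    """Return (first_level, whole_subtree): sorted relative paths of the
    directories that lack matching files under each definition."""
    root = os.path.normpath(root)
    all_dirs = []
    direct = set()     # directories that directly hold a matching file
    recursive = set()  # directories with a matching file anywhere below

    for path, _subdirs, files in os.walk(root):
        all_dirs.append(path)
        if any(os.path.splitext(name)[1] in exts for name in files):
            direct.add(path)
            # Mark this directory and each ancestor up to root.
            ancestor = path
            while ancestor not in recursive:
                recursive.add(ancestor)
                if ancestor == root:
                    break
                ancestor = os.path.dirname(ancestor)

    def rel(paths):
        return sorted(os.path.relpath(p, root) for p in paths)

    first_level = rel(p for p in all_dirs if p not in direct)
    whole_subtree = rel(p for p in all_dirs if p not in recursive)
    return first_level, whole_subtree
```

On the example tree from the question, the first list corresponds to folder_paths above, while the second keeps only region3, region4 and the job folders beneath them.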

Answer 4

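Thanks to @DanLenski and @NoctisSkytower, I was able to work this out. When walking in_path, my WorkOrder directory is always the 7th folder down, which I locate by splitting the path on os.sep.
I borrowed from both of your solutions and came up with the following: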

import os, re

ext_list = [".pdf"]
in_path = r'\\server\E\Data\WorkOrder'

def get_filepaths(directory, ext_list):
    not_okay = set([])  # Set which will store Job folder where no ext_list files found
    okay = set([]) # Set which will store Job folder where ext_list files found
    job_folders = set([]) #valid Job ID folder

    # Walk the tree.
    for dir, subdirs, files in os.walk(directory):

        for item in files:
            root, ext = os.path.splitext(item)

            if len(dir.split(os.sep)) >= 8: #Tree must contain Job ID folder
                job_folder = dir.split(os.sep)[7]
                if ext in ext_list:
                    okay.add(job_folder)
                else:  # This file's extension is not in ext_list
                    not_okay.add(job_folder)

    bad_list = list(not_okay - okay)
    bad_list.sort()

    return bad_list

bad_list = get_filepaths( os.path.expandvars(in_path), ext_list )
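
A possible variation, sketched under the assumption that job folders always sit a fixed number of levels below the directory being walked (two levels in the question's example: region, then job ID). It derives the job folder from the path relative to the root instead of a hard-coded index into the absolute path, and it records every job folder it walks, so a completely empty job folder is reported as well. The function name and the job_depth parameter are hypothetical.

```python
import os


def job_folders_missing_exts(root, exts, job_depth=2):
    """Return sorted job folder names with no file in exts anywhere below them."""
    root = os.path.normpath(root)
    seen = set()  # every job folder encountered
    okay = set()  # job folders holding at least one matching file

    for path, _subdirs, files in os.walk(root):
        rel = os.path.relpath(path, root)
        parts = [] if rel == os.curdir else rel.split(os.sep)
        if len(parts) < job_depth:
            continue  # still above the job-folder level
        job = parts[job_depth - 1]
        seen.add(job)
        if any(os.path.splitext(name)[1] in exts for name in files):
            okay.add(job)

    return sorted(seen - okay)
```

With this layout assumption, job_folders_missing_exts(in_path, ext_list) would play the role of get_filepaths above.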


4 answers

Answer 1
1 2015-07-10 17:25:19

Answer 2
0 2015-07-10 17:25:02

Answer 3
0 accepted 2015-07-10 17:56:43

Answer 4
0 2015-07-13 20:43:49
