
Select randomly x files in subdirectories

I need to take exactly 10 files (images) at random from a dataset, but the dataset is hierarchically structured.

So for each subdirectory that contains images, I need to keep just 10 of them, chosen at random. Is there an easy way to do that, or should I do it manually?

import os
import random

def getListOfFiles(dirName, n=10):
    ### Collect up to n randomly chosen files from the given directory
    ### and from each of its sub directories
    listOfFile = os.listdir(dirName)
    allFiles = list()
    filesHere = list()
    ### Iterate over all the entries
    for entry in listOfFile:

        ### Create full path
        fullPath = os.path.join(dirName, entry)
        ### If entry is a directory then get the sampled files of this directory
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath, n)
        else:
            filesHere.append(fullPath)
    ### random.sample needs a list of paths, not a single path string
    allFiles = allFiles + random.sample(filesHere, min(n, len(filesHere)))
    return allFiles

dirName = 'C:/Users/bla/bla'

### Get the list of all files in directory tree at given path
listOfFiles = getListOfFiles(dirName)

with open("elements.txt", mode='x') as f:
    for elem in listOfFiles:
        f.write(elem + '\n')

A good way to sample from a directory listing of unknown size is reservoir sampling. With this approach, you don't have to list all the files in the directory up front: read the entries one by one and sample as you go. It even works when you have to sample a fixed number of files across multiple directories.

It would be good to use generator-based directory scanning code, which yields one file at a time, so you don't use gobs of memory up front to hold all the file names.
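As a rough illustration of generator-based scanning, here is a minimal sketch built on `os.scandir` from the standard library. The name `iter_files` is hypothetical (it is not part of the original post), and the choice not to follow symlinked directories is an assumption made to avoid cycles.

```python
import os

def iter_files(root):
    """Yield the path of every regular file under root, one at a time."""
    stack = [root]  # directories still to be visited
    while stack:
        current = stack.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    # Remember the subdirectory and visit it later
                    stack.append(entry.path)
                elif entry.is_file():
                    # Hand one path to the caller, then resume scanning
                    yield entry.path
```

Because it is a generator, a caller can feed `iter_files('C:/Users/bla/bla')` straight into a sampling loop without first building the full list of file names.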

The reservoir sampling step could look something like this (NB: untested code!)

import numpy as np
import os

def ResSampleFiles(dirname, N):
    """pick up to N files from directory"""

    sampled_files = list()
    k = 0  # number of files seen so far
    for item in os.scandir(dirname):
        if item.is_dir():
            continue
        full_path = os.path.join(dirname, item.name)
        if k < N:
            # fill the reservoir with the first N files
            sampled_files.append(full_path)
        else:
            # keep the new file with probability N/(k+1),
            # replacing a uniformly chosen reservoir slot
            idx = np.random.randint(0, k+1)
            if (idx < N):
                sampled_files[idx] = full_path
        k += 1

    return sampled_files
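To get 10 files for each subdirectory, as the question asks, the same idea can be applied once per directory while walking the tree. The following is only a sketch under a few assumptions: the function name `sample_per_directory` is made up for illustration, it uses the standard `random` module instead of NumPy, it treats every regular file as a candidate (no image extension filter), and it does not follow symlinked directories.

```python
import os
import random

def sample_per_directory(root, n=10):
    """Map each directory under root that holds files to up to n random files."""
    result = {}
    stack = [root]  # directories still to be visited
    while stack:
        current = stack.pop()
        sample = []  # reservoir for the current directory only
        seen = 0     # files seen so far in the current directory
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=False):
                    stack.append(entry.path)
                elif entry.is_file():
                    if seen < n:
                        # Fill the reservoir with the first n files
                        sample.append(entry.path)
                    else:
                        # Replace a slot with probability n / (seen + 1)
                        j = random.randrange(seen + 1)
                        if j < n:
                            sample[j] = entry.path
                    seen += 1
        if sample:
            result[current] = sample
    return result
```

Calling `sample_per_directory('C:/Users/bla/bla')` would return a dictionary keyed by directory path. Directories with fewer than `n` files contribute all of their files, and directories with no files are left out.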
