
Sorting a large amount of data into separate sets

I'm extracting up to 2500 frame files per experiment (the number varies). At the moment I manually divide the total number of frames by three and sort them into three subset folders, because the data is too large to convert into a single .mat file all at once. I simply want to automate this.

Once the files are separated into three subsets ('Subset1', 'Subset2', 'Subset3'), I run each folder through my code to convert and rename.

from scipy.io import savemat
import numpy as np
import os
arrays = []
directory = r"F:\...\Experiment 24\Imaging\Subset3" # **something here that will look at the whole directory and create a different file for each subset folder**
for filename in sorted(os.listdir(directory)):
    f = os.path.join(directory, filename)
    arrays.append(np.load(f))
data = np.array(arrays)
data = data.astype('uint16')


data = np.moveaxis(data, [0, 1, 2], [2, 1, 0])

savemat('24_subset3.mat', {'data': data})

How can I automatically sort my frame files into three separate subset folders and convert?

Create subsets from the filenames and copy them to new subset directories:

import os
import shutil

num_subsets = 3
in_dir = "/some/path/to/input"
out_dir = "/some/path/to/output/subsets"

filenames = sorted(os.listdir(in_dir))
chunk_size = len(filenames) // num_subsets

for i in range(num_subsets):
    subset = filenames[i * chunk_size : (i + 1) * chunk_size]

    # Create subset output directory.
    subset_dir = os.path.join(out_dir, f"subset_{i}")
    os.makedirs(subset_dir, exist_ok=True)

    for filename in subset:
        # os.listdir returns bare names, so prepend the input directory to the source.
        shutil.copyfile(os.path.join(in_dir, filename), os.path.join(subset_dir, filename))

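NOTE: Because chunk_size is computed with integer division, any extra files that cannot be distributed into equal subsets (up to num_subsets - 1 files at the end of the sorted list) will be skipped.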
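If skipping the leftover frames is not acceptable, one option is to hand the remainder out one file at a time to the first few subsets. The following is only a minimal sketch of that idea: the helper name split_evenly and the frame names are hypothetical and chosen for illustration, not taken from the question.

```python
def split_evenly(items, num_subsets):
    """Split items into num_subsets contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), num_subsets)
    subsets = []
    start = 0
    for i in range(num_subsets):
        # The first `extra` subsets each take one additional item.
        size = base + (1 if i < extra else 0)
        subsets.append(items[start:start + size])
        start += size
    return subsets


# Hypothetical frame names, used only for illustration.
frames = [f"frame_{n:04d}.npy" for n in range(10)]
for i, subset in enumerate(split_evenly(frames, 3)):
    print(i, len(subset), subset[0], subset[-1])
```

Each returned chunk could then replace the `subset` slice in the copy loop above, so that every file ends up in exactly one subset directory.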

If your goal is simply to create your three .mat files, you don't necessarily need to create subfolders and move your files around at all; you can iterate through subsets of them in place. You could manually calculate the indexes at which to divide into subsets, but more_itertools.divide is convenient and readable.

Additionally, pathlib is usually a more convenient way of manipulating paths and filenames. No more worrying about os.path.join! The paths yielded by Path.iterdir or Path.glob know where they're located, and don't need to be recombined with their parent.

from pathlib import Path

from more_itertools import divide
import numpy as np
from scipy.io import savemat


directory = Path("F:/.../Experiment 24/Imaging/")
subsets = divide(3, sorted(directory.iterdir()))

for index, subset in enumerate(subsets, start=1):
    arrays = [np.load(file) for file in subset]
    data = np.array(arrays).astype('uint16')
    data = np.moveaxis(data, [0, 1, 2], [2, 1, 0])
    savemat(f'24_subset{index}.mat', {'data': data})
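The point about pathlib above can be checked with a small standard-library sketch. It builds a throwaway directory with two empty placeholder files (the names are hypothetical) and shows that each path from Path.iterdir already carries its parent directory, so it can be passed directly to a loader such as np.load.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    directory = Path(tmp)

    # Create two empty placeholder files (hypothetical names).
    for name in ("frame_b.npy", "frame_a.npy"):
        (directory / name).touch()

    # iterdir yields full Path objects; sorting them orders by path.
    paths = sorted(directory.iterdir())

    names = [p.name for p in paths]
    parents_match = all(p.parent == directory for p in paths)
    all_exist = all(p.exists() for p in paths)
    print(names, parents_match, all_exist)
```

By contrast, os.listdir returns bare names, which is why the os-based version has to join each name back onto its directory.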
