Splitting a dataset of JPG and XML files into training and test sets

Question

I have a dataset for an object detection algorithm that contains pictures (.jpg) and corresponding .xml files with the bounding boxes.

I want to write a script that randomly splits the dataset into a train set and a test set, which means I have to make sure each .jpg ends up in the same directory as its corresponding .xml.

How should I edit the following code to achieve this?

Also, is this the "best" way of doing it, or is it better to split the dataset after the xml-to-csv conversion or after the csv-to-tfrecords conversion?

import shutil, os, glob, random

# List all files in a directory using os.listdir
basepath = '/home/createview/Vegard/createview/lice_detection_v2/workspace/images/Synced_dataset'
filenames = []

for entry in os.listdir(basepath):
    if os.path.isfile(os.path.join(basepath, entry)):
        #print(entry)
        filenames.append(entry)

filenames.sort()  # make sure that the filenames have a fixed order before shuffling
random.seed(230)
random.shuffle(filenames) # shuffles the ordering of filenames (deterministic given the chosen seed)

split = int(0.8 * len(filenames))
train_filenames = filenames[:split]
test_filenames = filenames[split:]

Answer 1

To me, the best option is to create two lists of files in matching order (filenames for the .jpg files and xmlnames for the .xml files), so that filenames[i] and xmlnames[i] belong to the same image, plus one list of indices: indices=[i for i in range(len(filenames))].
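One way to get the two lists into matching order is to pair the files by base name. The following is only a sketch: the helper name paired_lists and the sample file names are made up for illustration, and it assumes that an image and its annotation differ only in their extension.

```python
import os

def paired_lists(names):
    """Return (filenames, xmlnames) in matching order from a flat list of names.

    Hypothetical helper: only base names that have both a .jpg and a .xml
    are kept, so filenames[i] and xmlnames[i] always describe the same image.
    """
    jpg = {os.path.splitext(n)[0]: n for n in names if n.lower().endswith(".jpg")}
    xml = {os.path.splitext(n)[0]: n for n in names if n.lower().endswith(".xml")}
    stems = sorted(jpg.keys() & xml.keys())  # base names present in both groups
    filenames = [jpg[s] for s in stems]
    xmlnames = [xml[s] for s in stems]
    return filenames, xmlnames

# Illustrative input; in practice this could be the result of os.listdir(basepath).
filenames, xmlnames = paired_lists(["b.jpg", "a.xml", "a.jpg", "b.xml", "c.jpg"])
indices = [i for i in range(len(filenames))]
```

Here c.jpg has no annotation and is dropped, so it cannot shift the pairing of the remaining files.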

Then you can shuffle your indices list:

random.seed(230)
random.shuffle(indices)

Finally, you create your train and test sets for both your jpg and xml files:

split = int(0.8 * len(filenames))
file_train = [filenames[idx] for idx in indices[:split]]
file_test = [filenames[idx] for idx in indices[split:]]
xml_train = [xmlnames[idx] for idx in indices[:split]]
xml_test = [xmlnames[idx] for idx in indices[split:]]
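The steps above can be wrapped in a single function. This is a sketch rather than a definitive implementation: the name split_pairs, its parameters, and the sample file names are assumptions, and it uses a local random.Random(seed) instead of random.seed so that the global random state is left untouched.

```python
import random

def split_pairs(filenames, xmlnames, train_fraction=0.8, seed=230):
    """Split two aligned lists into train and test lists of (jpg, xml) pairs.

    Hypothetical helper: assumes filenames[i] and xmlnames[i] belong together.
    """
    if len(filenames) != len(xmlnames):
        raise ValueError("image and annotation lists differ in length")
    indices = list(range(len(filenames)))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed
    split = int(train_fraction * len(indices))
    train = [(filenames[i], xmlnames[i]) for i in indices[:split]]
    test = [(filenames[i], xmlnames[i]) for i in indices[split:]]
    return train, test

# Illustrative data only.
filenames = [f"img_{i}.jpg" for i in range(10)]
xmlnames = [f"img_{i}.xml" for i in range(10)]
train, test = split_pairs(filenames, xmlnames)
```

Because each index selects from both lists at once, an image and its annotation can never land in different sets.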

Answer 2

import shutil, os, glob, random

# List all files in a directory using os.listdir
basepath = 'images/'
labelpath='label/'
filenames = []
xmlnames = []

for entry in os.listdir(basepath):
    if os.path.isfile(os.path.join(basepath, entry)):
        print(entry)
        filenames.append(entry)
        
        
for entry in os.listdir(labelpath):
    if os.path.isfile(os.path.join(labelpath, entry)):
        print(entry)
        xmlnames.append(entry)

filenames.sort()  # sort both lists so that image i and label i share the same base name
xmlnames.sort()   # (this assumes every image has exactly one label with a matching name)
indices = [i for i in range(len(filenames))]
random.seed(230)
random.shuffle(indices)  # shuffle the indices, not the lists, so both lists are split identically (deterministic given the chosen seed)

split = int(0.8 * len(filenames))
file_train = [filenames[idx] for idx in indices[:split]]
file_test = [filenames[idx] for idx in indices[split:]]
xml_train = [xmlnames[idx] for idx in indices[:split]]
xml_test = [xmlnames[idx] for idx in indices[split:]]

print(file_test)
print(xml_test)

I followed the advice above (by Joseph) and added a list of indices. Because the same shuffled indices are used to build both the image lists and the label lists, each image ends up in the same set as its label. Hope this helps.
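The lists above only hold file names. To actually place each image and its annotation in the same directory, as the question asks, something like the following could be used. It is a sketch under assumptions: the helper copy_pairs and the train and test directory names are hypothetical, and the demonstration runs on empty placeholder files in a temporary directory.

```python
import os
import shutil
import tempfile

def copy_pairs(file_names, xml_names, image_dir, label_dir, dest_dir):
    """Copy each image and its annotation side by side into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    for jpg, xml in zip(file_names, xml_names):
        shutil.copy(os.path.join(image_dir, jpg), dest_dir)
        shutil.copy(os.path.join(label_dir, xml), dest_dir)

# Demonstration on throwaway files (not real images).
root = tempfile.mkdtemp()
image_dir = os.path.join(root, "images")
label_dir = os.path.join(root, "label")
os.makedirs(image_dir)
os.makedirs(label_dir)
for stem in ("a", "b"):
    open(os.path.join(image_dir, stem + ".jpg"), "w").close()
    open(os.path.join(label_dir, stem + ".xml"), "w").close()

copy_pairs(["a.jpg"], ["a.xml"], image_dir, label_dir, os.path.join(root, "train"))
copy_pairs(["b.jpg"], ["b.xml"], image_dir, label_dir, os.path.join(root, "test"))
train_contents = sorted(os.listdir(os.path.join(root, "train")))
test_contents = sorted(os.listdir(os.path.join(root, "test")))
```

With the variables from the script above, the calls would presumably look like copy_pairs(file_train, xml_train, basepath, labelpath, 'train/') and copy_pairs(file_test, xml_test, basepath, labelpath, 'test/'), where the two destination paths are placeholders.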

2 answers

Answer 1
0 accepted 2019-07-04 09:11:48

Answer 2
0 2021-10-20 06:32:20
