
VOCBboxDataset returns incorrect dataset size when applied to my dataset

I have a dataset of 250 images and 250 annotation files with two classes: ball and player. The folder also contains three text files, train.txt, val.txt and test.txt, listing the training, validation and test images respectively.

import os
import xml.etree.ElementTree as ET

import numpy as np
from chainercv.datasets import VOCBboxDataset

bball_labels = ('ball', 'player')

class BBall_dataset(VOCBboxDataset):
  def _get_annotations(self, i):
    id_ = self.ids[i]
    anno = ET.parse(
        os.path.join(self.data_dir, 'Annotations', id_ + '.xml'))
    bbox = []
    label = []
    difficult = []
    for obj in anno.findall('object'):
      bndbox_anno = obj.find('bndbox')
      # VOC coordinates are 1-based; subtract 1 to make them 0-based
      bbox.append([int(bndbox_anno.find(tag).text) - 1
                   for tag in ('ymin', 'xmin', 'ymax', 'xmax')])
      name = obj.find('name').text.lower().strip()
      label.append(bball_labels.index(name))
      # append one flag per object so all three arrays stay the same length
      difficult.append(0)
    bbox = np.stack(bbox).astype(np.float32)
    label = np.stack(label).astype(np.int32)
    # np.bool was removed in NumPy 1.24; use the builtin bool instead
    difficult = np.array(difficult, dtype=bool)
    return bbox, label, difficult

Out of the 250 images I have put 170 in train, 70 in val and 10 in test. But when I print the lengths of the train, val and test datasets, I always get train+12, val+3 and test unchanged. In this case they show as 182, 73 and 10 for train, val and test. The test value never changes; the train and val values are always inflated by 12 and 3.

valid_dataset = BBall_dataset('BasketballDataset', 'val')
test_dataset = BBall_dataset('BasketballDataset', 'test')
train_dataset = BBall_dataset('BasketballDataset', 'train') 

print('Number of images in "train" dataset:', len(train_dataset))
print('Number of images in "valid" dataset:', len(valid_dataset))
print('Number of images in "test" dataset:', len(test_dataset))

Number of images in "train" dataset: 182
Number of images in "valid" dataset: 73
Number of images in "test" dataset: 10

Why does this happen, and how can I prevent it? Does it also affect my training process in some way?

train.txt link ( https://imgur.com/B1Gszfi ), val.txt link ( https://imgur.com/kOcIZ5h )
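One quick sanity check is to compare the raw line count of each split file with the count of non-empty lines, since the dataset is built from those lists line by line. A minimal sketch, assuming the split files are where your folder layout puts them (the helper name is my own):

```python
def count_split_lines(split_path):
    """Return (total_lines, non_empty_lines) for a VOC-style split file.

    If the two numbers differ, the file contains blank lines and the
    dataset built from it will be larger than the real image count.
    """
    with open(split_path) as f:
        lines = f.read().splitlines()
    return len(lines), sum(1 for line in lines if line.strip())
```

Running this on train.txt and val.txt would show 182 vs 170 and 73 vs 70 if blank lines are the cause.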

The issue was due to a small overlooked detail: the text files had gaps because the image list was cut, copied and pasted within the same file. The files were created in Notepad, where the blank lines are easy to miss; they become visible when you view the files on GitHub, whose line numbering shows that the original length is preserved even after the list was cut down. For example, a list of 182 images was first created and later trimmed to 170, but the blank lines remained, so the dataset creation code reads every line of the file: 182 instead of 170. Make sure the number of lines matches the number of images to avoid this problem.
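The fix, then, is to remove the leftover blank lines from the split files. A minimal clean-up sketch (the helper name and in-place rewrite are my own; keep a backup of the file first):

```python
def remove_blank_lines(path):
    """Rewrite a split file so every remaining line is an image id."""
    with open(path) as f:
        ids = [line.strip() for line in f if line.strip()]
    with open(path, 'w') as f:
        f.write('\n'.join(ids) + '\n')
```

After running this on train.txt and val.txt, the reported dataset sizes should match the real counts of 170 and 70.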
