
How can I verify that my training job is reading the augmented manifest file?

Apologies for the long post.

Originally, I had all my data in one location in an S3 bucket and trained deep learning image classification models on it using the typical 'File' mode, passing the S3 URI where the data is stored as the training input. To try to speed up training, I wanted to switch to using:

  1. Pipe mode, to stream the data instead of downloading it all at the start of training, so that training starts faster and disk space is saved.
  2. An Augmented Manifest File coupled with 1., so that I don't have to place my data in a single location on S3 and can avoid moving data around when I train models (a sketch of the corresponding TrainingInput configuration follows below).
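
For reference, here is roughly how such a channel can be configured with the SageMaker Python SDK (v2); the manifest path is a placeholder and `estimator` stands for whichever Estimator is being used:

from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data='s3://my-bucket/manifests/train.manifest',  # placeholder manifest path
    s3_data_type='AugmentedManifestFile',
    attribute_names=['image-ref', 'label'],  # attributes streamed from each JSON line
    input_mode='Pipe',
    record_wrapping='RecordIO',
    content_type='application/x-recordio',
)
estimator.fit({'train': train_input})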

I modeled my script on the one in this example. I printed the steps performed when parsing the data, but I noticed that the data might not have been read, because the print output shows the following:

step 1 Tensor("ParseSingleExample/ParseExample/ParseExampleV2:0", shape=(), dtype=string)
step 2 Tensor("DecodePng:0", shape=(None, None, 3), dtype=uint8)
step 3 Tensor("Cast:0", shape=(None, None, 3), dtype=float32)

I guess the image is not being read/found, since the shape is [None, None, 3] when it should be [224, 224, 3], so maybe the problem is with the Augmented Manifest file?
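
Worth noting: tf.io.decode_png always reports a static shape of (None, None, 3) at graph-construction time, since the actual PNG dimensions are only known once a record is decoded, so this printout alone doesn't prove the image wasn't read. A common way to pin the shape, assuming the images really are 224x224 (raw_bytes is a placeholder name for the encoded PNG bytes):

image = tf.io.decode_png(raw_bytes, channels=3)
image.set_shape([224, 224, 3])               # assert a known, fixed size, or...
image = tf.image.resize(image, [224, 224])   # ...resize if sizes vary (returns float32)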

Here's an example of how my Augmented Manifest file is written:

{"image-ref": "s3://path/to/my/image/image1.png", "label": 1}
{"image-ref": "s3://path/to/my/image/image2.png", "label": 2}
{"image-ref": "s3://path/to/my/image/image3.png", "label": 3}

Some other details I should probably mention:

  1. When I create the TrainingInput I pass 'content_type': 'application/x-recordio', 'record_wrapping': 'RecordIO', even though my data is in .png format; I assumed that as the augmented manifest file is read, the data gets wrapped in the RecordIO format.
  2. Following my first point, I pass PipeModeDataset(channel=channel, record_format='RecordIO'), so I'm also not sure about the RecordIO part (see the raw-record check sketched below).
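
One way to answer the title question directly, sketched under the assumption that sagemaker_tensorflow is installed in the training container and TF 2.x eager execution is available: pull a single raw record off the pipe before any parsing and print its first bytes, to see what is actually being streamed:

from sagemaker_tensorflow import PipeModeDataset

raw_ds = PipeModeDataset(channel='train', record_format='RecordIO')
for record in raw_ds.take(1):
    # If image contents are being streamed, this should start with the
    # PNG magic bytes (b'\x89PNG'); otherwise inspect what arrives instead.
    print(record.numpy()[:64])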

There isn't an actual error raised. When I start fitting the model, nothing happens: it keeps running, but nothing actually executes, so I'm trying to find the issue.


EDIT: It now reads the shape correctly, but there's still the issue where it enters the .fit method and does nothing, just keeps running without doing anything. Find part of the script below.

import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from sagemaker_tensorflow import PipeModeDataset

AUTOTUNE = tf.data.experimental.AUTOTUNE

def train_input_fn(train_channel):
    """Returns the input function that feeds the model during training."""
    return _input_fn(train_channel)

def _input_fn(channel):
    """Returns a Dataset which reads from a SageMaker Pipe mode channel."""
    features = {
        'image-ref': tf.io.FixedLenFeature([], tf.string),
        # nb: declared as a length-3 vector, while the manifest above shows scalar labels
        'label': tf.io.FixedLenFeature([3], tf.int64),
    }

    def combine(records):  # nb: not used in the snippet below
        return records[0], records[1]

    def parse(record):
        parsed = tf.io.parse_single_example(record, features)

        # Decode the streamed bytes as a PNG and force the expected shape.
        image = tf.io.decode_png(parsed["image-ref"], channels=3, dtype=tf.uint8)
        image = tf.reshape(image, [224, 224, 3])

        lbl = parsed['label']
        print(image, lbl)
        return (image, lbl)

    ds = PipeModeDataset(channel=channel, record_format='RecordIO')
    ds = ds.map(parse, num_parallel_calls=AUTOTUNE)
    ds = ds.prefetch(AUTOTUNE)

    return ds

def model(dataset):
    """Generate a simple model."""
    inputs = Input(shape=(224, 224, 3))
    prediction_layer = Dense(2, activation='softmax')

    x = inputs
    # nb: include_top=False yields 4-D feature maps of shape (None, 7, 7, 1024);
    # a pooling/flatten step is typically needed before a Dense head.
    x = tf.keras.applications.mobilenet.MobileNet(include_top=False, input_shape=(224, 224, 3), weights='imagenet')(x)
    outputs = prediction_layer(x)
    rec_model = tf.keras.Model(inputs, outputs)

    rec_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
        metrics=['accuracy']
    )

    rec_model.fit(
        dataset
    )

    return rec_model

def main(params):

    epochs = params['epochs']
    train_channel = params['train_channel']
    record_format = params['record_format']
    batch_size = params['batch_size']

    train_spec = train_input_fn(train_channel)
    # nb: batch_size is read but never applied; Keras' Model.fit treats each
    # dataset element as a batch, so a ds.batch(batch_size) step is typically
    # needed in _input_fn.
    model_classifier = model(train_spec)

From here:

A PipeModeDataset can read TFRecord, RecordIO, or text line records.

While you're trying to read binary (PNG) files; I don't see a record reader here that would let you do that.
You could build your own pipe format implementation as shown here, but that's considerably more effort.
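
If staying with Pipe mode matters, one workable pattern (a sketch with placeholder paths, not part of your original setup) is to pack the PNGs into TFRecord files up front, so the channel carries a record format PipeModeDataset does understand:

import tensorflow as tf

def write_tfrecords(examples, output_path):
    """One-off conversion: pack (png_path, label) pairs into a TFRecord file."""
    with tf.io.TFRecordWriter(output_path) as writer:
        for png_path, label in examples:
            image_bytes = tf.io.gfile.GFile(png_path, 'rb').read()
            example = tf.train.Example(features=tf.train.Features(feature={
                'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
                'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
            }))
            writer.write(example.SerializeToString())

The training side would then read the channel with PipeModeDataset(channel, record_format='TFRecord'), parse each record with tf.io.parse_single_example, and decode the 'image' bytes with tf.io.decode_png.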

Alternatively, you mentioned that your files are scattered across different folders; if their common path contains fewer than 2M files, you could use FastFile mode to stream the data. Currently, FastFile only supports an S3 prefix, so you won't be able to use a manifest.
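
A minimal sketch of that option, assuming the SageMaker Python SDK v2 and a placeholder prefix:

from sagemaker.inputs import TrainingInput

train_input = TrainingInput(
    s3_data='s3://my-bucket/common/prefix/',  # common S3 prefix (placeholder)
    input_mode='FastFile',
)
# The channel then shows up as a local path under /opt/ml/input/data/<channel>,
# with file contents streamed lazily from S3 on first read.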

Also see this general pros/cons discussion of the different storage and input types available in SageMaker.
