
How can I verify that my training job is reading the augmented manifest file?

Apologies for the long post.

Originally, my data lived in a single location in an S3 bucket, and I trained deep learning image classification models on it using the typical 'File' mode, passing the S3 URI where the data is stored as the training input. To speed up training, I wanted to switch to using:

  1. Pipe mode, to stream data instead of downloading all of it at the start of the training job, so training starts faster and disk space is saved.
  2. An Augmented Manifest File coupled with 1., so that I don't have to keep my data in a single S3 location and can avoid moving data around when I train models.

I modeled my script on the one in this example. I printed each step of the data-parsing pipeline, and I noticed that the data might not actually be read, because the prints show the following:

step 1 Tensor("ParseSingleExample/ParseExample/ParseExampleV2:0", shape=(), dtype=string)
step 2 Tensor("DecodePng:0", shape=(None, None, 3), dtype=uint8)
step 3 Tensor("Cast:0", shape=(None, None, 3), dtype=float32)

I guess the image is not being read/found, since the shape is [None, None, 3] when it should be [224, 224, 3], so maybe the problem is with the Augmented Manifest file?

Here's an example of how my Augmented Manifest file is written:

{"image-ref": "s3://path/to/my/image/image1.png", "label": 1}
{"image-ref": "s3://path/to/my/image/image2.png", "label": 2}
{"image-ref": "s3://path/to/my/image/image3.png", "label": 3}
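For reference, an augmented manifest is just a JSON Lines file: one complete JSON object per line, with consistent keys across lines. A minimal sketch of generating one (the records and the `manifest.jsonl` filename are placeholders, not from the original post):

```python
import json

# Hypothetical records; in practice these are your real S3 URIs and labels.
records = [
    {"image-ref": "s3://path/to/my/image/image1.png", "label": 1},
    {"image-ref": "s3://path/to/my/image/image2.png", "label": 2},
    {"image-ref": "s3://path/to/my/image/image3.png", "label": 3},
]

# JSON Lines: one JSON object per line, no enclosing array, no trailing commas.
with open("manifest.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")
```

Note that the key names ("image-ref", "label") must match the `attribute_names` you later declare on the training channel.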

Some other details I should probably mention:

  1. When I create the Training Input I pass 'content_type': 'application/x-recordio', 'record_wrapping': 'RecordIO', even though my data are in .png format; I assumed that as the augmented manifest file is read, the data get wrapped in the RecordIO format.
  2. Following on from 1., I pass PipeModeDataset(channel=channel, record_format='RecordIO'), so I'm also not sure about the RecordIO part.
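For context, this is roughly how such a channel is usually declared with the SageMaker Python SDK's `TrainingInput` (a sketch only; the bucket and key below are placeholders, not the asker's actual paths):

```python
from sagemaker.inputs import TrainingInput

# Hypothetical manifest location.
train_input = TrainingInput(
    s3_data="s3://my-bucket/manifests/train.manifest",
    s3_data_type="AugmentedManifestFile",
    # Must list the manifest's JSON keys, in the order the records
    # should be emitted into the pipe.
    attribute_names=["image-ref", "label"],
    content_type="application/x-recordio",
    record_wrapping="RecordIO",
    input_mode="Pipe",
)
```

With `record_wrapping="RecordIO"`, SageMaker wraps each emitted record in RecordIO framing as it streams the channel, which is why the question's setup pairs it with `record_format='RecordIO'` on the `PipeModeDataset` side.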

No actual error is raised; when I start fitting the model, nothing happens. The job keeps running but makes no progress, so I'm trying to find the issue.


EDIT: It now reads the shape correctly, but the issue remains: it enters the .fit method and does nothing, just keeps running without making progress. Part of the script is below.

import tensorflow as tf
from sagemaker_tensorflow import PipeModeDataset

AUTOTUNE = tf.data.experimental.AUTOTUNE

def train_input_fn(train_channel):
    """Returns an input function that feeds the model during training."""
    return _input_fn(train_channel)

def _input_fn(channel):
    """Returns a Dataset which reads from a SageMaker Pipe mode channel."""
    features = {
        'image-ref': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([3], tf.int64),
    }

    def parse(record):
        parsed = tf.io.parse_single_example(record, features)
        image = tf.io.decode_png(parsed['image-ref'], channels=3, dtype=tf.uint8)
        image = tf.reshape(image, [224, 224, 3])
        lbl = parsed['label']
        print(image, lbl)
        return (image, lbl)

    ds = PipeModeDataset(channel=channel, record_format='RecordIO')
    ds = ds.map(parse, num_parallel_calls=AUTOTUNE)
    ds = ds.prefetch(AUTOTUNE)
    return ds

def model(dataset):
    """Generate a simple model"""
    inputs = Input(shape=(224, 224, 3))
    prediction_layer = Dense(2, activation='softmax')

    x = inputs
    x = tf.keras.applications.mobilenet.MobileNet(include_top=False, input_shape=(224, 224, 3), weights='imagenet')(x)
    outputs = prediction_layer(x)
    rec_model = tf.keras.Model(inputs, outputs)

    rec_model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
        metrics=['accuracy']
    )

    rec_model.fit(
        dataset
    )

    return rec_model

def main(params):
    epochs = params['epochs']
    train_channel = params['train_channel']
    record_format = params['record_format']
    batch_size = params['batch_size']

    train_spec = train_input_fn(train_channel)
    model_classifier = model(train_spec)

From here:

A PipeModeDataset can read TFRecord, RecordIO, or text line records.

But you're trying to read binary (PNG) files, and I don't see a relevant record reader here that can help you do that.
You could build your own pipe format implementation, as shown here, but it's considerably more effort.

Alternatively, you mentioned your files are scattered across different folders; if their common S3 prefix contains fewer than 2M files, you could use FastFile mode to stream the data. Currently, FastFile only supports an S3 prefix, so you won't be able to use a manifest.
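If FastFile mode fits, the channel declaration is simpler than the augmented-manifest setup above: you point at a common prefix and set the input mode. A sketch with a placeholder bucket/prefix (not from the original post):

```python
from sagemaker.inputs import TrainingInput

# Hypothetical prefix; FastFile streams objects under the prefix on demand,
# so the training script can read them as ordinary local files.
train_input = TrainingInput(
    s3_data="s3://my-bucket/training-images/",
    s3_data_type="S3Prefix",
    input_mode="FastFile",
)
```

Because FastFile presents the objects as files on disk, the PNGs can then be read with a plain `tf.data.Dataset` pipeline instead of `PipeModeDataset`.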

Also see this general pros/cons discussion of the different storage and input types available in SageMaker.
