
How to INCLUDE certain pre-processing step into model for Tensorflow serving

I have built a model with different features. For the preprocessing I mainly used feature_columns, for instance for bucketizing GEO information or for embedding categorical data with a large number of distinct values. Additionally, I had to preprocess two of my features before handing them to feature_columns:
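For illustration, a minimal sketch of the kind of feature_columns meant here (the boundary values and bucket sizes are made-up placeholders):

import tensorflow as tf

# Bucketize a numeric GEO feature such as LATITUDE (hypothetical boundaries)
latitude = tf.feature_column.numeric_column('LATITUDE')
latitude_buckets = tf.feature_column.bucketized_column(
    latitude, boundaries=[46.5, 47.5, 48.5])

# Embed a categorical feature with many distinct values (hypothetical size)
sc = tf.feature_column.categorical_column_with_hash_bucket('SC', hash_bucket_size=1000)
sc_embedding = tf.feature_column.embedding_column(sc, dimension=8)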

Feature “STREET”

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

def __preProcessStreet(data, tokenizer=None):
    # Strip common street suffixes and normalize the string
    data['STREETPRO'] = data['STREET'].apply(
        lambda x: __getNormalizedString(x, ["gasse", "straße", "strasse", "str.", "g.", " "], False))

    # Fit the tokenizer only when none is passed in (i.e. on training data)
    if tokenizer is None:
        tokenizer = Tokenizer(split='XXX')  # 'XXX' never occurs, so each street stays one token
        tokenizer.fit_on_texts(data['STREETPRO'])

    street_tokenized = tokenizer.texts_to_sequences(data['STREETPRO'])

    # Pad/truncate to exactly one id per street
    data['STREETW'] = tf.keras.preprocessing.sequence.pad_sequences(street_tokenized, maxlen=1)

    return data, tokenizer
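A usage sketch (the dataframe names are hypothetical): the tokenizer is fitted once on the training data and then passed back in, so validation and test data are mapped to the same ids:

# Hypothetical call pattern: fit on the training frame, reuse elsewhere
train_df, street_tokenizer = __preProcessStreet(train_df)
val_df, _ = __preProcessStreet(val_df, tokenizer=street_tokenizer)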

As you can see, I did the preprocessing steps directly on the loaded Pandas dataframe. Afterwards I processed this new column with the help of the feature_columns mentioned above:

def __getFutureColumnStreet(street_num_words):
    # One identity slot per token id produced by the tokenizer
    street_voc = tf.feature_column.categorical_column_with_identity(
        key='STREETW', num_buckets=street_num_words)

    dim = __getNumberOfDimensions(street_num_words)

    street_embedding = tf.feature_column.embedding_column(street_voc, dimension=dim)

    return street_embedding

Feature “NAME1”

The preprocessing steps for the NAME1 column are quite similar, except that I split the NAME1 field into two separate fields, “NAME1W1” and “NAME1W2”, which contain the two most common words of the vocabulary:

import pandas as pd

def __preProcessName(data, tokenizer=None):
    # Remove suffixes such as "(asg)" before tokenizing
    data['NAME1PRO'] = data['NAME1'].apply(
        lambda x: __getNormalizedString(x, ["(asg)", "asg", "(poasg)", "poasg"]))

    # Fit the tokenizer only when none is passed in (i.e. on training data)
    if tokenizer is None:
        tokenizer = Tokenizer()
        tokenizer.fit_on_texts(data['NAME1PRO'])

    name1_tokenized = tokenizer.texts_to_sequences(data['NAME1PRO'])

    # Keep the last two token ids per name
    name1_tokenized_pad = tf.keras.preprocessing.sequence.pad_sequences(
        name1_tokenized, maxlen=2, truncating='pre')

    # Attach the two ids as separate columns
    data = pd.concat([data, pd.DataFrame(name1_tokenized_pad, columns=['NAME1W1', 'NAME1W2'])], axis=1)

    return data, tokenizer

Afterwards I also used feature_columns for the word embeddings:

def __getFutureColumnsName(name_num_words):
    namew1_voc = tf.feature_column.categorical_column_with_identity(
        key='NAME1W1', num_buckets=name_num_words)
    namew2_voc = tf.feature_column.categorical_column_with_identity(
        key='NAME1W2', num_buckets=name_num_words)

    dim = __getNumberOfDimensions(name_num_words)

    namew1_embedding = tf.feature_column.embedding_column(namew1_voc, dimension=dim)
    namew2_embedding = tf.feature_column.embedding_column(namew2_voc, dimension=dim)

    return (namew1_embedding, namew2_embedding)
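__getNumberOfDimensions is not shown in the post; a plausible sketch, assuming the common fourth-root rule of thumb for embedding sizes:

import math

def __getNumberOfDimensions(num_words):
    # Hypothetical implementation: embedding dimension as roughly the
    # fourth root of the vocabulary size, with a small lower bound
    return max(2, math.ceil(num_words ** 0.25))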

Model

I am using the Functional API of TensorFlow for constructing my model:

                print("start preprocessing...")
                feature_columns = feature_selection.getFutureColumns(data, args.zip, args.sc, bucketSizeGEO, False)
                feature_layer = tf.keras.layers.DenseFeatures(feature_columns, trainable=True)
                print("preprocessing completed")

…                

                            print("Step {}/{}".format(currentStep, stepNum))

                            feature_layer_inputs = feature_selection.getFeatureLayerInputs()
                            new_layer = feature_layer(feature_layer_inputs)
                            

                            for _ in range(numLayers):
                                new_layer = tf.keras.layers.Dense(numNodes, activation=tf.nn.swish, kernel_regularizer=regularizers.l2(reg), bias_regularizer=regularizers.l2(reg))(new_layer)
                                new_layer = tf.keras.layers.Dropout(dropRate)(new_layer) 

                            output_layer = tf.keras.layers.Dense(1, activation=tf.nn.sigmoid, kernel_regularizer=regularizers.l2(reg), bias_regularizer=regularizers.l2(reg))(new_layer)

                            model = tf.keras.Model(inputs=[v for v in feature_layer_inputs.values()], outputs=output_layer)

                            model.compile(optimizer=opt,
                                loss='binary_crossentropy',
                                metrics=['accuracy'])

                            paramString = "Arg-e{}-b{}-l{}-n{}-o{}-z{}-r{}-d{}".format(args.epoch, args.batchSize, numLayers, numNodes, opt, bucketSizeGEO, reg, dropRate)

                            log_dir = "logs\\neural\\" + paramString + "\\" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
                            tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1)

                            print("Start training with the following parameters:", paramString)

                            model.fit(train_ds,
                                    validation_data=val_ds,
                                    epochs=args.epoch,
                                    callbacks=[tensorboard_callback])
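Not shown above is the export step; to be loadable by TensorFlow Serving, the trained model presumably gets written out as a SavedModel with a version subdirectory, a sketch:

# Hypothetical export path: TensorFlow Serving expects directories of
# the form <model_base_path>/<version>/
model.save("export/my_model/1", save_format="tf")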

TensorFlow Serving

Logically, the two preprocessing steps involving the Tokenizer are not part of the model and therefore cannot be executed during serving, so a POST command to the model server currently looks like this (on Windows):

curl -d "{"""instances""": [{"""NAME1W1""": [12], """NAME1W2""": [2032], """ZIP""": [""1120""], """STREETW""": [1180], """LONGITUDE""": 16.47, """LATITUDE""": 48.22, """AVIS_TYPE""": [""E""],"""ASG""": [0], """SC""": [""101""], """PREDICT""": [0]}]}" -X POST http://localhost:8501/v1/models/my_model:predict

So at the moment I am trying to find a way to include these two preprocessing steps inside my model, so that the POST command would look like this:

curl -d "{"""instances""": [{"""NAME1""": [""Max Mustermann""], """ZIP""": [""1120""], """STREET""": [""Teststraße""], """LONGITUDE""": 16.47, """LATITUDE""": 48.22, """AVIS_TYPE""": [""E""],"""ASG""": [0], """SC""": [""101""], """PREDICT""": [0]}]}" -X POST http://localhost:8501/v1/models/my_model:predict

but with the same pre-processing steps inside the model.

I tried to use map functions on the datasets or preprocessing layers, but without success, because I am not sure whether I can use a combination of them with the feature_columns. I also tried something similar to what is mentioned here: https://keras.io/examples/structured_data/structured_data_classification_from_scratch/
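For context, a minimal sketch of what such a preprocessing layer looks like on its own (names are illustrative; before TF 2.6 the layer lives under tf.keras.layers.experimental.preprocessing):

import tensorflow as tf

# Illustrative sketch: TextVectorization moves the tokenization into the
# graph, so it ships with the SavedModel instead of the offline Tokenizer.
street_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=street_num_words, output_mode='int', output_sequence_length=1)
street_vectorizer.adapt(train_df['STREETPRO'].values)  # learn the vocabulary

street_input = tf.keras.Input(shape=(1,), dtype=tf.string, name='STREET')
street_ids = street_vectorizer(street_input)  # integer ids, inside the model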

I think the TFX Transform component is what you need. It will not be part of your model, but part of your pipeline. That way, you can easily modify the preprocessing transformations you want in the future without affecting the model.

The main function of that component is preprocessing_fn; this will be the series of transformations you want to apply to the inputs. The TensorFlow guide provides a much better explanation and a tutorial for you to try.
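A minimal sketch of what such a preprocessing_fn could look like for the STREET feature (illustrative only; the string normalization here is a placeholder and would have to replicate the suffix stripping with tf.strings ops):

import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
    # Hypothetical sketch: map the raw STREET string to an integer id
    # inside the graph instead of with an offline Tokenizer.
    outputs = dict(inputs)
    street = tf.strings.lower(inputs['STREET'])  # placeholder normalization
    outputs['STREETW'] = tft.compute_and_apply_vocabulary(
        street, vocab_filename='street_vocab')
    return outputs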

