Saving a TensorFlow model that uses a StringLookup layer with an encoded vocabulary

Question

I'm having trouble saving a trained TensorFlow model that contains a StringLookup layer, and I'm required to use TFRecords as the training input. A minimal example to reproduce the issue:

First I define the training data:

import numpy as np
import tensorflow as tf

vocabulary = [str(i) for i in range(100, 200)]
X_train = np.random.choice(vocabulary, size=(100,))
y_train = np.random.choice([0,1], size=(100,))

I save it to a file as TFRecords:

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def _string_feature(value):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[str(value).encode('utf-8')]))

with tf.io.TFRecordWriter('train.tfrecords') as writer:
    for i in range(len(X_train)):
        example = tf.train.Example(features=tf.train.Features(feature={
            'user_id': _string_feature(X_train[i]),
            'label': _int64_feature(y_train[i])
        }))
        writer.write(example.SerializeToString())

Then I use the tf.data API to stream the data into training (the original data doesn't fit into memory):

data = tf.data.TFRecordDataset(['train.tfrecords'])
features = {
    'user_id': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64)
} 
def parse(record):
    parsed = tf.io.parse_single_example(record, features)
    return (parsed['user_id'], parsed['label'])
data = data.map(parse)

The data looks like this:

print(list(data.take(5).as_numpy_iterator()))
[(b'166', 1), (b'144', 0), (b'148', 1), (b'180', 0), (b'192', 0)]

The strings of the original dataset were converted to bytes in the process. I have to pass this new vocabulary to the StringLookup constructor, since passing strings and training with bytes throws an error:

new_vocab = [w.encode('utf-8') for w in vocabulary]

inp = tf.keras.Input(shape=(1,), dtype=tf.string)
x = tf.keras.layers.StringLookup(vocabulary=new_vocab)(inp)
x = tf.keras.layers.Embedding(len(new_vocab)+1, 32)(x)
out = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=[inp], outputs=[out])

model.compile(optimizer='adam', loss='BinaryCrossentropy')
model.fit(data.batch(10), epochs=5)

But when I try to save the model, I get an error because the vocabulary passed to the StringLookup layer is encoded as bytes and can't be dumped into JSON:

model.save('model/')
TypeError: ('Not JSON Serializable:', b'100')

I really don't know what to do. I read that TensorFlow recommends using encoded strings instead of normal strings, but that doesn't allow saving the model. I also tried to preprocess the data by decoding the strings before they are fed to the model, but I wasn't able to do it using only tf.data operations, without loading all the data into memory.

Answer 1

Using your data and original vocabulary:

import tensorflow as tf
import numpy as np

vocabulary = [str(i) for i in range(100, 200)]
X_train = np.random.choice(vocabulary, size=(100,))
y_train = np.random.choice([0,1], size=(100,))
...
...
data = data.map(parse)

I ran your code (with an additional Flatten layer) and was able to save your model:

inp = tf.keras.Input(shape=(1,), dtype=tf.string)
x = tf.keras.layers.StringLookup(vocabulary=vocabulary)(inp)
x = tf.keras.layers.Embedding(len(vocabulary)+1, 32)(x)
x = tf.keras.layers.Flatten()(x)
out = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=[inp], outputs=[out])

model.compile(optimizer='adam', loss='BinaryCrossentropy')
model.fit(data.batch(10), epochs=5)
model.save('model/')

Epoch 1/5
10/10 [==============================] - 1s 8ms/step - loss: 0.6949
Epoch 2/5
10/10 [==============================] - 0s 4ms/step - loss: 0.6864
Epoch 3/5
10/10 [==============================] - 0s 5ms/step - loss: 0.6787
Epoch 4/5
10/10 [==============================] - 0s 5ms/step - loss: 0.6707
Epoch 5/5
10/10 [==============================] - 0s 5ms/step - loss: 0.6620
INFO:tensorflow:Assets written to: model/assets

I do not see why you need new_vocab = [w.encode('utf-8') for w in vocabulary].
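As a quick check of this, here is a minimal sketch. It relies on the fact that tf.string tensors hold byte strings regardless of how the Python values were written, so a StringLookup built from the plain str vocabulary should resolve the byte values coming out of the TFRecords. The sample batch, including the out-of-vocabulary value b'999', is made up for illustration, and the layer's default settings are assumed (one OOV bucket at index 0, no mask token).

```python
import tensorflow as tf

# Plain Python str vocabulary, as in the question.
vocabulary = [str(i) for i in range(100, 200)]

# Build the layer from str values; no manual .encode('utf-8') step.
lookup = tf.keras.layers.StringLookup(vocabulary=vocabulary)

# Byte strings, shaped like what the parsed TFRecords yield.
# b'999' is a made-up out-of-vocabulary value.
batch = tf.constant([b'166', b'144', b'999'])

# With the default settings, index 0 is reserved for OOV values
# and the vocabulary entries start at index 1.
ids = lookup(batch).numpy().tolist()
print(ids)  # [67, 45, 0]
```

If the indices come back as expected, the str vocabulary can be passed to the layer directly and the layer config stays free of bytes objects.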

If you really need to use new_vocab, you can try setting it during training and then setting vocabulary afterwards before saving your model, since the only difference is the encoding:

new_vocab = [w.encode('utf-8') for w in vocabulary]

lookup_layer = tf.keras.layers.StringLookup()
lookup_layer.adapt(new_vocab)
inp = tf.keras.Input(shape=(1,), dtype=tf.string)
x = lookup_layer(inp)
x = tf.keras.layers.Embedding(len(new_vocab)+1, 32)(x)
x = tf.keras.layers.Flatten()(x)
out = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs=[inp], outputs=[out])

model.compile(optimizer='adam', loss='BinaryCrossentropy')
model.fit(data.batch(10), epochs=5)
model.layers[1].adapt(vocabulary)

model.save('model/')

Admittedly, this is quite hacky.
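For context on the original TypeError, here is a small standard-library sketch of the difference between the two vocabularies. It assumes, as the error message in the question suggests, that the layer config is passed through JSON serialization on save. The helper is_json_serializable is a hypothetical name introduced only for this illustration.

```python
import json

vocabulary = [str(i) for i in range(100, 200)]
new_vocab = [w.encode('utf-8') for w in vocabulary]

def is_json_serializable(value):
    """Return True if json.dumps accepts the value, False on TypeError."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False

# A list of str serializes; a list of bytes does not.
str_ok = is_json_serializable(vocabulary)
bytes_ok = is_json_serializable(new_vocab)

# Decoding the bytes back to str restores serializability.
decoded_ok = is_json_serializable([w.decode('utf-8') for w in new_vocab])

print(str_ok, bytes_ok, decoded_ok)  # True False True
```

This is why keeping the vocabulary as plain str values, as in the first snippet of this answer, avoids the error.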

Answer 1: score 2, accepted, 2022-02-04 16:51:45
