
How to adapt TextVectorization layer on tf.Dataset

I load my dataset like this:

self.train_ds = tf.data.experimental.make_csv_dataset(
            self.config["input_paths"]["data"]["train"],
            batch_size=self.params["batch_size"],
            shuffle=False,
            label_name="tags",
            num_epochs=1,
        )

My TextVectorization layer looks like this:

vectorizer = tf.keras.layers.TextVectorization(
            standardize=code_standaridization,
            split="whitespace",
            output_mode="int",
            output_sequence_length=params["input_dim"],
            max_tokens=100_000,
        )

And I thought this would be enough:

vectorizer.adapt(data_provider.train_ds)

But it's not enough; I get this error:

TypeError: Expected string, but got Tensor("IteratorGetNext:0", shape=(None, None), dtype=string) of type 'Tensor'.

Can I somehow adapt my vectorizer on a TensorFlow dataset?

Most probably the issue is that you use batch_size in your train_ds without calling .unbatch() when you try to adapt.

You have to do:

vectorizer.adapt(train_ds.unbatch().map(lambda x, y: x).batch(BATCH_SIZE))

The .unbatch() solves the error that you are currently seeing, and the .map() is needed because the TextVectorization layer operates on batches of strings, so you need to extract just the text from your (features, label) dataset before re-batching.
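
For reference, here is a minimal, self-contained sketch of that pattern on a toy in-memory dataset (not your CSV); the texts, labels and BATCH_SIZE below are made up for illustration. Note that with make_csv_dataset the features arrive as a dict keyed by column name, so in your case the .map() would typically select the text column, e.g. lambda features, label: features["code"], where "code" stands in for whatever your text column is actually called.

import tensorflow as tf

BATCH_SIZE = 2  # illustrative batch size

# Toy batched dataset of (text, label) pairs, standing in for train_ds.
texts = ["def foo(): pass", "x = 1 + 2", "print('hi')", "import os"]
labels = [0, 1, 0, 1]
train_ds = tf.data.Dataset.from_tensor_slices((texts, labels)).batch(BATCH_SIZE)

vectorizer = tf.keras.layers.TextVectorization(
    split="whitespace",
    output_mode="int",
    output_sequence_length=8,
    max_tokens=100_000,
)

# Unbatch, keep only the text, and re-batch so adapt() sees batches of strings.
text_only_ds = train_ds.unbatch().map(lambda x, y: x).batch(BATCH_SIZE)
vectorizer.adapt(text_only_ds)

print(vectorizer.get_vocabulary()[:10])  # inspect the learned vocabulary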
