
How do I preprocess and tokenize a TensorFlow CsvDataset inside the map method?

I made a TensorFlow CsvDataset, and I'm trying to tokenize the data as follows:

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
from tensorflow import keras
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
os.chdir('/home/nicolas/Documents/Datasets')

fname = 'rotten_tomatoes_reviews.csv'


def preprocess(target, inputs):
    tok = Tokenizer(num_words=5_000, lower=True)
    tok.fit_on_texts(inputs)
    vectors = tok.texts_to_sequences(inputs)
    return vectors, target


dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True).map(preprocess)

Running this gives the following error:

ValueError: len requires a non-scalar tensor, got one of shape Tensor("Shape:0", shape=(0,), dtype=int32)

What I've tried: just about anything in the realm of possibilities. Note that everything runs if I remove the preprocessing step.

What the data looks like:

(<tf.Tensor: shape=(), dtype=int32, numpy=1>,
 <tf.Tensor: shape=(), dtype=string, numpy=b" Some movie critic review...">)

First of all, let's identify the problems in your code:

  • The first problem, which is also the cause of the error you got, is that the fit_on_texts method accepts a list of texts, not a single text string. Therefore, it should be: tok.fit_on_texts([inputs]).

  • After fixing that and running the code again, you would get another error: AttributeError: 'Tensor' object has no attribute 'lower'. This is because the elements in the dataset are Tensor objects, and any function passed to map must be able to handle them; however, the Tokenizer class is not designed to handle Tensor objects (there is a fix for this problem, but I won't address it here because of the next problem).

  • The biggest problem is that each time the map function, i.e. preprocess, is called, a new instance of the Tokenizer class is created and fit on a single text document. Update: as @Princy correctly pointed out in the comments section, the fit_on_texts method actually performs a partial fit (i.e. it updates or augments the internal vocabulary stats instead of starting from scratch). So if we create the Tokenizer instance outside the preprocess function, and assuming the vocabulary set is known beforehand (otherwise, you can't filter the most frequent words in a partial-fit scheme unless you have or build the vocabulary set first), then it would be possible to use this Tokenizer-based approach after applying the above fixes as well; a sketch follows right after this list. However, personally, I prefer the solution further below.
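
For completeness, here is a minimal sketch of what that fixed Tokenizer-based route might look like. This is only an illustration under the assumptions above: instead of a pre-built vocabulary, the tokenizer is fit on the whole text column beforehand, fname points to the same CSV file as in the question, and the names raw_dataset, keras_encode and keras_encode_pyfn are placeholders introduced here, not part of the original code:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

raw_dataset = tf.data.experimental.CsvDataset(
    filenames=fname, record_defaults=[tf.int32, tf.string], header=True)

# Fit the tokenizer ONCE, outside of `map`, on all the review texts.
tok = Tokenizer(num_words=5_000, lower=True)
tok.fit_on_texts(text.numpy().decode("utf-8") for _, text in raw_dataset)

def keras_encode(target, text):
    # texts_to_sequences expects a list of strings, hence the wrapping list.
    vector = tok.texts_to_sequences([text.numpy().decode("utf-8")])[0]
    return vector, target

def keras_encode_pyfn(target, text):
    # Wrap in tf.py_function because the tokenizer works on Python strings,
    # not on symbolic Tensors.
    vector, target = tf.py_function(
        keras_encode, inp=[target, text], Tout=(tf.int32, tf.int32))
    vector.set_shape([None])
    target.set_shape([])
    return vector, target

keras_dataset = raw_dataset.map(keras_encode_pyfn)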


So, what should we do? As mentioned above, in almost all models that deal with text data, we first need to convert the texts into numerical features, i.e. encode them. To perform the encoding, we first need a vocabulary set or a dictionary of tokens. Therefore, the steps we should take are as follows:

  1. If there is a pre-built vocabulary available, then skip to the next step. Otherwise, tokenize all the text data first and build the vocabulary.

  2. Encode the text data using the vocabulary set.

To perform the first step, we use tfds.features.text.Tokenizer to tokenize the text data and build the vocabulary by iterating over the dataset.

For the second step, we use tfds.features.text.TokenTextEncoder to encode the text data with the vocabulary set built in the previous step. Note that for this step we use the map method; however, since map works only in graph mode, we wrap our encode function in tf.py_function so that it can be used with map.

Here is the code (please read the comments in the code for additional points; I have not included them in the answer text because they are not directly relevant, but they are useful and practical):

import tensorflow as tf
import tensorflow_datasets as tfds
from collections import Counter

fname = "rotten_tomatoes_reviews.csv"
dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True)

# Create a tokenizer instance to tokenize text data.
tokenizer = tfds.features.text.Tokenizer()

# Find unique tokens in the dataset.
lowercase = True  # set this to `False` if case-sensitivity is important.
vocabulary = Counter()
for _, text in dataset:
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

# Select the most common tokens as final vocabulary set.
# Note: if you want all the tokens to be included,
# set `vocab_size = len(vocabulary)` instead.
vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))

# Create an encoder instance given our vocabulary set.
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=lowercase,
                                              tokenizer=tokenizer)

# Set this to a non-zero integer if you want the texts
# to be truncated when they have more than `max_len` tokens.
max_len = None

def encode(target, text):
    text_encoded = encoder.encode(text.numpy())
    if max_len:
        text_encoded = text_encoded[:max_len]
    return text_encoded, target

# Wrap `encode` function inside `tf.py_function` so that
# it could be used with `map` method.
def encode_pyfn(target, text):
    text_encoded, target = tf.py_function(encode,
                                          inp=[target, text],
                                          Tout=(tf.int32, tf.int32))
    
    # (optional) Set the shapes for efficiency.
    text_encoded.set_shape([None])
    target.set_shape([])

    return text_encoded, target

# Apply encoding and then padding.
# Note: if you want the sequences in all the batches 
# to have the same length, set `padded_shapes` argument accordingly.
dataset = dataset.map(encode_pyfn).padded_batch(batch_size=3,
                                                padded_shapes=([None,], []))

# Important Note: probably this dataset would be used as input to a model
# which uses an Embedding layer. Therefore, don't forget that you
# should set the vocabulary size for this layer properly, i.e. the
# current value of `vocab_size` does not include the padding (added
# by `padded_batch` method) and also the OOV token (added by encoder).

Side note for future readers: the order of the arguments, i.e. target, text, and the data types are based on the OP's dataset. Adapt them as needed for your own dataset/task (although at the end, i.e. return text_encoded, target, we adjusted the order to make it compatible with the format expected by the fit method).
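
To illustrate the important note at the end of the code above, here is a minimal, hypothetical model sketch that consumes this dataset; the architecture, loss and number of epochs are arbitrary placeholders and not part of the original answer. The key point is that the Embedding layer's input dimension is vocab_size + 2, covering the padding index and the OOV token:

model = tf.keras.Sequential([
    # `vocab_size + 2` accounts for the padding index (0) added by
    # `padded_batch` and the OOV token added by the encoder.
    tf.keras.layers.Embedding(input_dim=vocab_size + 2, output_dim=64, mask_zero=True),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Each element of `dataset` is already a (text_encoded, target) batch,
# so it can be passed directly to `fit`.
model.fit(dataset, epochs=3)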
